saleyn / erlexec

Execute and control OS processes from Erlang/OTP
https://hexdocs.pm/erlexec/readme.html

custom kill command is ignored when cleaning up on beam termination #138

Closed sasa1977 closed 3 years ago

sasa1977 commented 3 years ago

I'm trying to start a docker container via erlexec, and I'd like to ensure proper cleanup of the container. To make this work, I'm playing with the following approach:

  1. Use some predefined container name
  2. Provide a kill cmd which forcefully kills the container via its name

Here's an elixir example:

:exec.run_link(
  "docker run --rm -t --name foo busybox",
  kill: "docker rm -f foo"
)

If I stop such a process with :exec.stop/1, the container is correctly removed. However, if the BEAM is terminated forcefully (e.g. by hitting ctrl-c twice), the container remains alive. This leads me to conclude that the custom kill command is ignored when erlexec performs post-beam cleanup. Is this assumption correct?

saleyn commented 3 years ago

The design of erlexec was implying that if the beam is killed, then the exec-port process would detect the pipe closing, and would kill all of its jobs. The last time I tested this on Linux a few years back it worked without issues. If it doesn't work with docker for some reason you'd need to troubleshoot (setting the debug and verbose modes, and trying to see where the issue is). I don't think it's related to the custom kill command.

sasa1977 commented 3 years ago

> The design of erlexec was implying that if the beam is killed, then the exec-port process would detect the pipe closing, and would kill all of its jobs.

This definitely works. I relied on it in the past, and I also double checked that it still works today.

> I don't think it's related to the custom kill command.

My impression is that the custom kill command is not executed if the beam is killed. Here's a simple demo in Elixir:

{:ok, pid, _ospid} = :exec.run(
  """
  function cleanup {
    echo "got exit signal" >> debug.txt
  }

  trap cleanup 0
  sleep infinity
  """,
  kill: ~s/echo "custom kill command" >> debug.txt/
)

If I invoke this in the iex shell and then invoke :exec.stop(pid), debug.txt will contain "custom kill command". However, if I start the program and kill the shell by hitting ctrl+c twice, then debug.txt will contain only "got exit signal" (make sure to remove the file before repeating the experiment).

This indicates that erlexec ignores the custom kill command when cleaning up after beam is killed, and instead always sends an exit signal.

saleyn commented 3 years ago

The custom kill command is only executed when SIGTERM fails to terminate the child process; only in that case will exec-port run it. In your example above the log contains "got exit signal", which indicates that the child process was killed by the signal, so there was no need to execute the custom kill command.

sasa1977 commented 3 years ago

I see. My question is then how can I ensure that the custom kill command is immediately executed in the following situations:

  1. When the beam stops (normally or abnormally)
  2. When the linked process crashes
  3. When manually stopping the OS process with stop (from what I can tell this already works today)

I tried playing with kill_timeout, but that doesn't seem to do the trick. I basically need some way of instructing erlexec to immediately move on to the custom command without even trying with SIGTERM. Or perhaps, alternatively, I'd like to provide a custom terminate (not kill) command which would be used instead of SIGTERM. Is any of that possible today?

saleyn commented 3 years ago

In the current implementation the kill command is the fallback for the failed SIGTERM. It seems to me that you can solve your problem simply by writing a wrapper script for your child task, which traps SIGTERM, and executes the custom command you need.
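A minimal sketch of what such a wrapper could look like, assuming a POSIX shell. It is not taken from erlexec or from either participant's code; the function name `run_with_cleanup` and its argument order are made up for illustration. It starts the real command in the background, and when the wrapper receives SIGTERM it runs the cleanup command first and then stops the child.

```shell
#!/bin/sh
# Hypothetical wrapper (illustration only, not part of erlexec).
# Usage: run_with_cleanup "<cleanup command>" <command> [args...]
run_with_cleanup() {
  cleanup_cmd=$1
  shift
  child=

  on_term() {
    # Run the custom cleanup command first, then stop the child if it is still there.
    sh -c "$cleanup_cmd"
    if [ -n "$child" ]; then
      kill "$child" 2>/dev/null || true
    fi
    exit 143   # 128 + 15, the conventional exit status for SIGTERM
  }

  # Install the handler before starting the child so an early SIGTERM is not missed.
  trap on_term TERM

  "$@" &
  child=$!

  # wait is interrupted when the trapped signal arrives, which lets the handler run.
  wait "$child"
}

# Example call using the names from this thread, left commented out:
# run_with_cleanup "docker rm -f foo" docker run --rm --name foo busybox
```

The child runs in the background because a shell only runs a trap handler once the foreground command has finished; `wait` returns as soon as the signal arrives. With a script like this as the command passed to erlexec, the SIGTERM that exec-port sends on cleanup would trigger the custom command without relying on the kill option.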

You are also welcome to submit a patch that adds an option to force the kill command without SIGTERM.

sasa1977 commented 3 years ago

> It seems to me that you can solve your problem simply by writing a wrapper script for your child task, which traps SIGTERM, and executes the custom command you need.

Yeah this is the approach I'm currently using, although it feels slightly hacky.

I'll see if I can submit a patch, but I need to dust off my C++ first, so I can't commit to any deadline :-)

saleyn commented 3 years ago

I believe that the latest commit fixed this issue.

sasa1977 commented 3 years ago

Cool! I forgot about this issue, b/c I went down a different path and rolled my own small port command wrapper. So regardless of whether this is solved or not, feel free to close this issue, b/c I personally won't find the time to work on it.