Do you think we should try to kill the worker (and also check that it hasn't finished) before officially calling the cancel?
@jeremyowensboggs once the worker finishes, it should send a `job_ref` or `:DOWN` message to `Server`, since it's trapping exits. What do you think about something like this?
receive do
  {^job_ref, _} -> Worker.ack_job(state, :ok)
  {:DOWN, _, :process, _, reason} -> Worker.ack_job(state, {:error, reason})
after
  0 ->
    # not sure if it's worth having a timeout or using `:brutal_kill` here?
    case Task.shutdown(task, 1_000) do
      {:ok, _} -> Worker.ack_job(state, :ok)
      _ -> Worker.ack_job(state, {:error, "Worker Terminated"})
    end
end
Task.shutdown will check for a reply for you while shutting the task down. (I can't find the example I have in mind ... looking)
This is close to the example I was thinking of - obviously, no need to yield in this case https://github.com/elixir-lang/elixir/blob/main/lib/mix/lib/mix/utils.ex#L583
case Task.yield(task, timeout) || Task.shutdown(task, :brutal_kill) do
  {:ok, result} -> result
  _ -> {:remote, "request timed out after #{timeout}ms"}
end
Which is pretty much what you have, just no need to do the top `receive`.
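To make that concrete, here's a rough sketch of the same flow without the `receive`, reusing `task`, `state`, and `Worker.ack_job/2` from the snippet above (the 1_000 ms timeout is just a placeholder):

case Task.yield(task, 1_000) || Task.shutdown(task, :brutal_kill) do
  # task replied, either before the yield timed out or while it was being shut down
  {:ok, _result} -> Worker.ack_job(state, :ok)
  # task exited on its own with an abnormal reason
  {:exit, reason} -> Worker.ack_job(state, {:error, reason})
  # no reply at all: the task was brutally killed
  nil -> Worker.ack_job(state, {:error, "Worker Terminated"})
end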
Whenever a worker is terminated while running a job (when the application itself is terminated via SIGTERM, for example), the job is left in a pending state until `reserve_timeout` is met. The job should instead be ack'd as failed so that another worker can pick it back up, since we know it's never going to complete.

This should fix the issue in continuous deployment environments where a deployment is terminated and replaced by a newer instance, leaving any in-progress jobs hanging.
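Just to illustrate the idea (not the actual diff): a minimal sketch of a server that traps exits, runs the job as a supervised task, and acks the job as failed on shutdown. `Worker.ack_job/2`, `Worker.run/1`, `TaskSupervisor`, and the state shape are hypothetical stand-ins for the real names:

defmodule Server do
  use GenServer

  def init(state) do
    # Trap exits so terminate/2 is invoked when the application shuts down (e.g. SIGTERM).
    Process.flag(:trap_exit, true)
    {:ok, Map.put(state, :task, nil)}
  end

  def handle_cast({:run_job, job}, state) do
    # Run the job without linking its failure to this server.
    task = Task.Supervisor.async_nolink(TaskSupervisor, fn -> Worker.run(job) end)
    {:noreply, %{state | task: task}}
  end

  # Task finished normally: ack the job as successful.
  def handle_info({ref, _result}, %{task: %Task{ref: ref}} = state) do
    Process.demonitor(ref, [:flush])
    Worker.ack_job(state, :ok)
    {:noreply, %{state | task: nil}}
  end

  # Task crashed: ack as failed so another worker can pick the job back up.
  def handle_info({:DOWN, ref, :process, _pid, reason}, %{task: %Task{ref: ref}} = state) do
    Worker.ack_job(state, {:error, reason})
    {:noreply, %{state | task: nil}}
  end

  # Called on shutdown because exits are trapped: instead of leaving the job
  # pending until reserve_timeout, shut the task down and ack it as failed.
  def terminate(_reason, %{task: %Task{} = task} = state) do
    case Task.shutdown(task, :brutal_kill) do
      {:ok, _result} -> Worker.ack_job(state, :ok)
      _ -> Worker.ack_job(state, {:error, "Worker Terminated"})
    end
  end

  def terminate(_reason, _state), do: :ok
end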