Do you think we should try to kill the worker (and also check that it hasn't finished) before officially calling the cancel?
@jeremyowensboggs once the worker finishes, it should send a `job_ref` or `:DOWN` message to `Server`, since it's trapping exits. What do you think about something like this?
receive do
  {^job_ref, _} -> Worker.ack_job(state, :ok)
  {:DOWN, _, :process, _, reason} -> Worker.ack_job(state, {:error, reason})
after
  0 ->
    # not sure if it's worth having a timeout or using `:brutal_kill` here?
    case Task.shutdown(task, 1_000) do
      {:ok, _} -> Worker.ack_job(state, :ok)
      _ -> Worker.ack_job(state, {:error, "Worker Terminated"})
    end
end
Task.shutdown will check for a reply for you while shutting the task down. (I can't find the example I have in mind ... looking)
This is close to the example I was thinking of - obviously, no need to yield in this case https://github.com/elixir-lang/elixir/blob/main/lib/mix/lib/mix/utils.ex#L583
case Task.yield(task, timeout) || Task.shutdown(task, :brutal_kill) do
  {:ok, result} -> result
  _ -> {:remote, "request timed out after #{timeout}ms"}
end
Which is pretty much what you have, just no need to do the top `receive`.
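To make that concrete, here's a rough sketch of the same flow without the `receive`, reusing `task`, `state`, and `Worker.ack_job/2` from the snippet above (the 1_000 ms timeout is just a placeholder):

case Task.yield(task, 1_000) || Task.shutdown(task, :brutal_kill) do
  # task replied, either before the yield timed out or while it was being shut down
  {:ok, _result} -> Worker.ack_job(state, :ok)
  # task exited on its own with an abnormal reason
  {:exit, reason} -> Worker.ack_job(state, {:error, reason})
  # no reply at all: the task was brutally killed
  nil -> Worker.ack_job(state, {:error, "Worker Terminated"})
end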
Whenever a worker is terminated while running a job (when the application itself is terminated via SIGTERM, for example), the job is left in a pending state until `reserve_timeout` is met. The job should instead be ack'd as failed so that another worker can pick it back up, since we know it's never going to complete.

This should fix the issue in continuous deployment environments where a deployment is terminated and replaced by a newer instance, leaving any in-progress jobs hanging.
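Just to illustrate the idea (not the actual diff): a minimal sketch of a server that traps exits, runs the job as a supervised task, and acks the job as failed on shutdown. `Worker.ack_job/2`, `Worker.run/1`, `TaskSupervisor`, and the state shape are hypothetical stand-ins for the real names:

defmodule Server do
  use GenServer

  def init(state) do
    # Trap exits so terminate/2 is invoked when the application shuts down (e.g. SIGTERM).
    Process.flag(:trap_exit, true)
    {:ok, Map.put(state, :task, nil)}
  end

  def handle_cast({:run_job, job}, state) do
    # Run the job without linking its failure to this server.
    task = Task.Supervisor.async_nolink(TaskSupervisor, fn -> Worker.run(job) end)
    {:noreply, %{state | task: task}}
  end

  # Task finished normally: ack the job as successful.
  def handle_info({ref, _result}, %{task: %Task{ref: ref}} = state) do
    Process.demonitor(ref, [:flush])
    Worker.ack_job(state, :ok)
    {:noreply, %{state | task: nil}}
  end

  # Task crashed: ack as failed so another worker can pick the job back up.
  def handle_info({:DOWN, ref, :process, _pid, reason}, %{task: %Task{ref: ref}} = state) do
    Worker.ack_job(state, {:error, reason})
    {:noreply, %{state | task: nil}}
  end

  # Called on shutdown because exits are trapped: instead of leaving the job
  # pending until reserve_timeout, shut the task down and ack it as failed.
  def terminate(_reason, %{task: %Task{} = task} = state) do
    case Task.shutdown(task, :brutal_kill) do
      {:ok, _result} -> Worker.ack_job(state, :ok)
      _ -> Worker.ack_job(state, {:error, "Worker Terminated"})
    end
  end

  def terminate(_reason, _state), do: :ok
end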