Job executing but stuck forever

narrowtux commented 6 days ago

Environment

Oban versions:

      {:oban, "~> 2.17"},
      {:oban_pro, "~> 1.4.9", repo: "oban"},
      {:oban_web, "~> 2.10.4", repo: "oban"},

Elixir version: 1.14.4-erlang-25.3-alpine-3.15.7

Postgres version: 12.?

Current Behavior

Sometimes a job starts executing and then never finishes.

Which jobs get stuck this way seems random; it's not always the same worker.

The DynamicLifeline plugin does nothing, I guess because the node and queue the job runs on haven't actually terminated.

I can't say if the process that should run the job is still alive, since I see no way to resolve a job ID to a PID. Maybe I can provide more debug information if I know how.

Example job struct and queue info

%Oban.Job{
  __meta__: #Ecto.Schema.Metadata<:loaded, "oban_jobs">,
  id: 32514724,
  state: "executing",
  queue: "cronjobs",
  worker: "Platform.Jobs.CallMf",
  args: %{
    "function" => "update_cron",
    "module" => "Elixir.Platform.Device.ReadingStats"
  },
  meta: %{
    "cron" => true,
    "cron_expr" => "*/5 * * * *",
    "uniq_key" => 32329759
  },
  tags: [],
  errors: [],
  attempt: 1,
  attempted_by: ["element_iot@f983671573fb",
   "01905988-a2dd-7495-a08c-0a58e3c09f21"],
  max_attempts: 20,
  priority: 0,
  attempted_at: ~U[2024-06-28 07:10:01.035505Z],
  cancelled_at: nil,
  completed_at: nil,
  discarded_at: nil,
  inserted_at: ~U[2024-06-28 07:10:00.994813Z],
  scheduled_at: ~U[2024-06-28 07:10:00.994813Z],
  conf: nil,
  conflict?: false,
  replace: nil,
  unique: nil,
  unsaved_error: nil
}

Oban.check_queue(queue: :cronjobs) returned:

%{
  global_limit: %Oban.Pro.Producer.Meta.GlobalLimit{
    allowed: 2,
    tracked: %{"28812996" => %{"args" => nil, "count" => 1, "worker" => nil}},
    partition: nil
  },
  local_limit: 2,
  name: "Oban",
  node: "element_iot@f983671573fb",
  paused: false,
  queue: "cronjobs",
  rate_limit: nil,
  retry_attempts: 5,
  retry_backoff: 1000,
  running: [32514724],
  started_at: ~U[2024-06-27 11:50:45.213426Z],
  updated_at: ~U[2024-07-02 07:54:04.811037Z],
  uuid: "01905988-a2dd-7495-a08c-0a58e3c09f21"
}

Workaround

Manually identify the jobs that are stuck, cancel them, and then retry them. I see no way to do this automatically, since it's not apparent from the job struct whether a job is stuck.
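One rough way to narrow down candidates for the manual workaround is an age heuristic: treat a job as suspicious when it has been "executing" for much longer than it should ever take. The sketch below is only that, a heuristic under assumptions. `StuckJobs` and the threshold are hypothetical names and values, not part of Oban, and a legitimately long-running job would be flagged too.

```elixir
defmodule StuckJobs do
  # Hypothetical helper; the age threshold is an assumption, not an Oban feature.

  # Keeps the jobs that are in the "executing" state and whose attempt
  # started more than `max_age_seconds` before `now`.
  def stale(jobs, now, max_age_seconds) do
    Enum.filter(jobs, fn job ->
      job.state == "executing" and
        job.attempted_at != nil and
        DateTime.diff(now, job.attempted_at, :second) > max_age_seconds
    end)
  end
end

# Values taken from the job struct and queue info above.
job = %{id: 32514724, state: "executing", attempted_at: ~U[2024-06-28 07:10:01.035505Z]}
now = ~U[2024-07-02 07:54:04.811037Z]

# A five-minute cron job that has been executing for days exceeds a one-hour limit.
StuckJobs.stale([job], now, 3600)
```

The ids of the jobs that come back could then be handed to Oban.cancel_job/1 and Oban.retry_job/1, which mirrors the manual cancel-and-retry steps.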

narrowtux commented 6 days ago

By the way, I can't rule out a user error on my part. If I knew the PID of the job's process, I could look at its stack trace to see whether it's actually still running.

narrowtux commented 5 days ago

Never mind, it's probably our own application code.

For future reference, you can find all processes that are currently running an Oban job by searching for Oban.Queue.Executor in live_dashboard.
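The same kind of search can be approximated from an IEx session on the node, which also gives you the stack trace directly. This is a minimal sketch that assumes the module you search for appears either in the process's recorded initial call or in its current stack trace. `ProcessSearch` is a hypothetical helper, not part of Oban, and it uses only `Process.list/0` and `Process.info/2`.

```elixir
defmodule ProcessSearch do
  # Hypothetical helper for an IEx session; not part of Oban.

  # Returns {pid, stacktrace} for every local process that mentions `module`.
  def mentioning(module) when is_atom(module) do
    Enum.flat_map(Process.list(), fn pid ->
      # Process.info/2 returns nil if the process has exited in the meantime.
      case Process.info(pid, [:dictionary, :current_stacktrace]) do
        nil ->
          []

        info ->
          if mentions?(info, module) do
            [{pid, info[:current_stacktrace]}]
          else
            []
          end
      end
    end)
  end

  defp mentions?(info, module) do
    # Task-style processes record their entry point under :"$initial_call".
    initial_call? =
      match?({_, {^module, _, _}}, List.keyfind(info[:dictionary], :"$initial_call", 0))

    # Each stack entry is {module, function, arity_or_args, location}.
    in_stack? =
      Enum.any?(info[:current_stacktrace], fn entry -> elem(entry, 0) == module end)

    initial_call? or in_stack?
  end
end
```

Calling `ProcessSearch.mentioning(Oban.Queue.Executor)` would then list candidate PIDs together with where each one is currently blocked, which is usually enough to tell whether application code is hanging.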

sorentwo commented 5 days ago

As of the next Oban release, when running on OTP 27/Elixir 1.17, you'll also have process labels that tell you which worker each PID is running.
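Process labels can be read back with `:proc_lib.get_label/1`, which exists from OTP 27 onward. The sketch below lists every local process that has a label. It makes no assumption about what Oban's labels look like, and `ProcessLabels` is a hypothetical helper name. On older OTP releases it simply returns an empty list.

```elixir
defmodule ProcessLabels do
  # Hypothetical helper; not part of Oban.

  # Returns {pid, label} for every local process that has a label set.
  def all do
    if Code.ensure_loaded?(:proc_lib) and function_exported?(:proc_lib, :get_label, 1) do
      Enum.flat_map(Process.list(), fn pid ->
        # apply/3 avoids a compile-time warning on OTP releases before 27.
        case apply(:proc_lib, :get_label, [pid]) do
          :undefined -> []
          label -> [{pid, label}]
        end
      end)
    else
      # Labels are not available before OTP 27.
      []
    end
  end
end
```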

narrowtux commented 5 days ago

Sounds great!
