oban-bg / oban

💎 Robust job processing in Elixir, backed by modern PostgreSQL and SQLite3
https://oban.pro
Apache License 2.0
3.37k stars 314 forks source link

Oban Pro Batch Workers stuck in "executing" forever after upgrade from Aurora PostgreSQL 14 -> 16 #1124

Closed williamge closed 4 months ago

williamge commented 4 months ago

Precheck

Environment

Current Behavior

Inserting a batch worker after upgrading to Aurora PSQL 16.2 causes the job to be stuck in "executing" for a prolonged period of time, never completing or failing, and eventually getting stuck in the "scheduled" state. No errors are recorded for the job, the attempt is shown as attempt 1 of 5.

Additionally, after inserting the batch worker, other jobs (non-batch) in that queue will also get stuck in the "available" state, never actually executing.

When looking at error logs, we see repeated instances of the same error:

GenServer {Oban.Registry, {Oban, {:producer, "default"}}} terminating
** (Postgrex.Error) ERROR 42P18 (indeterminate_datatype) could not determine data type of parameter $7

    query: SELECT $1 FROM "public"."oban_jobs" AS o0 WHERE (o0."meta" ? 'batch_id') AND (o0."meta" ->> 'batch_id' = $2) AND (o0."meta" ->> 'callback' IS NULL) AND (o0."state" = $3) UNION ALL (SELECT $4 FROM "public"."oban_jobs" AS o0 WHERE (o0."meta" ? 'batch_id') AND (o0."meta" ->> 'batch_id' = $5) AND (o0."meta" ->> 'callback' IS NULL) AND (o0."state" = $6) LIMIT 1) UNION ALL (SELECT $7 FROM "public"."oban_jobs" AS o0 WHERE (o0."meta" ? 'batch_id') AND (o0."meta" ->> 'batch_id' = $8) AND (o0."meta" ->> 'callback' IS NULL) AND (o0."state" = $9) LIMIT 1) UNION ALL (SELECT $10 FROM "public"."oban_jobs" AS o0 WHERE (o0."meta" ? 'batch_id') AND (o0."meta" ->> 'batch_id' = $11) AND (o0."meta" ->> 'callback' IS NULL) AND (o0."state" = $12) LIMIT 1) UNION ALL (SELECT $13 FROM "public"."oban_jobs" AS o0 WHERE (o0."meta" ? 'batch_id') AND (o0."meta" ->> 'batch_id' = $14) AND (o0."meta" ->> 'callback' IS NULL) AND (o0."state" = $15) LIMIT 1) UNION ALL (SELECT $16 FROM "public"."oban_jobs" AS o0 WHERE (o0."meta" ? 'batch_id') AND (o0."meta" ->> 'batch_id' = $17) AND (o0."meta" ->> 'callback' IS NULL) AND (o0."state" = $18) LIMIT 1) LIMIT 1
    (ecto_sql 3.11.3) lib/ecto/adapters/sql.ex:1054: Ecto.Adapters.SQL.raise_sql_call_error/1
    (ecto_sql 3.11.3) lib/ecto/adapters/sql.ex:952: Ecto.Adapters.SQL.execute/6
    (ecto 3.11.2) lib/ecto/repo/queryable.ex:232: Ecto.Repo.Queryable.execute/4
    (ecto 3.11.2) lib/ecto/repo/queryable.ex:19: Ecto.Repo.Queryable.all/3
    (oban 2.17.12) lib/oban/repo.ex:156: Oban.Repo.with_dynamic_repo/3
    (oban_pro 1.4.10) lib/oban/pro/batcher.ex:113: anonymous fn/3 in Oban.Pro.Batcher.states_for_callbacks/3
    (ecto_sql 3.11.3) lib/ecto/adapters/sql.ex:1358: anonymous fn/3 in Ecto.Adapters.SQL.checkout_or_transaction/4
    (db_connection 2.6.0) lib/db_connection.ex:1020: DBConnection.transaction/3
    (oban 2.17.12) lib/oban/repo.ex:156: Oban.Repo.with_dynamic_repo/3
    (oban_pro 1.4.10) lib/oban/pro/batcher.ex:64: Oban.Pro.Batcher.handle/3
    (elixir 1.16.1) lib/enum.ex:987: Enum."-each/2-lists^foreach/1-0-"/2
    (oban_pro 1.4.10) lib/oban/pro/engines/smart.ex:861: Oban.Pro.Engines.Smart.run_flush_handlers/2
    (ecto 3.11.2) lib/ecto/multi.ex:883: Ecto.Multi.apply_operation/5
    (elixir 1.16.1) lib/enum.ex:2528: Enum."-reduce/3-lists^foldl/2-0-"/3
    (ecto 3.11.2) lib/ecto/multi.ex:856: anonymous fn/5 in Ecto.Multi.apply_operations/5
    (ecto_sql 3.11.3) lib/ecto/adapters/sql.ex:1358: anonymous fn/3 in Ecto.Adapters.SQL.checkout_or_transaction/4
    (db_connection 2.6.0) lib/db_connection.ex:1710: DBConnection.run_transaction/4
    (ecto 3.11.2) lib/ecto/repo/transaction.ex:18: Ecto.Repo.Transaction.transaction/4
    (oban 2.17.12) lib/oban/repo.ex:156: Oban.Repo.with_dynamic_repo/3
    (oban_pro 1.4.10) lib/oban/pro/engines/smart.ex:536: anonymous fn/4 in Oban.Pro.Engines.Smart.fetch_jobs/3

This is a very similar error as this previous post in the Elixir/Oban forums: https://elixirforum.com/t/error-when-enqueuing-oban-jobs-with-unicity-constraint-on-aurora-postgresql-16-1-postgrex-error-error-42p18-indeterminate-datatype-could-not-determine-data-type-of-parameter-7/63996

We previously experienced that other bug after this 14->16 upgrade, raising the error immediately while inserting those jobs. That issue was resolved after the Oban Pro update to 1.4.10, but this issue has popped up after that update separately, and seems to happen in the background.

Expected Behavior

Inserting a batch worker should work, the worker should execute, the queue it was in should not form a logjam, and no errors from Oban internals should be thrown.

sorentwo commented 4 months ago

Something changed in how PostgreSQL 16.2 treats injected values in a union. This is fixed with explicit casting for all union queries in main, and we'll backport to v1.4

williamge commented 3 months ago

Glad to hear it's been fixed, and likely the whole thing will be improved in 1.5 👍 .

Is there a rough timeline on a 1.4 backport? I've been able to work around this for now, but it's been a tricky issue with batch workers in our dev environment.

sorentwo commented 3 months ago

@williamge We'll have a new v1.5.0 RC and a v1.4 patch released this week.

williamge commented 3 months ago

:) awesome thanks a lot!