rails / solid_queue

Database-backed Active Job backend
MIT License
1.87k stars 120 forks source link

Persistent "Key (job_id) already exists" error when claiming jobs #268

Closed asndiallo closed 2 months ago

asndiallo commented 2 months ago

Description

I'm encountering a persistent error when SolidQueue attempts to claim jobs. The error occurs even after cleaning up stale jobs and trying to create completely different jobs.

Error Message

DETAIL:  Key (job_id)=(<the_job_id>) already exists.

    from /usr/local/bundle/ruby/3.3.0/gems/activerecord-7.1.3.4/lib/active_record/connection_adapters/postgresql_adapter.rb:894:in `block (2 levels) in exec_no_cache'
    from /usr/local/bundle/ruby/3.3.0/gems/activerecord-7.1.3.4/lib/active_record/connection_adapters/abstract_adapter.rb:1028:in `block in with_raw_connection'
        ...
    from /usr/local/bundle/ruby/3.3.0/gems/activerecord-7.1.3.4/lib/active_record/connection_adapters/abstract/database_statements.rb:171:in `exec_insert_all'
    from /usr/local/bundle/ruby/3.3.0/gems/activerecord-7.1.3.4/lib/active_record/connection_adapters/abstract/query_cache.rb:25:in `exec_insert_all'
    from /usr/local/bundle/ruby/3.3.0/gems/activerecord-7.1.3.4/lib/active_record/insert_all.rb:47:in `execute'
    from /usr/local/bundle/ruby/3.3.0/gems/activerecord-7.1.3.4/lib/active_record/persistence.rb:243:in `insert_all!'
    from /usr/local/bundle/ruby/3.3.0/gems/solid_queue-0.3.4/app/models/solid_queue/claimed_execution.rb:18:in `claiming'
    from /usr/local/bundle/ruby/3.3.0/gems/solid_queue-0.3.4/app/models/solid_queue/ready_execution.rb:39:in `lock_candidates'
    from /usr/local/bundle/ruby/3.3.0/gems/solid_queue-0.3.4/app/models/solid_queue/ready_execution.rb:28:in `block in select_and_lock'
        ...

Environment

Additional Information

Attempted Solutions

Questions

  1. Could this be related to a race condition in the job claiming process?
  2. Are there any known issues with UUID job IDs in Solid Queue?
  3. Are there any additional debugging steps you'd recommend?

Complete logs

DETAIL:  Key (job_id)=(afa0c46c-e2c5-4afe-9c4e-15b4fc024dfc) already exists.

    from /usr/local/bundle/ruby/3.3.0/gems/activerecord-7.1.3.4/lib/active_record/connection_adapters/postgresql_adapter.rb:894:in `block (2 levels) in exec_no_cache'
    from /usr/local/bundle/ruby/3.3.0/gems/activerecord-7.1.3.4/lib/active_record/connection_adapters/abstract_adapter.rb:1028:in `block in with_raw_connection'
    from /usr/local/bundle/ruby/3.3.0/gems/activesupport-7.1.3.4/lib/active_support/concurrency/null_lock.rb:9:in `synchronize'
    from /usr/local/bundle/ruby/3.3.0/gems/activerecord-7.1.3.4/lib/active_record/connection_adapters/abstract_adapter.rb:1000:in `with_raw_connection'
    from /usr/local/bundle/ruby/3.3.0/gems/activerecord-7.1.3.4/lib/active_record/connection_adapters/postgresql_adapter.rb:893:in `block in exec_no_cache'
    from /usr/local/bundle/ruby/3.3.0/gems/activesupport-7.1.3.4/lib/active_support/notifications/instrumenter.rb:58:in `instrument'
    from /usr/local/bundle/ruby/3.3.0/gems/activerecord-7.1.3.4/lib/active_record/connection_adapters/abstract_adapter.rb:1143:in `log'
    from /usr/local/bundle/ruby/3.3.0/gems/activerecord-7.1.3.4/lib/active_record/connection_adapters/postgresql_adapter.rb:892:in `exec_no_cache'
    from /usr/local/bundle/ruby/3.3.0/gems/activerecord-7.1.3.4/lib/active_record/connection_adapters/postgresql_adapter.rb:872:in `execute_and_clear'
    from /usr/local/bundle/ruby/3.3.0/gems/activerecord-7.1.3.4/lib/active_record/connection_adapters/postgresql/database_statements.rb:64:in `internal_exec_query'
    from /usr/local/bundle/ruby/3.3.0/gems/activerecord-7.1.3.4/lib/active_record/connection_adapters/abstract/database_statements.rb:171:in `exec_insert_all'
    from /usr/local/bundle/ruby/3.3.0/gems/activerecord-7.1.3.4/lib/active_record/connection_adapters/abstract/query_cache.rb:25:in `exec_insert_all'
    from /usr/local/bundle/ruby/3.3.0/gems/activerecord-7.1.3.4/lib/active_record/insert_all.rb:47:in `execute'
    from /usr/local/bundle/ruby/3.3.0/gems/activerecord-7.1.3.4/lib/active_record/persistence.rb:243:in `insert_all!'
    from /usr/local/bundle/ruby/3.3.0/gems/solid_queue-0.3.4/app/models/solid_queue/claimed_execution.rb:18:in `claiming'
    from /usr/local/bundle/ruby/3.3.0/gems/solid_queue-0.3.4/app/models/solid_queue/ready_execution.rb:39:in `lock_candidates'
    from /usr/local/bundle/ruby/3.3.0/gems/solid_queue-0.3.4/app/models/solid_queue/ready_execution.rb:28:in `block in select_and_lock'
    from /usr/local/bundle/ruby/3.3.0/gems/activerecord-7.1.3.4/lib/active_record/connection_adapters/abstract/transaction.rb:535:in `block in within_new_transaction'
    from /usr/local/bundle/ruby/3.3.0/gems/activesupport-7.1.3.4/lib/active_support/concurrency/null_lock.rb:9:in `synchronize'
    from /usr/local/bundle/ruby/3.3.0/gems/activerecord-7.1.3.4/lib/active_record/connection_adapters/abstract/transaction.rb:532:in `within_new_transaction'
    from /usr/local/bundle/ruby/3.3.0/gems/activerecord-7.1.3.4/lib/active_record/connection_adapters/abstract/database_statements.rb:344:in `transaction'
    from /usr/local/bundle/ruby/3.3.0/gems/activerecord-7.1.3.4/lib/active_record/transactions.rb:212:in `transaction'
    from /usr/local/bundle/ruby/3.3.0/gems/solid_queue-0.3.4/app/models/solid_queue/ready_execution.rb:26:in `select_and_lock'
    from /usr/local/bundle/ruby/3.3.0/gems/solid_queue-0.3.4/app/models/solid_queue/ready_execution.rb:12:in `block in claim'
    from /usr/local/bundle/ruby/3.3.0/gems/solid_queue-0.3.4/app/models/solid_queue/ready_execution.rb:11:in `each'
    from /usr/local/bundle/ruby/3.3.0/gems/solid_queue-0.3.4/app/models/solid_queue/ready_execution.rb:11:in `flat_map'
    from /usr/local/bundle/ruby/3.3.0/gems/solid_queue-0.3.4/app/models/solid_queue/ready_execution.rb:11:in `claim'
    from /usr/local/bundle/ruby/3.3.0/gems/solid_queue-0.3.4/lib/solid_queue/worker.rb:34:in `block in claim_executions'
    from /usr/local/bundle/ruby/3.3.0/gems/solid_queue-0.3.4/lib/solid_queue/processes/poller.rb:48:in `block in with_polling_volume'
    from /usr/local/bundle/ruby/3.3.0/gems/activesupport-7.1.3.4/lib/active_support/logger_silence.rb:18:in `block in silence'
    from /usr/local/bundle/ruby/3.3.0/gems/activesupport-7.1.3.4/lib/active_support/logger_thread_safe_level.rb:45:in `log_at'
    from /usr/local/bundle/ruby/3.3.0/gems/activesupport-7.1.3.4/lib/active_support/logger_silence.rb:18:in `silence'
    from /usr/local/bundle/ruby/3.3.0/gems/solid_queue-0.3.4/lib/solid_queue/processes/poller.rb:48:in `with_polling_volume'
    from /usr/local/bundle/ruby/3.3.0/gems/solid_queue-0.3.4/lib/solid_queue/worker.rb:33:in `claim_executions'
    from /usr/local/bundle/ruby/3.3.0/gems/solid_queue-0.3.4/lib/solid_queue/worker.rb:23:in `poll'
    from /usr/local/bundle/ruby/3.3.0/gems/solid_queue-0.3.4/lib/solid_queue/processes/poller.rb:31:in `block (2 levels) in start_loop'
    from /usr/local/bundle/ruby/3.3.0/gems/activesupport-7.1.3.4/lib/active_support/execution_wrapper.rb:92:in `wrap'
    from /usr/local/bundle/ruby/3.3.0/gems/solid_queue-0.3.4/lib/solid_queue/app_executor.rb:7:in `wrap_in_app_executor'
    from /usr/local/bundle/ruby/3.3.0/gems/solid_queue-0.3.4/lib/solid_queue/processes/poller.rb:30:in `block in start_loop'
    from <internal:kernel>:187:in `loop'
    from /usr/local/bundle/ruby/3.3.0/gems/solid_queue-0.3.4/lib/solid_queue/processes/poller.rb:27:in `start_loop'
    from /usr/local/bundle/ruby/3.3.0/gems/solid_queue-0.3.4/lib/solid_queue/processes/poller.rb:22:in `run'
    from /usr/local/bundle/ruby/3.3.0/gems/solid_queue-0.3.4/lib/solid_queue/processes/runnable.rb:16:in `start'
    from /usr/local/bundle/ruby/3.3.0/gems/solid_queue-0.3.4/lib/solid_queue/supervisor.rb:132:in `block in start_fork'
    from /usr/local/bundle/ruby/3.3.0/gems/solid_queue-0.3.4/lib/solid_queue/supervisor.rb:131:in `fork'
    from /usr/local/bundle/ruby/3.3.0/gems/solid_queue-0.3.4/lib/solid_queue/supervisor.rb:131:in `start_fork'
    from /usr/local/bundle/ruby/3.3.0/gems/solid_queue-0.3.4/lib/solid_queue/supervisor.rb:162:in `block in replace_fork'
    from /usr/local/bundle/ruby/3.3.0/gems/activesupport-7.1.3.4/lib/active_support/notifications.rb:206:in `block in instrument'
    from /usr/local/bundle/ruby/3.3.0/gems/activesupport-7.1.3.4/lib/active_support/notifications/instrumenter.rb:58:in `instrument'
    from /usr/local/bundle/ruby/3.3.0/gems/activesupport-7.1.3.4/lib/active_support/notifications.rb:206:in `instrument'
    from /usr/local/bundle/ruby/3.3.0/gems/solid_queue-0.3.4/lib/solid_queue.rb:59:in `instrument'
    from /usr/local/bundle/ruby/3.3.0/gems/solid_queue-0.3.4/lib/solid_queue/supervisor.rb:159:in `replace_fork'
    from /usr/local/bundle/ruby/3.3.0/gems/solid_queue-0.3.4/lib/solid_queue/supervisor.rb:143:in `block in reap_and_replace_terminated_forks'
    from <internal:kernel>:187:in `loop'
    from /usr/local/bundle/ruby/3.3.0/gems/solid_queue-0.3.4/lib/solid_queue/supervisor.rb:139:in `reap_and_replace_terminated_forks'
    from /usr/local/bundle/ruby/3.3.0/gems/solid_queue-0.3.4/lib/solid_queue/supervisor.rb:50:in `block in supervise'
    from <internal:kernel>:187:in `loop'
    from /usr/local/bundle/ruby/3.3.0/gems/solid_queue-0.3.4/lib/solid_queue/supervisor.rb:46:in `supervise'
    from /usr/local/bundle/ruby/3.3.0/gems/solid_queue-0.3.4/lib/solid_queue/supervisor.rb:27:in `start'
    from /usr/local/bundle/ruby/3.3.0/gems/solid_queue-0.3.4/lib/solid_queue/supervisor.rb:12:in `start'
    from /usr/local/bundle/ruby/3.3.0/gems/solid_queue-0.3.4/lib/solid_queue/tasks.rb:4:in `block (2 levels) in <main>'
    from /usr/local/bundle/ruby/3.3.0/gems/rake-13.2.1/lib/rake/task.rb:281:in `block in execute'
    from /usr/local/bundle/ruby/3.3.0/gems/rake-13.2.1/lib/rake/task.rb:281:in `each'
    from /usr/local/bundle/ruby/3.3.0/gems/rake-13.2.1/lib/rake/task.rb:281:in `execute'
    from /usr/local/bundle/ruby/3.3.0/gems/rake-13.2.1/lib/rake/task.rb:219:in `block in invoke_with_call_chain'
    from /usr/local/bundle/ruby/3.3.0/gems/rake-13.2.1/lib/rake/task.rb:199:in `synchronize'
    from /usr/local/bundle/ruby/3.3.0/gems/rake-13.2.1/lib/rake/task.rb:199:in `invoke_with_call_chain'
    from /usr/local/bundle/ruby/3.3.0/gems/rake-13.2.1/lib/rake/task.rb:188:in `invoke'
    from /usr/local/bundle/ruby/3.3.0/gems/rake-13.2.1/lib/rake/application.rb:188:in `invoke_task'
    from /usr/local/bundle/ruby/3.3.0/gems/rake-13.2.1/lib/rake/application.rb:138:in `block (2 levels) in top_level'
    from /usr/local/bundle/ruby/3.3.0/gems/rake-13.2.1/lib/rake/application.rb:138:in `each'
    from /usr/local/bundle/ruby/3.3.0/gems/rake-13.2.1/lib/rake/application.rb:138:in `block in top_level'
    from /usr/local/bundle/ruby/3.3.0/gems/rake-13.2.1/lib/rake/application.rb:147:in `run_with_threads'
    from /usr/local/bundle/ruby/3.3.0/gems/rake-13.2.1/lib/rake/application.rb:132:in `top_level'
    from /usr/local/bundle/ruby/3.3.0/gems/rake-13.2.1/lib/rake/application.rb:83:in `block in run'
    from /usr/local/bundle/ruby/3.3.0/gems/rake-13.2.1/lib/rake/application.rb:214:in `standard_exception_handling'
    from /usr/local/bundle/ruby/3.3.0/gems/rake-13.2.1/lib/rake/application.rb:80:in `run'
    from /usr/local/bundle/ruby/3.3.0/gems/rake-13.2.1/exe/rake:27:in `<top (required)>'
    from /usr/local/bundle/ruby/3.3.0/bin/rake:25:in `load'
    from /usr/local/bundle/ruby/3.3.0/bin/rake:25:in `<top (required)>'
    from /usr/local/bundle/ruby/3.3.0/gems/bundler-2.5.5/lib/bundler/cli/exec.rb:58:in `load'
    from /usr/local/bundle/ruby/3.3.0/gems/bundler-2.5.5/lib/bundler/cli/exec.rb:58:in `kernel_load'
    from /usr/local/bundle/ruby/3.3.0/gems/bundler-2.5.5/lib/bundler/cli/exec.rb:23:in `run'
    from /usr/local/bundle/ruby/3.3.0/gems/bundler-2.5.5/lib/bundler/cli.rb:451:in `exec'
    from /usr/local/bundle/ruby/3.3.0/gems/bundler-2.5.5/lib/bundler/vendor/thor/lib/thor/command.rb:28:in `run'
    from /usr/local/bundle/ruby/3.3.0/gems/bundler-2.5.5/lib/bundler/vendor/thor/lib/thor/invocation.rb:127:in `invoke_command'
    from /usr/local/bundle/ruby/3.3.0/gems/bundler-2.5.5/lib/bundler/vendor/thor/lib/thor.rb:527:in `dispatch'
    from /usr/local/bundle/ruby/3.3.0/gems/bundler-2.5.5/lib/bundler/cli.rb:34:in `dispatch'
    from /usr/local/bundle/ruby/3.3.0/gems/bundler-2.5.5/lib/bundler/vendor/thor/lib/thor/base.rb:584:in `start'
    from /usr/local/bundle/ruby/3.3.0/gems/bundler-2.5.5/lib/bundler/cli.rb:28:in `start'
    from /usr/local/bundle/ruby/3.3.0/gems/bundler-2.5.5/exe/bundle:28:in `block in <top (required)>'
    from /usr/local/bundle/ruby/3.3.0/gems/bundler-2.5.5/lib/bundler/friendly_errors.rb:117:in `with_friendly_errors'
    from /usr/local/bundle/ruby/3.3.0/gems/bundler-2.5.5/exe/bundle:20:in `<top (required)>'
    from /usr/local/bundle/bin/bundle:25:in `load'
    from /usr/local/bundle/bin/bundle:25:in `<main>'
I, [2024-07-27T20:19:43.384580 #1]  INFO -- : SolidQueue-0.3.4 Replaced terminated Worker (89.1ms)  pid: 2181, status: 1, pid_from_status: 2181, signaled: false, stopsig: nil, termsig: nil, hostname: "<hostname_ip>-c8e160be6658"
I, [2024-07-27T20:19:43.477333 #2188]  INFO -- : SolidQueue-0.3.4 Register Worker (88.0ms)  pid: 2188, hostname: "<hostname_ip>-c8e160be6658"
I, [2024-07-27T20:19:43.479008 #2188]  INFO -- : SolidQueue-0.3.4 Started Worker (91.6ms)  pid: 2188, hostname: "<hostname_ip>-c8e160be6658", polling_interval: 0.1, queues: "*", thread_pool_size: 5
rosa commented 2 months ago

Huh, this one is quite strange! I haven't heard of this one happening before 😕 It must be related to using UUID for sure but so weird.

Could this be related to a race condition in the job claiming process?

Hmm... if this was the case, it'd happen in a similar way when using auto-increment PKs. I think there shouldn't be a race condition because only job IDs that are locked can be converted into claimed executions, so two workers for the same queue would never try to claim the same jobs because they wouldn't be able to lock them.

Are there any known issues with UUID job IDs in Solid Queue?

This is the first time I heard about it but I'm not sure if anyone is using UUID job IDs.

Could you let me know how the UUIDs are being assigned? Using UUIDs as PK wasn't on my mind at all, so the migrations don't use that. Did you edit the migrations after installing them and before running them?

asndiallo commented 2 months ago

Thank you for your response. I appreciate your insights on this unusual issue. To answer your questions:

  1. Regarding UUID assignment: Before running the migrations, I modified the migration file to use UUIDs for all tables. Here's an example of how I changed the table creation:

    create_table :solid_queue_jobs, id: :uuid do |t|
         # ... rest of the table definition
    end

    I applied this change to all Solid Queue tables (jobs, scheduled_executions, ready_executions, claimed_executions, blocked_executions, failed_executions, pauses, processes, and semaphores).

  2. UUID Generation:

Could you let me know how the UUIDs are being assigned?

I'm using the pgcrypto extension in my PostgreSQL database for UUID generation. This is set up in my database configuration and is used consistently across all tables in my application, not just for Solid Queue tables. I wanted to ensure consistency across the database and to leverage the benefits of UUIDs.

Given this information, do you think the UUID usage could be interfering with Solid Queue's job claiming process in some unexpected way? Are there any parts of Solid Queue that might assume integer-based primary keys or rely on their sequential nature?

asndiallo commented 2 months ago

Additional important information:

  1. Temporary fix: I manually cleaned the problematic jobs and their associated executions, which resolved the immediate issue.
  2. Current status: The server is running properly after this cleanup.
  3. Error handling: I've added handling for ActiveRecord::RecordNotUnique exceptions in my ApplicationJob class.
  4. Recurring problem: This is a known issue that has occurred multiple times before.

Despite these measures, I'm concerned about future recurrences and am seeking a permanent solution. Do you have any suggestions for preventing this issue or for implementing a more robust job claiming process that can handle potential UUID conflicts?

rosa commented 2 months ago

Thanks a lot for the extra information, that's very helpful.

Are there any parts of Solid Queue that might assume integer-based primary keys or rely on their sequential nature?

No, none at all... 🤔 In the cases where we need to order by job_id, job_id is explicitly included in the used index, and sorting on that is generally done to avoid deadlocks. With UUDIs, it should work exactly in the same way. I'm wondering... are the foreign keys in the execution tables also correctly created as UUID? I imagine yes, but asking just in case. I mean for example in this case

    create_table :solid_queue_claimed_executions do |t|
      t.references :job, index: { unique: true }, null: false
      t.bigint :process_id
      t.datetime :created_at, null: false

      t.index [ :process_id, :job_id ]
    end

is the resulting table using uuid for job_id? Or should you also add this:

    create_table :solid_queue_claimed_executions do |t|
      t.references :job, index: { unique: true }, null: false, type: :uuid
...
    end

Moreover, process_id is defined as bigint there, but it should be changed to uuid as well because it'll include the ID from a record in the solid_queue_process table 🤔

asndiallo commented 2 months ago

are the foreign keys in the execution tables also correctly created as UUID?

Yes all foreign keys are also set to uuid.

Moreover, process_id is defined as bigint there, but it should be changed to uuid as well because it'll include the ID from a record in the solid_queue_process table 🤔

You're absolutely right. I overlooked changing process_id to uuid

Here's my current migration for the solid_queue_claimed_executions table:

create_table :solid_queue_claimed_executions, id: :uuid do |t|
  t.references :job, index: { unique: true }, null: false, type: :uuid
  t.bigint :process_id
  t.datetime :created_at, null: false

  t.index [ :process_id, :job_id ]
end

I'll create a new migration to change process_id as well as supervisor_id in solid_queue_processes to UUID.

But do you think this inconsistency could potentially be the source of the "Key (job_id) already exists" error I've been experiencing?

rosa commented 2 months ago

But do you think this inconsistency could potentially be the source of the "Key (job_id) already exists" error I've been experiencing?

I'm not sure; I think it shouldn't, but it would quite possibly interfere with claiming jobs because process_id is used by the worker to "mark" claimed executions as theirs, but I'd expected this to surface on a different line than the one you're getting the error on. It seems your error is happening on the line marked with * here:

      job_data = Array(job_ids).collect { |job_id| { job_id: job_id, process_id: process_id } }

      SolidQueue.instrument(:claim, process_id: process_id, job_ids: job_ids) do |payload|
*      insert_all!(job_data)
        where(job_id: job_ids, process_id: process_id).load.tap do |claimed|
          block.call(claimed)

I'd expect the process_id wrong type to cause problems right after that 🤔

It could also have caused issues with releasing claimed executions after a worker is shutdown, but that would just have left jobs in claimed state without being run, but not cause duplicated jobs there, because the jobs would have been claimed, they wouldn't have been "claimable".

Let me know how it goes after that migration. If you hit the error again, could you keep the problematic jobs and their executions? I'd like to inspect them to see if that has a clue about where the problem actually is.

rosa commented 2 months ago

Hey @asndiallo, did you hit the issue again after running that migration for process_id?

asndiallo commented 2 months ago

Hi @rosa I haven't encountered the issue ever since. Maybe the inconsistant column types were the root cause of this. I'm going to go ahead and close this issue. Thank you so much for your assistance :)

rosa commented 2 months ago

Awesome! Thanks a lot 🙏