Closed asndiallo closed 2 months ago
Huh, this one is quite strange! I haven't heard of this one happening before 😕 It must be related to using UUID for sure but so weird.
Could this be related to a race condition in the job claiming process?
Hmm... if this was the case, it'd happen in a similar way when using auto-increment PKs. I think there shouldn't be a race condition because only job IDs that are locked can be converted into claimed executions, so two workers for the same queue would never try to claim the same jobs because they wouldn't be able to lock them.
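The lock-then-claim reasoning above can be illustrated with a toy in-memory model. This is only a sketch, not Solid Queue's code: `ToyReadyQueue`, `lock_and_take`, and `run_workers` are hypothetical names, and a `Mutex` stands in for the database row locks. The point is that once taking jobs is atomic, no job ID can end up with two workers, whether the IDs are integers or UUIDs.

```ruby
require "securerandom"

# Toy stand-in for a table of ready jobs. Hypothetical, not Solid Queue code.
class ToyReadyQueue
  def initialize(job_ids)
    @ready = job_ids.dup
    @mutex = Mutex.new # plays the role of the database row locks
  end

  # Atomically take up to `limit` job IDs. A job taken by one caller is no
  # longer visible to any other caller.
  def lock_and_take(limit)
    @mutex.synchronize { @ready.shift(limit) }
  end
end

# Run several worker threads against the same queue and return the list of
# job IDs each worker ended up with.
def run_workers(job_ids, worker_count: 4, batch_size: 3)
  queue = ToyReadyQueue.new(job_ids)
  claims = Array.new(worker_count) { [] }

  threads = worker_count.times.map do |i|
    Thread.new do
      loop do
        batch = queue.lock_and_take(batch_size)
        break if batch.empty?
        claims[i].concat(batch)
      end
    end
  end
  threads.each(&:join)
  claims
end

job_ids = Array.new(50) { SecureRandom.uuid }
all_claimed = run_workers(job_ids).flatten

puts "every job claimed once: #{all_claimed.sort == job_ids.sort}"
```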
Are there any known issues with UUID job IDs in Solid Queue?
This is the first time I heard about it but I'm not sure if anyone is using UUID job IDs.
Could you let me know how the UUIDs are being assigned? Using UUIDs as PK wasn't on my mind at all, so the migrations don't use that. Did you edit the migrations after installing them and before running them?
Thank you for your response. I appreciate your insights on this unusual issue. To answer your questions:
Regarding UUID assignment: Before running the migrations, I modified the migration file to use UUIDs for all tables. Here's an example of how I changed the table creation:
create_table :solid_queue_jobs, id: :uuid do |t|
  # ... rest of the table definition
end
I applied this change to all Solid Queue tables (jobs, scheduled_executions, ready_executions, claimed_executions, blocked_executions, failed_executions, pauses, processes, and semaphores).
UUID Generation:
Could you let me know how the UUIDs are being assigned?
I'm using the pgcrypto extension in my PostgreSQL database for UUID generation. This is set up in my database configuration and is used consistently across all tables in my application, not just for Solid Queue tables. I wanted to ensure consistency across the database and to leverage the benefits of UUIDs.
Given this information, do you think the UUID usage could be interfering with Solid Queue's job claiming process in some unexpected way? Are there any parts of Solid Queue that might assume integer-based primary keys or rely on their sequential nature?
Additional important information:
ActiveRecord::RecordNotUnique exceptions in my ApplicationJob class.
Despite these measures, I'm concerned about future recurrences and am seeking a permanent solution. Do you have any suggestions for preventing this issue or for implementing a more robust job claiming process that can handle potential UUID conflicts?
Thanks a lot for the extra information, that's very helpful.
Are there any parts of Solid Queue that might assume integer-based primary keys or rely on their sequential nature?
No, none at all... 🤔 In the cases where we need to order by job_id, job_id is explicitly included in the index used, and sorting on it is generally done to avoid deadlocks. With UUIDs, it should work in exactly the same way. I'm wondering... are the foreign keys in the execution tables also correctly created as UUID? I imagine so, but I'm asking just in case. I mean, for example, in this case
create_table :solid_queue_claimed_executions do |t|
  t.references :job, index: { unique: true }, null: false
  t.bigint :process_id
  t.datetime :created_at, null: false
  t.index [ :process_id, :job_id ]
end
is the resulting table using uuid for job_id? Or should you also add this:
create_table :solid_queue_claimed_executions do |t|
  t.references :job, index: { unique: true }, null: false, type: :uuid
  ...
end
Moreover, process_id is defined as bigint there, but it should be changed to uuid as well because it'll hold the ID of a record in the solid_queue_processes table 🤔
are the foreign keys in the execution tables also correctly created as UUID?
Yes all foreign keys are also set to uuid.
Moreover, process_id is defined as bigint there, but it should be changed to uuid as well because it'll include the ID from a record in the solid_queue_process table 🤔
You're absolutely right. I overlooked changing process_id to uuid. Here's my current migration for the solid_queue_claimed_executions table:
create_table :solid_queue_claimed_executions, id: :uuid do |t|
  t.references :job, index: { unique: true }, null: false, type: :uuid
  t.bigint :process_id
  t.datetime :created_at, null: false
  t.index [ :process_id, :job_id ]
end
I'll create a new migration to change process_id, as well as supervisor_id in solid_queue_processes, to UUID.
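A follow-up migration along those lines might look roughly like the sketch below. It is not taken from the thread and rests on several assumptions: the class name and the `[7.1]` migration version are made up for illustration, the index on supervisor_id is assumed to exist in your schema, and the columns are dropped and re-added rather than cast because PostgreSQL has no direct bigint-to-uuid cast. Dropping the columns discards their current values, so it is probably safest to run this while no workers are running.

```ruby
# Sketch only. The class name and migration version are assumptions; adjust
# them to your application before using anything like this.
class ChangeSolidQueueProcessReferencesToUuid < ActiveRecord::Migration[7.1]
  def up
    # PostgreSQL drops any index that involves a dropped column, so the
    # [ :process_id, :job_id ] index has to be recreated afterwards.
    remove_column :solid_queue_claimed_executions, :process_id
    add_column :solid_queue_claimed_executions, :process_id, :uuid
    add_index :solid_queue_claimed_executions, [ :process_id, :job_id ]

    remove_column :solid_queue_processes, :supervisor_id
    add_column :solid_queue_processes, :supervisor_id, :uuid
    # Assumes the original schema indexed supervisor_id; skip this otherwise.
    add_index :solid_queue_processes, :supervisor_id
  end

  def down
    # The original bigint values are gone, so there is nothing to restore.
    raise ActiveRecord::IrreversibleMigration
  end
end
```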
But do you think this inconsistency could potentially be the source of the "Key (job_id) already exists" error I've been experiencing?
But do you think this inconsistency could potentially be the source of the "Key (job_id) already exists" error I've been experiencing?
I'm not sure. I don't think it should, but it could well interfere with claiming jobs, because process_id is used by the worker to "mark" claimed executions as its own. Still, I'd expect that to surface on a different line than the one you're getting the error on. It seems your error is happening on the line marked with * here:
job_data = Array(job_ids).collect { |job_id| { job_id: job_id, process_id: process_id } }
SolidQueue.instrument(:claim, process_id: process_id, job_ids: job_ids) do |payload|
* insert_all!(job_data)
  where(job_id: job_ids, process_id: process_id).load.tap do |claimed|
    block.call(claimed)
I'd expect the wrong process_id type to cause problems right after that 🤔
It could also have caused issues with releasing claimed executions after a worker is shut down, but that would only have left jobs in the claimed state without being run. It wouldn't cause duplicated jobs there, because once the jobs had been claimed, they would no longer be "claimable".
Let me know how it goes after that migration. If you hit the error again, could you keep the problematic jobs and their executions? I'd like to inspect them to see if they offer a clue about where the problem actually is.
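The argument about claimed jobs no longer being "claimable" can be sketched with another toy in-memory model. Everything here is hypothetical and only loosely mirrors the excerpt above: `ToyQueue`, `ToyClaimedExecutions`, and `DuplicateKeyError` are illustrative names, and a Hash keyed by job_id stands in for the unique index. A job leaves the ready list when it is claimed, so a claim that is never released stays stuck but is not claimed a second time. A duplicate-key error needs a job_id that is somehow presented for claiming twice.

```ruby
# Toy model, not Solid Queue code. All names below are hypothetical.
class DuplicateKeyError < StandardError; end

class ToyClaimedExecutions
  def initialize
    @rows = {} # job_id => process_id; the Hash keys act as the unique index
  end

  # Loosely mirrors insert_all!: reject the whole batch if any job_id is
  # already present, otherwise store every row.
  def insert_all!(job_data)
    job_data.each do |row|
      if @rows.key?(row[:job_id])
        raise DuplicateKeyError, "job_id #{row[:job_id]} is already claimed"
      end
    end
    job_data.each { |row| @rows[row[:job_id]] = row[:process_id] }
  end

  def size
    @rows.size
  end
end

class ToyQueue
  attr_reader :ready, :claimed

  def initialize(job_ids)
    @ready = job_ids.dup
    @claimed = ToyClaimedExecutions.new
  end

  # Move up to `limit` jobs from ready to claimed on behalf of `process_id`.
  def claim(limit, process_id)
    job_ids = @ready.shift(limit)
    @claimed.insert_all!(job_ids.map { |id| { job_id: id, process_id: process_id } })
    job_ids
  end
end

queue = ToyQueue.new(%w[job-a job-b job-c])
queue.claim(2, "worker-1") # never released: those two jobs stay claimed
queue.claim(2, "worker-2") # only sees what is still ready

begin
  # Forcing an already claimed job_id back in is what trips the unique index.
  queue.claimed.insert_all!([{ job_id: "job-a", process_id: "worker-3" }])
rescue DuplicateKeyError => e
  puts "rejected: #{e.message}"
end
```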
Hey @asndiallo, did you hit the issue again after running that migration for process_id?
Hi @rosa, I haven't encountered the issue since. Maybe the inconsistent column types were the root cause of this. I'm going to go ahead and close this issue. Thank you so much for your assistance :)
Awesome! Thanks a lot 🙏
Description
I'm encountering a persistent error when SolidQueue attempts to claim jobs. The error occurs even after cleaning up stale jobs and trying to create completely different jobs.
Error Message
Environment
Additional Information
Attempted Solutions
Questions
Complete logs