socialpandas / sidekiq-superworker

Directed acyclic graphs of Sidekiq jobs
MIT License

Children of Parallel Superworkers may enqueue Parent's Next Subjob to Sidekiq multiple times #14

Open · thoughtchad opened this issue 10 years ago

thoughtchad commented 10 years ago

Let me start by saying thank you, @SocialPanda, for Superworker and Monitor! The two really make for a sweet combo for what I am doing with Sidekiq.

The Issue: If a superworker has sequential groups of parallel subjobs whose children all finish at the same time (within milliseconds), the parent's next subjob can be enqueued multiple times. This could probably manifest itself in other ways, but this was the core symptom I was seeing. The root cause is in descendants_are_complete >> enqueue(next_subjob).

Example:

```ruby
Superworker do
  parallel do
    quick_job
    quick_job
    quick_job
    quick_job
    quick_job
  end
  parallel do
    quick_job_2
    quick_job_2
    quick_job_2
    quick_job_2
    quick_job_2
  end
end
```

Sometimes, though not always, the second parallel set will get sent to Sidekiq multiple times. The more quick_jobs there are (or, more generally, the more child subjobs that finish quickly and at roughly the same time), the higher the probability that this occurs.
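To make the race concrete, here is a tiny self-contained simulation (no gem code: the status array stands in for subjob rows, the queue stands in for Sidekiq pushes, and the check mirrors the descendants_are_complete >> enqueue path). Under JRuby, which we run, real thread parallelism makes it fire readily:

```ruby
require 'thread'

statuses = Array.new(5) { 'running' }
enqueues = Queue.new  # stands in for Sidekiq pushes of the parent's next subjob

threads = 5.times.map do |i|
  Thread.new do
    statuses[i] = 'complete'                  # this "child subjob" finishes
    if statuses.all? { |s| s == 'complete' }  # descendants_are_complete
      enqueues << :next_subjob                # enqueue(next_subjob)
    end
  end
end
threads.each(&:join)

# Any value greater than 1 is the bug: several children saw "all complete".
puts "next subjob enqueued #{enqueues.size} time(s)"
```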

The whole approach of this fix was to change the code as little as possible; overall, the fix is a band-aid. Let me know if you want a pull request.

chad-lancour@373d06a56105752b990afd6e5ac408a26a6bba48

The quick fix was to lock the next subjob's row and update it to an interim status of 'queued' to ensure that multiple completing subjobs would not pick it up, then pass along the original next_subjob with its original status of 'initialized'. I decided to send the original subjob because I did not want to change the enqueue logic, which only processes 'initialized' subjobs, and potentially break other areas of the logic.
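In code, the band-aid looks roughly like this (a sketch rather than the exact patch in the commit above; Subjob stands in for the gem's subjob model, and claim_and_enqueue_next is a hypothetical name):

```ruby
# Lock the next subjob's row, flip it to an interim 'queued' status inside
# the transaction, and let only the winning child enqueue it.
def claim_and_enqueue_next(next_subjob)
  claimed = nil
  Subjob.transaction do
    row = Subjob.lock.find(next_subjob.id)      # SELECT ... FOR UPDATE
    if row.status == 'initialized'
      row.update_attributes(status: 'queued')   # other finishers now skip it
      claimed = row
    end
  end
  return unless claimed

  # Present the subjob with its original 'initialized' status so the
  # unchanged enqueue logic (which only processes 'initialized') accepts it.
  claimed.status = 'initialized'
  enqueue(claimed)
end
```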

I think a better fix would involve thoroughly decoupling the subjob-completion process from the enqueue process:

  1. Refactor the code to ensure that each subjob only updates its own subjob record; don't touch parents, siblings, children, etc.
  2. Create a separate process in its own thread that is responsible for monitoring parallel, batch, and superworker job states. This process would determine what's next, properly update subjob records, and send subjobs to Sidekiq (see the sketch after this list). Overall this might entail using a proper queuing mechanism designed for highly parallel environments, and it could better support multi-node Sidekiq clients; it probably makes sense to move a portion of this into Redis.
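For what it's worth, a rough sketch of how the claim step in idea 2 could work with a Redis lock. Nothing like this exists in the gem today; mark_complete, advance, and the key names are hypothetical, while the redis-rb calls (sadd, set with nx/ex, del) are standard client methods:

```ruby
require 'redis'

# Each completing child records only its own completion; it never touches
# parent or sibling rows (idea 1 above).
def mark_complete(redis, superjob_id, subjob_id)
  redis.sadd("superjob:#{superjob_id}:complete", subjob_id)
end

# The monitor thread advances the superjob. SET with NX + EX acts as a
# per-superjob lock, so even across multiple Sidekiq nodes only one
# process decides what's next and enqueues it.
def advance(redis, superjob_id)
  return unless redis.set("superjob:#{superjob_id}:lock", '1', nx: true, ex: 30)
  begin
    # Inspect the recorded completions, update subjob records, and enqueue
    # the next parallel group exactly once.
  ensure
    redis.del("superjob:#{superjob_id}:lock")
  end
end
```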

Side notes: A. During debugging and testing, I implemented optimistic locking on the sidekiq_superworker_subjobs table by adding a lock_version column and setting ActiveRecord::Base.lock_optimistically = true. The overall result is that there are several places where the code updates stale subjob records. Even though most of the time only the updated_at field is stale, the results show that completions by many subjobs end up touching a wide swath of subjob records.
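Concretely, the experiment amounted to a migration like this (a sketch; the migration class name is mine, and lock_optimistically is already true by default, so setting it is just explicit):

```ruby
# Add the magic lock_version column; once it exists, ActiveRecord bumps it
# on every update and raises ActiveRecord::StaleObjectError whenever a
# process saves on top of a stale row, which is how the stale updates showed up.
class AddLockVersionToSubjobs < ActiveRecord::Migration
  def change
    add_column :sidekiq_superworker_subjobs, :lock_version, :integer,
               default: 0, null: false
  end
end

ActiveRecord::Base.lock_optimistically = true  # the default, made explicit
```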

B. Another tidbit I noticed was the mixed use of subjob.update_attribute and subjob.update_attributes. These two methods differ quite a bit in what they do, and there are some differences between Rails 3 and 4 as well; I would suggest using update_attributes everywhere for consistency. Also, those methods just return true/false even when the update fails, so perhaps update_attributes! would be the safer approach.
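For reference, the difference in plain Rails 3/4 ActiveRecord terms (not gem code):

```ruby
subjob.update_attribute(:status, 'complete')    # one column, SKIPS validations
subjob.update_attributes(status: 'complete')    # runs validations; returns false on failure
subjob.update_attributes!(status: 'complete')   # raises ActiveRecord::RecordInvalid on failure
```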

C. FYI, our environment: JDK 7u25, TorqueBox 3.0, JRuby 1.7.11, Rails 4.0.3, ActiveRecord on MySQL 5.1.69, Sidekiq 2.17.7, Sidetiq 0.5.0, sidekiq-benchmark, sidekiq-failures, sidekiq-unique-jobs; Sidekiq client concurrency 30.