mhenrixon / sidekiq-unique-jobs

Prevents duplicate Sidekiq jobs
MIT License

until_and_while_executing jobs can conflict with themselves #630

Open exop-it opened 2 years ago

exop-it commented 2 years ago

Describe the bug

Jobs for workers that use until_and_while_executing sometimes get stuck because they appear to conflict with themselves.

Expected behavior

Executing StatusRequestWorker.perform_async(123) should run the job.

Current behavior

Note: the problem only occurs sporadically.

A SidekiqUniqueJobs::Conflict error is raised. The error message says "Item with the key: uniquejobs:0a7563aa8a2c1b4d81c6c8d02a4b3e96:RUN is already scheduled or processing".

Further inspection of the locks reveals that the lock is taken by the job itself (JID 28b40e216188051797e08954 in this example). Since the lock is never released, the job does not execute and eventually dies once the retry attempts are exhausted.

Worker class

class StatusRequestWorker
  include Sidekiq::Worker

  sidekiq_options lock: :until_and_while_executing, on_conflict: { client: :log, server: :raise }

  def perform(id)
    # ...
  end
end

Additional context

Screenshots: two Sidekiq Pro web UI captures (2021-08-06, 15:38).

mhenrixon commented 2 years ago

I have an idea, I'll see if I can find a way to finally fix this in the next few days.

cgunther commented 2 years ago

I think I'm hitting something similar, using v7.1.12. (I've peeked at what's changed between that and v7.1.15, but nothing seems like it would affect this.)

Here's my worker options:

sidekiq_options lock: :until_and_while_executing,
                on_conflict: { client: :log, server: :raise },
                retry: 5,
                dead: false

I used the reflections API to subscribe to the debug reflection, to try to better understand the locking steps and spot a difference between when the issue is hit and when it's not:

Sidekiq.logger.level = Logger::DEBUG
SidekiqUniqueJobs.reflect do |on|
  on.debug do |*args|
    Sidekiq.logger.debug(args.inspect)
  end
end

Here's the results for a bad job execution (conflicts with itself while running) and a good job execution (everything runs as expected), with some notes: https://gist.github.com/cgunther/ed1ae5a308681f9cc6204e5dc76e5dc6

On the client side (UntilAndWhileExecuting#lock), everything seems consistent in both cases.

On the server side, however, things get interesting.

The unlocking from UntilAndWhileExecuting#execute seems consistent in both. Then the enqueueing from WhileExecuting#execute seems consistent in both.

Then in the bad scenario, there's another enqueueing, though this time without the :RUN suffix: https://gist.github.com/cgunther/ed1ae5a308681f9cc6204e5dc76e5dc6#file-bad-txt-L29-L36

That doesn't happen in the good scenario: there's a single enqueueing, and then it goes straight to primed.

You'll notice that when the bad scenario gets to primed, the digest is not for :RUN, and the rest of the calls are missing that suffix too: https://gist.github.com/cgunther/ed1ae5a308681f9cc6204e5dc76e5dc6#file-bad-txt-L38

Whereas the good scenario has :RUN in the primed line and all subsequent lines: https://gist.github.com/cgunther/ed1ae5a308681f9cc6204e5dc76e5dc6#file-good-txt-L31

The bad scenario is primed and locked without the suffix; then the next thing that happens is the raise due to the conflict with the suffix, as if it's either checking for the lock or trying to take the suffixed lock again.

The only thing that stands out to me is there's about a 0.1 second gap between the debug line after enqueueing and the second call to the queue script in the bad scenario: https://gist.github.com/cgunther/ed1ae5a308681f9cc6204e5dc76e5dc6#file-bad-txt-L27-L32

Could this be a sign of a race condition due to some delay?

Both examples were taken from our staging environment, with almost non-existent traffic and a local Redis install, so I can't imagine it's an issue of high load or network latency. We do see this in production too, with greater frequency, but I've focused my efforts so far on staging since there are fewer variables there with lower load.

Unfortunately I can't seem to replicate this locally, otherwise I'd try to dig into the code and get some backtraces around that second enqueue that seems to be the source of the problem and figure out what's triggering it.

In the meantime, I'm going to try setting the lock_timeout to 1 to see if that helps at all.
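For reference, I'm setting it globally, roughly like this (the initializer path is just an example of where it could live):

# config/initializers/sidekiq_unique_jobs.rb (example location)
# Wait up to 1 second to acquire a lock instead of giving up immediately.
SidekiqUniqueJobs.config.lock_timeout = 1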

Thanks in advance!

cgunther commented 2 years ago

I dug further into this and believe I worked up a reproduction script: https://gist.github.com/cgunther/718379c4786e445e75a4e62ce612a9ee

I think it's a latency issue. Using toxiproxy with just 20ms latency I can reliably reproduce the issue via that script, just be sure toxiproxy is running first (toxiproxy-server).
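If you just want the gist of it without opening the script, injecting latency with the toxiproxy gem looks roughly like this (the proxy name, ports, and the worker here are only illustrative, borrowed from the original report rather than copied from my script):

require "toxiproxy"
require "sidekiq"

# Assumes toxiproxy-server is already running and Redis is on its default port.
Toxiproxy.populate([
  { name: "redis", listen: "127.0.0.1:22220", upstream: "127.0.0.1:6379" }
])

# Point Sidekiq at the proxy instead of directly at Redis.
Sidekiq.configure_client do |config|
  config.redis = { url: "redis://127.0.0.1:22220" }
end

# Add 20ms of latency to responses from Redis while the block runs.
Toxiproxy[:redis].downstream(:latency, latency: 20).apply do
  StatusRequestWorker.perform_async(123)
end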

Setting a 1 second lock time (SidekiqUniqueJobs.config.lock_timeout = 1) makes the test pass.

Ultimately I'm not sure if this is a bug or working as designed/configured. The locksmith calls pop_queued asynchronously, and without a timeout that ends up calling rpoplpush. The promise wrapping pop_queued seems to time out quickly when no timeout is set, while rpoplpush hasn't returned yet, so the gem kind of aborts; but the rpoplpush call eventually succeeds, leaving data in Redis, which then causes issues when the job is retried. Or at least that's what I think is happening?

There are notes about this in the README for the while_executing lock:

NOTE Unless a conflict strategy of :raise is specified, if lock fails, the job will be dropped without notice. When told to raise, the job will be put back and retried. It would also be possible to use :reschedule with this lock.

NOTE Unless this job is configured with a lock_timeout: nil or lock_timeout: > 0 then all jobs that are attempted to be executed will just be dropped without waiting.

But there still seems to be a conflict: when you specify the conflict strategy of :raise, your job is technically retried, but it seems unlikely to ever succeed due to the lingering lock, which isn't exactly the "retry" behavior I'd expect. Maybe the gem should try to clean up the lock in the case of such a timeout, but then I'm not sure how to ensure we've waited long enough for the rpoplpush to have finished before clearing the lock.

That's as far as I've gotten so far.

mhenrixon commented 2 years ago

Thank you for the detailed report and the reproduction script. There are some rare situations that I haven't been able to completely solve.

I'll see if I can dig into this later this week.

cgunther commented 2 years ago

Thanks!

I ended up setting SidekiqUniqueJobs.config.lock_timeout = 1 and after a day and a half in production, I've seen drastically fewer conflicts. My hunch is that's now higher than any latency I'm seeing?

I saw a single conflict since, but it appears it resolved itself on the first retry (as opposed to waiting for the lock_ttl to expire before succeeding on retry), so it may have been a genuine conflict, as opposed to this root issue.

ezekg commented 1 year ago

@cgunther did setting lock_timeout = 1 solve your issue, or is it still happening? I did the same and am still seeing occasional jobs conflicting with themselves, but less frequently. They eventually end up in the dead queue because the retries also conflict. Currently trying lock_timeout = 5 to see if that does anything.

In my case, these are jobs that are scheduled into the future (from 1 min to 1 month, depending on client config). I want to ensure there is only 1 job queued at one time, per unique argument set. The newest job should take priority over any job that is currently enqueued, i.e. the newer job should replace the older job.

class UniqueWorker
  include Sidekiq::Worker
  include Sidekiq::Throttled::Worker

  sidekiq_throttle concurrency: { limit: 25 }
  sidekiq_retry_in { 1.minute.to_i }
  sidekiq_options queue: :critical,
    lock: :until_and_while_executing,
    on_conflict: {
      client: :replace,
      server: :raise,
    }

  def perform(resource_id)
    # ...
  end
end

The main motivation here is to prevent users from queuing up too many jobs into the future, which could be a Redis DoS vulnerability. Ideally, we have 1 job queued per argument set, and that job is guaranteed to run at least once.

If the job is running and fails due to an exception, it's expected to quickly retry, but remain unique even when in the retry queue. If a newer job is queued, it should replace the job in the retry queue with the newer job.

So far, the until_and_while_executing strategy with on_conflict: :replace has worked well for this, aside from the occasional locking issue as explained by OP and others.

These are pretty critical jobs, and we need to assert at least 1 job eventually runs.

cgunther commented 1 year ago

I don't believe it solved it outright, but I do believe it's greatly improved. I've since bumped it up to lock_timeout = 2 and I'm seeing fewer conflicts. The conflicts are so rare that I haven't really dug into whether they're genuine conflicts or this same root issue.

I rarely see them hit the dead queue, but I also set a lock_ttl on these jobs to help ensure they're not locked indefinitely, which could lead to enough failures that they hit the dead queue. I set the lock_ttl significantly higher (30 minutes) than I'd ever expect the job to actually take, to guard against a lock expiring in the middle of a job run.
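For reference, the relevant options on these workers look roughly like this (a sketch with example values, not copied verbatim from my app):

sidekiq_options lock: :until_and_while_executing,
                on_conflict: { client: :log, server: :raise },
                # Let a stuck lock expire eventually, but well after any realistic run time.
                lock_ttl: 30 * 60, # 30 minutes, in seconds
                retry: 5,
                dead: false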

In my case, I'm not dealing with scheduled jobs, I'm enqueueing jobs to be run as soon as possible.

ezekg commented 1 year ago

Thanks for the additional info. I actually think I'm going to try to switch to an until_executing locking strategy, since that seems to make more sense in my case, and perhaps it will resolve my issue here. Can't afford to lose these jobs.

@mhenrixon, I sponsored the repo on behalf of @keygen-sh. If you have time to look into this, LMK.

mhenrixon commented 1 year ago

@ezekg I'll see about adding toxiproxy and reproducing this in a test. A ton of work went into stabilizing the gem, and this is one of the few problems I never got to the absolute bottom of.

I'll make it my main priority in the next week.

mhenrixon commented 1 year ago

So I haven't managed to break anything yet. I made sure that toxiproxy waits way longer than any timeout, but my locks still succeed. Perhaps the toxic I'm using isn't the right one?

Anyway, tricky bug 🐞 to track down.

ezekg commented 1 year ago

Did you try @cgunther's reproduction case here? Fails locally for me with toxiproxy.

mhenrixon commented 1 year ago

https://github.com/mhenrixon/sidekiq-unique-jobs/pull/728/files#diff-2237c3b9fde8071dcf3d26aee337339194f7a141e90801bf3d9c06fe81e96ff8

I have no previous experience with toxiproxy, so I might be doing something wrong. If one of you could give me any pointers on replicating this in RSpec, that would be greatly appreciated.

@cgunther any idea what might be wrong with my toxiproxy usage?

cgunther commented 1 year ago

I think it's a sidekiq issue, not a toxiproxy issue.

First, I think you need sidekiq to actually perform the job inline, not just enqueue it (the default when requiring "sidekiq/testing", i.e. fake!):

Sidekiq::Testing.inline!

I believe the root bug is around job execution, not job enqueueing, hence the importance of actually performing the job. The errors I saw in my application were always when sidekiq tried to perform the job, never when sidekiq tried to enqueue the job.

Second, when using the sidekiq test harness, no server middleware is run, so you're not actually executing any of the unique locking code. You need to configure test middleware separately:

Sidekiq::Testing.server_middleware do |chain|
  chain.add SidekiqUniqueJobs::Middleware::Server
end

More info: https://github.com/mperham/sidekiq/wiki/Testing#testing-server-middleware

If I add those two bits to your before block in your PR, the test fails (as we're expecting). If I reduce the toxiproxy latency to 20 and ditch the jitter (matching my repro), the test also fails. If I drop the toxiproxy latency to 15, the test passes, which at least proves that with little-to-no latency things proceed without error.
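Put together, the two additions look something like this (a sketch; exactly where they land in the PR's setup is flexible):

before do
  # Actually perform jobs instead of only pushing them onto a fake queue.
  Sidekiq::Testing.inline!

  # The test harness skips server middleware by default, so add the gem's explicitly.
  Sidekiq::Testing.server_middleware do |chain|
    chain.add SidekiqUniqueJobs::Middleware::Server
  end
end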

Progress!

mhenrixon commented 1 year ago

Thanks a bunch for that additional information! Managed to replicate the problem now.

mhenrixon commented 1 year ago

Please check out v7.1.26 (https://github.com/mhenrixon/sidekiq-unique-jobs/releases/tag/v7.1.26); it should hopefully render this a non-problem, @ezekg @cgunther.

I believe what happened was that Redis does something silly with 0.04, so it ends up being 0 for the timeout, as it only accepts integers.
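Whether the rounding happens in Redis or somewhere on the way there (just my guess as to where), the effect is the same as:

0.04.to_i #=> 0, so the intended 40ms timeout is lost entirely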

I am not entirely done with the problem. I want to get to the bottom of it, but this fixes the problem, so I figured I'd roll it out as it is.

cgunther commented 1 year ago

I deployed v7.1.26, so will keep an eye on things and report back.

For now, I left SidekiqUniqueJobs.config.lock_timeout = 2, which probably negates the improvements you made in #728, since that change seems to effectively ensure there's a timeout for the Redis call even when no timeout is configured.

If all goes well, I'll try without the lock_timeout config being set to really exercise the changes in #728.