ixti / sidekiq-throttled

Concurrency and rate-limit throttling for Sidekiq
MIT License

Clarification on Expected Behaviour #188

Open gregfletch opened 3 months ago

gregfletch commented 3 months ago

Hopefully a quick question to clarify expected behaviour. We are seeing jobs waiting in the Enqueued state due to the configured rate-limit throttling (expected). However, there may be other jobs that are not throttled but are stuck behind the throttled job(s) and are not getting picked up by the fetcher. Is this the expected behaviour (i.e. jobs that could theoretically run are stuck waiting until the throttled jobs ahead of them are processed)? Or is it expected that the fetcher would go through the queue and pull out the first job that is allowed to run? If it is the former, is there a recommendation for how best to work around this (i.e. so Sidekiq can more fully utilize its resources while throttled jobs are paused)?

Thanks!

simonc commented 3 months ago

Hi there,

We just experienced a partial outage of our service due to the behavior you mention. A single user generated a huge number of jobs and was throttled. The jobs for other users were not processed, despite available thread capacity.

I would indeed expect throttled jobs to wait while other jobs are processed 🤔

A clarification would indeed be very welcome haha 😅

Thanks! ❤️

mnovelo commented 3 months ago

Are you seeing this behavior for jobs on the same queue or are jobs in lower priority queues also waiting? If the latter, could you share your queue configuration?

simonc commented 3 months ago

It's on the same queue with default priority. It's the only queue the worker deals with.

mnovelo commented 3 months ago

Currently the behavior is to add the throttled job back to the queue without any changes, so its enqueued_at value stays the same. Sidekiq then prioritizes jobs in the same queue by their enqueued_at value, which could lead to throttled jobs continuously being chosen over unthrottled ones. There's been discussion about having different behaviors for jobs that are being throttled, but I'm not sure where that work stands: https://github.com/ixti/sidekiq-throttled/pull/150

bdegomme commented 2 months ago

See https://github.com/ixti/sidekiq-throttled/issues/52#issuecomment-412911238: when a job is throttled, it's put back at the end of the queue. The issue with other jobs not running is due to the cooldown:

Sidekiq::Throttled.configure do |config|
  # Period in seconds to exclude queue from polling in case it returned
  # {config.cooldown_threshold} amount of throttled jobs in a row. Set
  # this value to `nil` to disable cooldown manager completely.
  # Default: 2.0
  config.cooldown_period = 2.0

  # Exclude queue from polling after it returned given amount of throttled
  # jobs in a row.
  # Default: 1 (cooldown after first throttled job)
  config.cooldown_threshold = 1
end

If a queue contains a hundred jobs in a row that will be throttled, the cooldown will kick in a hundred times in a row, meaning it will take 200 seconds (100 cooldowns × the default 2-second cooldown_period) before all those jobs are put back at the end of the queue and you actually start processing other jobs...

It's tempting to set config.cooldown_period = nil, but the cooldown is there to avoid overloading Redis when a queue contains only throttled jobs, so that's probably not a good idea. Personally I set config.cooldown_period = 1.0 and config.cooldown_threshold = 100, and it resolved all practical issues (because it's much faster at moving a batch of throttled jobs to the end of the queue, at the price of more load on Redis).
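
For reference, that would look roughly like this in the initializer (these are the values I mention above, not recommended defaults; tune them against your own Redis capacity):

Sidekiq::Throttled.configure do |config|
  # Pause polling of the queue for 1 second...
  config.cooldown_period = 1.0

  # ...but only after 100 throttled jobs have been fetched in a row,
  # instead of after every single one (the default threshold of 1).
  config.cooldown_threshold = 100
end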

simonc commented 2 months ago

Hi @bdegomme, thanks for clarifying this. So if I understand it correctly, setting the threshold to 100 and the cooldown to 1.0 would mean that it would exclude the queue from polling for 1s after 100 jobs have been throttled in a row?


Just to make sure, here is my personal situation:

I run a PDF generation service. Each PDF generation is a job in Sidekiq. Sometimes we have one client that will flood us with thousands of documents to generate in a very short period of time. Without throttling, this means that other customers will wait a very long time before their documents get generated.

I thought of sidekiq-throttled to solve this, thinking I would throttle generations keyed by customer id, thus preventing a single customer from taking over the entire queue for too long. We can handle 30 generations concurrently, so I throttled at 15 concurrent jobs per customer so that this specific customer can only take half of the capacity.
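
Roughly, the throttle looked like this (class and argument names here are simplified placeholders, not our exact code):

class GeneratePdfJob
  include Sidekiq::Job
  include Sidekiq::Throttled::Job

  # At most 15 concurrent generations per customer, so a single customer
  # can only occupy half of our 30 worker threads.
  sidekiq_throttle(
    concurrency: {
      limit:      15,
      key_suffix: ->(customer_id, *) { customer_id }
    }
  )

  def perform(customer_id, document_id)
    # ... generate the PDF ...
  end
end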

That's when shit hit the fan as it did throttle his generations but still stopped processing other jobs.

From what you're saying, this was due to the cooldown kicking in and preventing the queue from being polled because too many jobs got throttled, thus preventing other customers' generations from being handled.

Is that correct?

(side note: we ended up creating a dedicated queue for this customer so they could not take over the entire queue, but I'm still interested in understanding the issue)

bdegomme commented 2 months ago

Yes, that's right. Maybe for you it makes sense to increase cooldown_threshold even more, but make sure to do it gradually so you know your Redis server can handle it.

simonc commented 2 months ago

Alright! Thanks a lot for taking the time to explain it 🙏 😊

jonatasrancan commented 2 months ago

Hi there,

I had the same problem. We apply throttling to some specific jobs, but sometimes the number of them enqueued is very big: 150k jobs enqueued while the throttle allows only 25k per hour, and that was causing the non-throttled jobs to take forever to run.

I tried changing cooldown_threshold to a bigger number but it didn't make any difference. Do I need to put this code in any specific place? I just put it in my Sidekiq initializer file:

Sidekiq::Throttled.configure do |config|
  # Period in seconds to exclude queue from polling in case it returned
  # {config.cooldown_threshold} amount of throttled jobs in a row. Set
  # this value to `nil` to disable cooldown manager completely.
  # Default: 2.0
  config.cooldown_period = ENV.fetch("SIDEKIQ_COOLDOWN_PERIOD", 2.0).to_f

  # Exclude queue from polling after it returned given amount of throttled
  # jobs in a row.
  # Default: 1 (cooldown after first throttled job)
  config.cooldown_threshold = ENV.fetch("SIDEKIQ_COOLDOWN_THRESHOLD", 1).to_i
end

For now I can predict, before enqueuing a job, whether it will be throttled or not, so I just send those jobs to another queue so they don't cause problems for the others.
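
Something along these lines at enqueue time (the job and helper names are just placeholders for our real code):

# Jobs we expect to hit the rate limit go to a dedicated queue so they
# can't sit in front of everything else in the default queue.
# (The extra queue also has to be listed in the Sidekiq queue config.)
if likely_to_be_throttled?(account_id)
  ImportJob.set(queue: "throttled").perform_async(account_id)
else
  ImportJob.perform_async(account_id)
end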

bdegomme commented 1 month ago

Yes, you should put this in the Sidekiq initializer. What did you set SIDEKIQ_COOLDOWN_PERIOD and SIDEKIQ_COOLDOWN_THRESHOLD to? Have you tried 1 and 5000 respectively? With a backlog like your 150k throttled jobs, that works out to about 30 cooldowns of 1 second each, so it should take around 30 seconds for non-throttled jobs to start being processed. You may also consider setting cooldown_period to nil to disable the cooldown manager and see if things work.

jonatasrancan commented 1 month ago

I tried 2 seconds and 10000 for the threshold, but even that way I couldn't see a difference.