sidekiq / sidekiq

Simple, efficient background processing for Ruby
https://sidekiq.org

Massive number of Redis lmove operations after upgrading to Sidekiq 7.2 #6394

Closed · pyueli closed this issue 1 month ago

pyueli commented 1 month ago

Ruby version: 3.2
Rails version: 7.1
Sidekiq / Pro / Enterprise version(s): Sidekiq Pro 7.2.0

We see a massive number of Redis LMOVE operations after upgrading to Sidekiq 7.2. One of our services makes 30K LMOVE requests per second to Redis, while the Sidekiq job_fetch rate is only 460 requests per second. The Redis DB is used exclusively by Sidekiq.

We understand that some of the LMOVE operations are valid, since they are triggered by the retrieve_work method of super_fetch, but that cannot explain the huge gap between the LMOVE and job_fetch counts. Any idea what could be the root cause of the massive number of LMOVE operations?

Thx!

mperham commented 1 month ago

Can you give me a sampling of the actual commands? Seeing the key names should tell us the root cause.

pyueli commented 1 month ago

Below is a sample of the LMOVE commands, but those are triggered from job_fetch:

    LMOVE queue:xxxx... queue:sq|xxxxx... RIGHT LEFT

It seems there are massive numbers of LMOVE commands triggered by some other Sidekiq component. We are trying to find the root cause.
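For context: LMOVE src dst RIGHT LEFT atomically pops the oldest element from the tail of src and pushes it onto the head of dst; it is the modern replacement for the deprecated RPOPLPUSH. A minimal illustration with a recent redis-rb, using stand-in key names (the elided names above are not reconstructed here):

    require "redis"

    redis = Redis.new

    # Seed one fake job into a shared queue (stand-in names throughout).
    redis.lpush("queue:default", %({"class":"HardJob","args":[]}))

    # Atomically move the oldest job from the tail of the shared queue to
    # the head of a process-private working queue -- the same shape as the
    # command captured above.
    job = redis.lmove("queue:default", "queue:sq|host:pid|default", "RIGHT", "LEFT")
    puts job # => the job payload, now parked in the private list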

mperham commented 1 month ago

Yep, that’s super fetch. Your LMOVE command count will scale linearly with the number of queues you have, which is why I recommend only using a handful of queues per process. How many named queues do you have?
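To make the linear scaling concrete, here is a simplified sketch of a super-fetch-style poll loop. The real Pro implementation is closed source, so the names and the sleep interval here are illustrative assumptions, not the actual code:

    require "redis"

    # One non-blocking LMOVE per queue per fetch cycle: with N queues and
    # nothing to do, every cycle still costs N LMOVEs, so the idle LMOVE
    # rate grows linearly with the queue count.
    QUEUES = %w[queue:q1 queue:q2 queue:q3]
    PRIVATE_QUEUE = "queue:sq|myhost:1234|default" # stand-in private key

    def retrieve_work(redis)
      QUEUES.each do |q|
        job = redis.lmove(q, PRIVATE_QUEUE, "RIGHT", "LEFT")
        return job if job   # LMOVE returns nil when the source list is empty
      end
      sleep 1               # nothing anywhere; back off before re-polling
      nil
    end

    retrieve_work(Redis.new)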

pyueli commented 1 month ago

The command I pasted (LMOVE queue:xxxx... queue:sq|xxxxx... RIGHT LEFT) is from super fetch; that part is valid. The problem is the huge gap between the number of job_fetch calls and the number of LMOVE commands. There must be some other component making tons of LMOVE operations, around 30K per second. We only have three queues, and we run one process per queue.

mperham commented 1 month ago

I don’t know what “job_fetch” is.

Superfetch is the only code in Sidekiq and Pro which uses LMOVE. Enterprise rate limiting uses it too, but you aren't on Ent. Do you have any other plugins?

mperham commented 1 month ago

Superfetch does use LMOVE when recovering jobs, not just for fetching. Do you suddenly see thousands of jobs appearing in your queues?
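Recovery is worth sketching because it is a second LMOVE source that is easy to overlook. A hedged sketch, assuming orphaned jobs are drained from a dead process's private queue back into the shared queue one LMOVE at a time (the helper and key handling are made up; the Pro source is closed):

    # Sketch: drain a dead process's private queue back into the shared
    # queue, one LMOVE per orphaned job. A crashed process that had
    # thousands of parked jobs means thousands of recovery LMOVEs.
    def recover_orphans(redis, private_queue, shared_queue)
      recovered = 0
      recovered += 1 while redis.lmove(private_queue, shared_queue, "RIGHT", "LEFT")
      recovered
    end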

pyueli commented 1 month ago

job_fetch is the Datadog span name around super fetch. Below is the stack trace.

from /usr/local/bundle/ruby/3.2.0/gems/sidekiq-pro-7.2.0/lib/sidekiq/pro/super_fetch.rb:300:in `retrieve_work'
from /usr/local/bundle/ruby/3.2.0/gems/sidekiq-7.2.2/lib/sidekiq/processor.rb:87:in `get_one'
from /usr/local/bundle/ruby/3.2.0/gems/sidekiq-7.2.2/lib/sidekiq/processor.rb:99:in `fetch'
from /usr/local/bundle/ruby/3.2.0/gems/ddtrace-1.10.1/lib/datadog/tracing/contrib/sidekiq/server_internal_tracer/job_fetch.rb:26:in `block in fetch'
from /usr/local/bundle/ruby/3.2.0/gems/ddtrace-1.10.1/lib/datadog/tracing/trace_operation.rb:192:in `block in measure'
from /usr/local/bundle/ruby/3.2.0/gems/ddtrace-1.10.1/lib/datadog/tracing/span_operation.rb:150:in `measure'
from /usr/local/bundle/ruby/3.2.0/gems/ddtrace-1.10.1/lib/datadog/tracing/trace_operation.rb:192:in `measure'
from /usr/local/bundle/ruby/3.2.0/gems/ddtrace-1.10.1/lib/datadog/tracing/tracer.rb:380:in `start_span'
from /usr/local/bundle/ruby/3.2.0/gems/ddtrace-1.10.1/lib/datadog/tracing/tracer.rb:160:in `block in trace'
from /usr/local/bundle/ruby/3.2.0/gems/ddtrace-1.10.1/lib/datadog/tracing/context.rb:43:in `activate!'
from /usr/local/bundle/ruby/3.2.0/gems/ddtrace-1.10.1/lib/datadog/tracing/tracer.rb:159:in `trace'
from /usr/local/bundle/ruby/3.2.0/gems/ddtrace-1.10.1/lib/datadog/tracing.rb:16:in `trace'
from /usr/local/bundle/ruby/3.2.0/gems/ddtrace-1.10.1/lib/datadog/tracing/contrib/sidekiq/server_internal_tracer/job_fetch.rb:13:in `fetch'
from /usr/local/bundle/ruby/3.2.0/gems/sidekiq-7.2.2/lib/sidekiq/processor.rb:81:in `process_one'
from /usr/local/bundle/ruby/3.2.0/gems/sidekiq-7.2.2/lib/sidekiq/processor.rb:72:in `run'
from /usr/local/bundle/ruby/3.2.0/gems/sidekiq-7.2.2/lib/sidekiq/component.rb:10:in `watchdog'
from /usr/local/bundle/ruby/3.2.0/gems/sidekiq-7.2.2/lib/sidekiq/component.rb:19:in `block in safe_thread'

We don't see thousands of jobs appearing suddenly. The high number of LMOVE operations started right after we upgraded to Sidekiq 7.2. We are using both Pro and Enterprise, but the service doesn't use rate limiting.
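A plausible reading of the trace above is that ddtrace simply wraps Processor#fetch in a span, so the job_fetch span count tracks fetch cycles, not jobs fetched. A hedged sketch of that shape (the module and span name are illustrative, not the exact ddtrace source):

    # Sketch: wrap Sidekiq::Processor#fetch in a Datadog span via prepend.
    # Every fetch attempt, including ones that return no job, opens one
    # span -- so 460 job_fetch spans/sec means 460 fetch cycles/sec.
    module JobFetchTracer
      def fetch
        Datadog::Tracing.trace("sidekiq.job_fetch") do
          super
        end
      end
    end

    Sidekiq::Processor.prepend(JobFetchTracer)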

pyueli commented 1 month ago

Just to confirm: there are only three places in Sidekiq that call LMOVE?

mperham commented 1 month ago

Correct. Only Ent's concurrent limiter uses LMOVE.

mperham commented 1 month ago

I believe 7.2.0 is when I dropped support for earlier Redis versions and migrated from RPOPLPUSH to LMOVE. Why aren't you using 7.2.4? Maybe this was already fixed.

EDIT: Sorry, I realized you were talking about Pro 7.2.0, I assume with Sidekiq 7.2.4.
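For reference, the migration is mechanical: RPOPLPUSH src dst behaves identically to LMOVE src dst RIGHT LEFT, and RPOPLPUSH has been deprecated since Redis 6.2. In redis-rb terms, with stand-in key names:

    # These two calls are equivalent; the first is deprecated in Redis 6.2+.
    redis.rpoplpush("queue:default", "queue:sq|host:pid|default")
    redis.lmove("queue:default", "queue:sq|host:pid|default", "RIGHT", "LEFT")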

pyueli commented 1 month ago

One question regarding one of your previous comments:

Your LMOVE command count will scale linearly with the number of queues you have, which is why I recommend only using a handful of queues per process. How many named queues do you have?

Let's say my process has [q1, q2, q3]. Will Sidekiq keep fetching jobs from the queues without sleeping, even if the queues are empty? If so, [q1, q2, q3, q4, q5] should produce the same number of LMOVE commands as [q1, q2, q3], since Sidekiq keeps running LMOVE continuously to fetch jobs. Why does the number of queues matter?
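Under the polling model sketched earlier, an idle process issues one non-blocking LMOVE per queue per cycle and then sleeps, so the LMOVE rate does scale with queue count rather than staying constant. A toy calculation (the one-cycle-per-second idle rate is an assumption, not a measured Sidekiq figure):

    # Idle LMOVE rate if each fetch cycle probes every queue once.
    cycles_per_sec = 1.0 # assumed idle poll rate per process
    [3, 5].each do |queue_count|
      rate = (queue_count * cycles_per_sec).round
      puts "#{queue_count} queues => ~#{rate} LMOVEs/sec per process"
    end
    # 3 queues => ~3 LMOVEs/sec per process
    # 5 queues => ~5 LMOVEs/sec per process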

Also, did Sidekiq 7 make improvements that make super fetch faster compared with Sidekiq 6?

pyueli commented 1 month ago

Closing this issue. It turned out to be an issue in our service.