sidekiq / sidekiq

Simple, efficient background processing for Ruby
https://sidekiq.org

Sidekiq workers dropping concurrent jobs without errors #6084

Closed noahkconley closed 1 year ago

noahkconley commented 1 year ago

We're running into an issue with Sidekiq and job concurrency. We're using Sidekiq v6.4.2, running on AWS ElastiCache for Redis v7. The container runs Debian 11, and each pod has 3G of memory.

When we have JOB_CONCURRENCY set to 3 or more, and all 3 threads have picked up jobs, the worker disappears from the dashboard and the job seems to have been dropped.

This doesn't happen when JOB_CONCURRENCY is set to 2 or 1. We've tried increasing the worker's memory to far more than is needed, but the jobs still fail. At the point of failure, barely any memory has been consumed. No error messages show in our DataDog logs. Neither the container running the job nor the pod has crashed or shows any other signs of being unhealthy. The worker processes simply "disappear" from the Sidekiq Busy dashboard.

We've scoured the internet for solutions to this issue, but most of the discussions we've found are years old, and we're now at a loss as to how to continue debugging. Any assistance would be greatly appreciated.

Ruby version: 3.1.4
Rails version: 6.1.7.6
Sidekiq version: 6.4.2
Sidekiq Pro version: 5.3.1
Sidekiq Enterprise version: 2.3.1

Please include your initializer, sidekiq.yml, and any error message with the full backtrace.

config/initializers/sidekiq.rb:

require 'datadog/statsd'

def use_redis_password
  local = Rails.env.development? || (Rails.env.test? && ENV.fetch('CI', nil).nil?)
  review_app = ENV['ICC_ENV'] == 'review' && Rails.env.production?
  local || review_app
end

redis_password = use_redis_password ? ENV['REDIS_DEV_PASSWORD'] : nil
redis_config = { url: ENV['REDIS_URL'],
                 password: redis_password,
                 ssl_params: { verify_mode: OpenSSL::SSL::VERIFY_NONE } }

Sidekiq::Pro.dogstatsd = lambda do
  Datadog::Statsd.new('localhost', ENV.fetch('INSTRUMENTATION_PORT', '8125').to_i, namespace: 'sidekiq')
end

Sidekiq.configure_server do |config|
  config.redis = redis_config
  config.default_job_options = { retry: false }
  config.retain_history(30)
  config.server_middleware do |chain|
    require 'sidekiq/middleware/server/statsd'
    chain.add Sidekiq::Middleware::Server::Statsd
  end
  config.death_handlers << lambda { |job, ex|
    msg = "Sidekiq max retry of #{job['retry']} exceeded: job_id: #{job['jid']}, " \
          "class: #{job['class']}, message: #{ex.message}."
    Rails.logger.error(msg)
    Airbrake.notify(ex)
  }
end

Sidekiq.configure_client do |config|
  config.redis = redis_config
  config.default_job_options = { retry: false }
end

If you are using an old version, have you checked the changelogs to see if your issue has been fixed in a later version?

I've looked through the changelogs and don't see anything that seems relevant, but I also don't really know what I'm looking for since we have no error messages.

https://github.com/sidekiq/sidekiq/blob/main/Changes.md https://github.com/sidekiq/sidekiq/blob/main/Pro-Changes.md https://github.com/sidekiq/sidekiq/blob/main/Ent-Changes.md

mperham commented 1 year ago

Ok, I was looking for your initializer, which looks fine. "It just disappears" is hard to debug.

A process will disappear from the Busy page if its heartbeat data expires in Redis. The heartbeat data lives for 60 seconds and the heartbeat thread refreshes it every 5 seconds. Are you using any other 3rd party Sidekiq gems or plugins? Have you tried upgrading to see if the problem is fixed in a later version?
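The heartbeat data mentioned above can be inspected from Sidekiq's process-set API (`Sidekiq::ProcessSet`, whose entries expose `'beat'`, `'busy'`, `'hostname'`, and `'pid'` keys). Below is a minimal sketch that flags processes whose last beat looks stale; the 15-second threshold is an arbitrary choice for illustration, not a Sidekiq default:

```ruby
# Needs a running Redis and the sidekiq gem; guarded so the staleness
# logic below can still be loaded (and tested) without either.
begin
  require 'sidekiq/api'
rescue LoadError
  # sidekiq not installed; report_stale_processes won't work, stale_beat? will
end

# Returns true when a heartbeat timestamp (unix seconds) is older than
# max_age seconds. Pure function so it can be checked without Redis.
def stale_beat?(beat, now: Time.now.to_f, max_age: 15)
  now - beat.to_f > max_age
end

# Walk every live Sidekiq process and report stale heartbeats.
# The heartbeat refreshes every ~5s, so a beat older than ~15s is suspicious.
def report_stale_processes
  Sidekiq::ProcessSet.new.each do |process|
    next unless stale_beat?(process['beat'])
    puts "stale heartbeat: #{process['hostname']}:#{process['pid']} " \
         "(busy: #{process['busy']}, last beat: #{Time.at(process['beat'])})"
  end
end
```

Running something like this on a schedule (or just in a console while reproducing the failure) would show whether the heartbeat goes stale before the process vanishes from the Busy page.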

BTW JOB_CONCURRENCY is not something Sidekiq supports. I'm not sure what that does.

mperham commented 1 year ago

I would at least upgrade to the latest 6.5 and see if that helps.

noahkconley commented 1 year ago

Apologies for the confusion. I've discovered that JOB_CONCURRENCY is an env variable we pass to Sidekiq; it maps to the --concurrency option. Sidekiq is started in our Dockerfile as follows:

DB_STATEMENT_TIMEOUT=0 bundle exec sidekiq -v -c $JOB_CONCURRENCY -q default -q accounting -q billing_platform -q reports
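For reference, those CLI flags could equivalently live in config/sidekiq.yml, which Sidekiq loads automatically and evaluates with ERB (queue names copied from the command above; the fallback value of 3 is illustrative):

```yaml
# Equivalent of: sidekiq -c $JOB_CONCURRENCY -q default -q accounting -q billing_platform -q reports
:concurrency: <%= ENV.fetch("JOB_CONCURRENCY", 3).to_i %>
:queues:
  - default
  - accounting
  - billing_platform
  - reports
```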

We can definitely try upgrading and see if that helps.

noahkconley commented 1 year ago

We've upgraded sidekiq to 6.5.12, sidekiq-pro to 5.5.8, and sidekiq-ent to 2.5.3, but we're seeing no change in behavior: the jobs get picked up and the worker disappears. I would love to provide you with error logs and a backtrace, but the workers seem to be failing completely silently. You mentioned a "heartbeat" thread; is there a way to monitor that?

mperham commented 1 year ago

Sorry to hear that. I'm not sure how to debug it via GitHub comments. Can you reproduce it locally?

noahkconley commented 1 year ago

It seems like Sidekiq is not the issue here. We're using a gem called tiktoken_ruby, which is known to have thread-safety issues, so we're hitting deadlocks on the worker that aren't reported in any way. Using Sidekiq::Limiter in the job we're calling seems to do what we need it to do.
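For anyone landing here later, the pattern in question looks roughly like the sketch below. The Enterprise `Sidekiq::Limiter.concurrent` API caps how many threads enter the unsafe section across all processes; the `'tiktoken'` limiter name and the token-counting helper are illustrative, not from this thread. Without Enterprise, a process-local Mutex serializes the same section, though only within one process:

```ruby
# Sidekiq Enterprise: cap concurrent entry into the non-thread-safe section
# across all processes (requires sidekiq-ent; shown as a comment because it
# needs that gem and a Redis connection to run).
#
#   TIKTOKEN_LIMITER = Sidekiq::Limiter.concurrent('tiktoken', 1, wait_timeout: 5)
#   TIKTOKEN_LIMITER.within_limit { encoder.encode(text) }
#
# Plain-Ruby alternative: a Mutex gives the same serialization within one
# process, with no external dependencies.
TIKTOKEN_MUTEX = Mutex.new

def count_tokens(text)
  TIKTOKEN_MUTEX.synchronize do
    # Placeholder for the thread-unsafe tiktoken_ruby call; a real job would
    # invoke the encoder here instead of this whitespace split.
    text.split.length
  end
end
```

The trade-off: the limiter coordinates across every Sidekiq process via Redis, while the Mutex only protects threads inside a single process, which is sufficient when the unsafe library is only unsafe within one interpreter.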

mperham commented 1 year ago

Ah ok, sounds like you are seeing deadlocks, which would cause the process to silently disappear on the Busy page. Note that the recent versions of Enterprise support a new Kubernetes health check which would detect this problem (obviously only useful if you are using k8s).

https://github.com/sidekiq/sidekiq/wiki/Kubernetes#sidekiq-enterprise