jondot / sneakers

A fast background processing framework for Ruby and RabbitMQ
https://github.com/jondot/sneakers
MIT License

Connectivity issues on heroku to cloudamqp #291

Closed jgrau closed 7 years ago

jgrau commented 7 years ago

First of all thanks for an awesome gem - we use it extensively on our platform.

We are hosted on Heroku and use CloudAMQP. I am trying to debug an issue we see on every deploy/restart of our workers: the workers fail to connect for up to 20 minutes and then suddenly connect correctly. I have been working with CloudAMQP and Heroku support on this and received the following thorough response from Heroku:

Hi Jonas,

Networking is available immediately at Dyno boot, including DNS resolution. I can reproduce this issue consistently: your app doesn't connect to CloudAMQP for the first ~15-20 minutes of its execution. Afterwards, it appears to connect, or at least the error messages stop spewing to your log stream every ~2 seconds.

My gut feeling is that something else is at play with either Sneakers or Bunny, but the extent of my research so far has not turned anything up. This ticket is remaining open while I discuss it with colleagues but I did want to acknowledge that we see the issue as well, and that I don't have any solution. Nothing at the platform level is preventing this connection from taking place, but I'll see what I can do to determine why these connection errors are occurring.

I did look at what I expected to be standard configuration files (config/initializers/sneakers.rb) but didn't see anything obvious to do with host connections as opposed to message / queue specific configuration. Is there a connection timeout or other related option you can increase to see if the connection attempt is just over-eager?

Thanks! Jason

Jonas,

I ran some further tests with an engineer and we're confident this is a library issue of some sort. While the rake sneakers:run task continues to fail, tools that establish a host-level connection with the destination host connect without an issue. This may point to an overly stringent timeout, authentication issues, or something else of the sort:

$ nc -w5 -vz hare.rmq.cloudamqp.com 5672
Connection to hare.rmq.cloudamqp.com 5672 port [tcp/amqp] succeeded!
Via netcat (nc), we can make a TCP connection to CloudAMQP's endpoint without issue.

Immediately after, or even during this attempt, your application's connection process fails:

$ bundle exec rake sneakers:run
2017-05-26T17:33:44Z p-2158 t-os66erfyg WARN: Loading runner configuration...
2017-05-26T17:33:44Z p-2158 t-os66erfyg WARN: Loading runner configuration...
2017-05-26T17:33:44Z p-2158 t-os66erfyg WARN: Loading runner configuration...
2017-05-26T17:33:44Z p-2161 t-os66erfyg WARN: Could not establish TCP connection to hare.rmq.cloudamqp.com:5672: No route to host - connect(2) for 52.19.224.195:5672
Unexpected error Could not establish TCP connection to any of the configured hosts

As I previously suggested, please look into connection, authentication, or subscription timeouts. Something seems over-aggressive in giving up before the entire process is allowed to take place.

Let me know if you have any further questions.

Thanks! Jason

My config:

require 'sneakers'
require 'sneakers/handlers/maxretry'
require 'sneakers_sentry_reporter.rb'

# Require all our workers. This is required for sneakers
# to know which queues and bindings to set up
Dir[Rails.root.join('app', 'workers', '*.rb')].each do |file|
  require file
end

config = {
  amqp: ENV['CLOUDAMQP_URL'] || 'amqp://guest:guest@localhost:5672',
  handler: Sneakers::Handlers::Maxretry,
  daemonize: false,
  workers: 1,
  threads: 1,
  prefetch: 1,
  share_threads: true,
  connection: Rails.env.test? ? BunnyMock.new.start : nil,
  error_reporters: [SneakersSentryReporter.new],
  exchange_type: :topic
}

Sneakers.configure(config)
Sneakers.logger.level = Logger::WARN

I realize that this is probably not directly a sneakers issue so I guess it's more of a question about whether this rings any bells or whether someone knows of an aggressive timeout being set.

Thanks for a great gem!

jgrau commented 7 years ago

I think I've resolved this issue and perhaps discovered a bug(?):

Without the :connection option, sneakers creates a separate connection for every worker, each with one channel open. In my application that is about 50 workers, which opened 50 connections to rabbitmq. With a service like CloudAMQP, that would fail after about 40 connections. For example, I tried running the following in the rails console:

50.times { Bunny.new(ENV['CLOUDAMQP_URL']).start }

It would make a bunch of connections and then raise an exception.

Solution: Setting the configuration option :connection (in my case to Bunny.new(ENV['CLOUDAMQP_URL'])) instead makes sneakers open 1 connection to rabbitmq with 50 channels, which generally seems more appropriate.
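
A minimal sketch of that change, assuming the same CLOUDAMQP_URL environment variable and option values as in the config earlier in this thread. It needs the sneakers and bunny gems and a reachable broker, and the variable name shared_connection is only illustrative:

```ruby
require 'sneakers'
require 'bunny'

# One Bunny session handed to sneakers through the :connection option.
# Assumption: CLOUDAMQP_URL is set, with a local fallback as in the
# original config.
shared_connection = Bunny.new(ENV['CLOUDAMQP_URL'] || 'amqp://guest:guest@localhost:5672')

Sneakers.configure(
  connection: shared_connection, # reused by the workers instead of one connection each
  workers: 1,
  threads: 1,
  prefetch: 1,
  exchange_type: :topic
)
```

With this in place, the broker should see one connection with one channel per worker, as described above, rather than one connection per worker.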

From looking at the source code, it seems workergroup iterates over each worker, which creates a new Queue, and that queue opens a new bunny connection (unless the :connection configuration option is set).
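
The behaviour described above can be modelled with a small self-contained sketch. It does not use Sneakers or Bunny: FakeBroker, FakeConnection, FakeQueue, and start_workers are hypothetical stand-ins, and the limit of 40 is only the approximate figure observed with CloudAMQP, not a documented value.

```ruby
# Hypothetical stand-ins for illustration only; not Sneakers or Bunny code.

# Models a broker that refuses connections beyond a fixed limit.
class FakeBroker
  attr_reader :open_connections

  def initialize(max_connections)
    @max_connections = max_connections
    @open_connections = 0
  end

  def accept!
    if @open_connections >= @max_connections
      raise "connection limit (#{@max_connections}) reached"
    end
    @open_connections += 1
  end
end

# Models a client connection that can carry many channels.
class FakeConnection
  attr_reader :channels

  def initialize(broker)
    broker.accept!
    @channels = 0
  end

  def create_channel
    @channels += 1
  end
end

# Models a queue that reuses opts[:connection] when it is given
# and otherwise opens a connection of its own.
class FakeQueue
  def initialize(broker, opts)
    @broker = broker
    @opts = opts
  end

  def subscribe
    connection = @opts[:connection] || FakeConnection.new(@broker)
    connection.create_channel
    connection
  end
end

# Models a worker group iterating over its workers, one queue each.
def start_workers(broker, count, opts = {})
  count.times.map { FakeQueue.new(broker, opts).subscribe }
end

# Without :connection, 50 workers need 50 connections and hit the limit.
broker = FakeBroker.new(40)
begin
  start_workers(broker, 50)
rescue RuntimeError => e
  puts "per-worker connections: #{e.message} after #{broker.open_connections} connections"
end

# With a shared :connection, 50 workers use 1 connection and 50 channels.
broker = FakeBroker.new(40)
shared = FakeConnection.new(broker)
start_workers(broker, 50, connection: shared)
puts "shared connection: #{broker.open_connections} connection, #{shared.channels} channels"
```

In this model the only difference between the two runs is whether a connection object is passed in through the options hash, which mirrors the effect of setting :connection.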

michaelklishin commented 7 years ago

@jgrau this is not very well documented and could be made more efficient, but it can be argued that the current approach also has benefits and is more straightforward in some ways.

@jondot @gabrieljoelc WDYT about introducing a way for worker groups to use a shared connection?

michaelklishin commented 7 years ago

I updated the docs to mention both the connection option and connection sharing (or the lack of it).

michaelklishin commented 7 years ago

I could swear there was a PR that added shared connection support and sure enough, it's #266 by @mikebobrov :)