stompgem / stomp

A ruby gem for sending and receiving messages from a Stomp protocol compliant message queue. Includes: failover logic, ssl support.
http://stomp.github.com
Apache License 2.0

Amazon ActiveMQ drops connection for long running listeners #163

Open austinbittinger-inmar opened 4 years ago

austinbittinger-inmar commented 4 years ago

Hi there! My team currently uses ActiveMQ as a centralized queue between all of our microservices, and we've adopted this gem to interact with ActiveMQ. I've been scratching my head for a little while, as it seems like the connection inevitably drops after 20 minutes. Here's a simplified version of what we're using:

require 'stomp'
require 'securerandom'

client = Stomp::Client.new(
  hosts: [
    login: ENV['STOMP_USERNAME'],
    passcode: ENV['STOMP_PASSCODE'],
    host: ENV['STOMP_HOST'],
    port: ENV['STOMP_PORT'],
    ssl: true
  ],
  connect_headers: {
    'client-id' => 'my-service',
    'accept-version' => '1.2',
    'host' => 'localhost'
  }
)

client.subscribe('queue/1', id: SecureRandom.uuid, ack: 'client') do |msg|
  Handler1.perform_async(msg)
  client.acknowledge(msg)
end

client.subscribe('queue/2', id: SecureRandom.uuid, ack: 'client') do |msg|
  Handler2.perform_async(msg)
  client.acknowledge(msg)
end

client.join

Running this locally, the client triggers on_miscerr and reconnects after about 10 minutes. Against Amazon MQ, the connection drops and the client never attempts a reconnect.

With a custom logger logging every transaction here, the client receives an on_receive event, and then drops the connection. Do you have any suggestions for how I could go about debugging this issue? I've tried just about every parameter on the client, and logged everything I can. If it helps, this client is running within a Kubernetes pod, pointing at a VPC-only configuration of Amazon ActiveMQ.

gmallard commented 4 years ago

I think it is unlikely that this is a gem bug. You are operating in a complex network environment, and as with all networking apps, anything can go wrong at any time.

Any chance of you showing me your logs?

Make sure your custom logger emits all the information it possibly can, including the original exception and stack trace data if possible.

In the logger, try things like:

  # Log miscellaneous errors
  # (info(parms) is assumed to be a helper defined elsewhere in the logger class)
  def on_miscerr(parms, errstr)
    begin
      @log.debug "Miscellaneous Error #{info(parms)}"
      @log.debug "Miscellaneous Error String #{errstr}"
      @log.debug "Miscellaneous Error All Parms #{parms.inspect}"
      # When an SSL exception is attached, log it with its message and backtrace
      if parms[:ssl_exception]
        @log.debug "SSL Miscellaneous Error Parms: #{parms[:ssl_exception]}"
        @log.debug "SSL Miscellaneous Error Message: #{parms[:ssl_exception].message}"
        btr = parms[:ssl_exception].backtrace.join("\n")
        @log.debug "Backtrace SME: #{btr}"
      end
    rescue
      @log.debug "Miscellaneous Error oops"
    end
  end
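
To make sure no callback goes unlogged, one option is a catch-all logger along these lines. This is only a sketch: the class name VerboseStompLogger is made up, and the callback names in CALLBACKS are the ones I believe the gem's example logger defines, so check them against the gem version you are running. Each callback accepts any number of arguments, so differing signatures should not matter.

```ruby
require 'logger'

# Hypothetical catch-all callback logger (not part of the gem).
class VerboseStompLogger
  # Assumed callback names; verify against the gem's example logger.
  CALLBACKS = %i[
    on_connecting on_connected on_connectfail on_disconnect on_miscerr
    on_publish on_subscribe on_unsubscribe on_receive
    on_ssl_connecting on_ssl_connected on_ssl_connectfail
    on_hbread_fail on_hbwrite_fail on_hbfire
  ].freeze

  def initialize(io = $stdout)
    @log = Logger.new(io)
    @log.level = Logger::DEBUG
  end

  # Define one public method per callback; each logs every argument it gets.
  CALLBACKS.each do |name|
    define_method(name) do |*args|
      begin
        @log.debug("#{name}: #{args.map { |a| describe(a) }.join(' | ')}")
      rescue StandardError => e
        # Never let a logging failure escape into the client.
        @log.error("#{name}: logging failed: #{e.class}: #{e.message}")
      end
    end
  end

  private

  # Inspect an argument; for hashes, also expand any exception values
  # into class, message, and backtrace.
  def describe(arg)
    return arg.inspect unless arg.is_a?(Hash)

    parts = [arg.inspect]
    arg.each_value do |value|
      next unless value.is_a?(Exception)

      parts << "#{value.class}: #{value.message}"
      parts << Array(value.backtrace).join("\n")
    end
    parts.join("\n")
  end
end

# Usage, assuming the client hash accepts a :logger entry:
#   client = Stomp::Client.new(hosts: [...], logger: VerboseStompLogger.new)
```

The parms hash may include connection credentials, so treat the resulting logs as sensitive before posting them.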

Do you have access to the AMQ logs? If so, do they show anything "interesting"?

Looking at your connect hash: have you tried using well-chosen values for heartbeats? That is a shot in the dark, but it might help.
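
As a sketch of what that could look like: STOMP 1.1 and later negotiate heartbeats through a "heart-beat" connect header of the form "cx,cy", where cx is the smallest interval in milliseconds at which the client can send heartbeats and cy is the interval at which it would like to receive them (0 means none). The helper name connect_hash_with_heartbeats and the 10 second values are illustrative assumptions, not recommendations. Whether an idle timeout somewhere in the network path is actually the cause here is unconfirmed.

```ruby
# Hypothetical helper: builds the connect hash from the original post,
# plus a STOMP "heart-beat" header. Interval values are placeholders.
def connect_hash_with_heartbeats(env, send_ms: 10_000, receive_ms: 10_000)
  {
    hosts: [{
      login: env['STOMP_USERNAME'],
      passcode: env['STOMP_PASSCODE'],
      host: env['STOMP_HOST'],
      port: env['STOMP_PORT'],
      ssl: true
    }],
    connect_headers: {
      'client-id' => 'my-service',
      'accept-version' => '1.2', # heartbeats need STOMP 1.1 or later
      'host' => 'localhost',
      'heart-beat' => "#{send_ms},#{receive_ms}"
    }
  }
end

# Usage (requires the stomp gem):
#   client = Stomp::Client.new(connect_hash_with_heartbeats(ENV))
```

The broker may negotiate different intervals than the ones requested, so it is worth logging the CONNECTED frame to see what was agreed.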

gmallard commented 4 years ago

I changed your code above just enough to get it running here. Started it.

Also started two producers. One sends to queue 1 every 30 seconds, the other to queue 2 every 20 seconds.

Connections to AMQ on localhost.

Right now, that has been running for about 20 hours with no failures.

I doubt that I will be able to recreate this problem.

gmallard commented 4 years ago

I cannot recreate the problem you describe.

I have had your code running for as long as 4 days, with no problems.

If you need help from me I am going to need to see all of the detail in logs from the logger.

Incidentally, there is an enhancement to the example logger the gem provides. It is on the DEV branch only at present.