onebeyond / rascal

A config driven wrapper for amqp.node supporting multi-host connections, automatic error recovery, redelivery flood protection, transparent encryption / decryption and channel pooling.
MIT License
451 stars 69 forks source link

Rascal fatal error - Timed out waiting for broker to confirm publication #217

Closed leidyjbr5 closed 1 year ago

leidyjbr5 commented 1 year ago

Hello Team,

First of all, thanks so much for this awesome library.

I wonder if you could help us with any clues about what is causing this error ->

Fatal Error: Timedout after 10000ms waiting for broker to confirm publication to: tickets, with error stack:
     Error: Timedout after 10000ms waiting for broker to confirm publication to: tickets
     at Timeout._onTimeout (/service/node_modules/rascal/lib/amqp/Publication.js:205:10)
     at listOnTimeout (internal/timers.js:554:17)
     at processTimers (internal/timers.js:497:7)

The error above starts appearing in the logs on production 1-2 times a day every 2 days.

This error had already appeared a few weeks ago and we decided to increase the acquire pool time of our database to 10000 ms and the heartbeat of the queue configuration to 20. This is an example of our queue config:

class Queue {
  constructor({ queueHost, queuePort, queueUsername, queuePassword } = {}) {
    return (async () => {
      this.broker = await Rascal.BrokerAsPromised.create(
        Rascal.withDefaultConfig({
          vhosts: {
            '/': {
              connection: {
                hostname: queueHost,
                port: queuePort,
                user: queueUsername,
                password: queuePassword,
                options: {
                  heartbeat: 20
                }, 
              },
            },
          },
        }),
      )
      return this
    })()
  }

  async publish({ exchange = '', message = '' } = {}) {
    return this.broker.publish(exchange, message)
  }

  async subscribe({ queue = '', onMessageHandler = () => {}, onErrorHandler = () => {}, logger } = {}) {
    logger.info(`Connected to ${queue}`)
    const subscription = await this.broker.subscribe(queue)
    return subscription.on('message', onMessageHandler).on('error', onErrorHandler)
  }
}

At this point we are not have any idea why this error is happening or how to reproduce it on demand. All we know is that when this starts occurring. Rascal stops sending message to RabbitMQ until we restart the container.

We are using version 9.1.3 of Rascal to connect to RabbitMQ 3.8.

Regards, Leidy

cressie176 commented 1 year ago

Hi @leidyjbr5,

Thank you for taking the time to explain the problem so clearly. The error you are seeing is because Rascal is publishing a message to the RabbitMQ broker using a confirm channel, but not receiving an acknowledgement from the broker that the message was safely received within 10 seconds. Normally the acknowledgement should arrive almost immediately.

The code which sets the timeout can be seen here. The timeout is cleared after the underlying publish function executes its callback here. There's nothing in Rascal which stops it publishing subsequent messages, however rascals publish function will yield an error so this could interfere with your application if not handled.

So the question is, why is your application not receiving an acknowledgement from the broker?

I have only once seen a RabbitMQ broker not send acknowledgements and we never got to the bottom of it. It happened reproducibly with one node in a three node cluster in our staging environment, but not the other two, or in our production cluster. We were able to reproduce just using amqplib (i.e. without Rascal), and we used WireShark to confirm that no acknowledgement was sent. If you determine this is actually what is happening, then my advice would be to upgrade your cluster to the latest version of RabbitMQ, as they are less likely to help with a two year old version of the broker.

However, I suspect there is a more likely reason in your case. It is possible that the broker is sending an acknowledgement, but that your application's event loop has become saturated with too much work, and is unable to process the acknowledgement within ten seconds. This would also explain why you needed to increase the heartbeat and database acquisition timeout, and why rascal stops publishing messages. This phenomenon is called event loop lag and you can read more about it here. There are multiple ways to monitor your application for event loop lag, so I suggest if you are not already doing this, it should be your first step.

Best,

Steve

leidyjbr5 commented 1 year ago

Hi @cressie176, thank you so much for your answer. We are gonna upgrade to the last version of RabbitMQ and review the info about the event loop lag.

Thanks!!

cressie176 commented 1 year ago

You're welcome. I'll close but feel free to share what you find and ask if you need more help