Closed leidyjbr5 closed 1 year ago
Hi @leidyjbr5,
Thank you for taking the time to explain the problem so clearly. The error you are seeing is because Rascal is publishing a message to the RabbitMQ broker using a confirm channel, but not receiving an acknowledgement from the broker that the message was safely received within 10 seconds. Normally the acknowledgement should arrive almost immediately.
The code which sets the timeout can be seen here. The timeout is cleared after the underlying publish function executes its callback here. There's nothing in Rascal which stops it publishing subsequent messages, however rascals publish function will yield an error so this could interfere with your application if not handled.
So the question is, why is your application not receiving an acknowledgement from the broker?
I have only once seen a RabbitMQ broker not send acknowledgements and we never got to the bottom of it. It happened reproducibly with one node in a three node cluster in our staging environment, but not the other two, or in our production cluster. We were able to reproduce just using amqplib (i.e. without Rascal), and we used WireShark to confirm that no acknowledgement was sent. If you determine this is actually what is happening, then my advice would be to upgrade your cluster to the latest version of RabbitMQ, as they are less likely to help with a two year old version of the broker.
However, I suspect there is a more likely reason in your case. It is possible that the broker is sending an acknowledgement, but that your application's event loop has become saturated with too much work, and is unable to process the acknowledgement within ten seconds. This would also explain why you needed to increase the heartbeat and database acquisition timeout, and why rascal stops publishing messages. This phenomenon is called event loop lag and you can read more about it here. There are multiple ways to monitor your application for event loop lag, so I suggest if you are not already doing this, it should be your first step.
Best,
Steve
Hi @cressie176, thank you so much for your answer. We are gonna upgrade to the last version of RabbitMQ and review the info about the event loop lag.
Thanks!!
You're welcome. I'll close but feel free to share what you find and ask if you need more help
Hello Team,
First of all, thanks so much for this awesome library.
I wonder if you could help us with any clues about what is causing this error ->
The error above starts appearing in the logs on production 1-2 times a day every 2 days.
This error had already appeared a few weeks ago and we decided to increase the acquire pool time of our database to 10000 ms and the heartbeat of the queue configuration to 20. This is an example of our queue config:
At this point we are not have any idea why this error is happening or how to reproduce it on demand. All we know is that when this starts occurring. Rascal stops sending message to RabbitMQ until we restart the container.
We are using version 9.1.3 of Rascal to connect to RabbitMQ 3.8.
Regards, Leidy