jenkinsci / jms-messaging-plugin

https://plugins.jenkins.io/jms-messaging
12 stars 37 forks source link

[RabbitMQ] Messages are sometimes not getting ACKed #202

Open msrb opened 4 years ago

msrb commented 4 years ago

It looks like messages are sometimes not getting acked when using RabbitMQ provider. There are suspicious error messages in Jenkins log:

Jul 24, 2020 8:26:52 AM WARNING com.redhat.jenkins.plugins.ci.messaging.RabbitMQMessagingWorker receive

JMS exception raised, going to re-subscribe for job 'RabbitMQ_Copr_Heartbeat'.
com.rabbitmq.client.AlreadyClosedException: channel is already closed due to channel error; protocol method: #method<channel.close>(reply-code=406, reply-text=PRECONDITION_FAILED - unknown delivery tag 147816, class-id=60, method-id=80)
    at com.rabbitmq.client.impl.AMQChannel.ensureIsOpen(AMQChannel.java:258)
    at com.rabbitmq.client.impl.AMQChannel.transmit(AMQChannel.java:427)
    at com.rabbitmq.client.impl.AMQChannel.transmit(AMQChannel.java:421)
    at com.rabbitmq.client.impl.recovery.RecoveryAwareChannelN.basicAck(RecoveryAwareChannelN.java:93)
    at com.rabbitmq.client.impl.recovery.AutorecoveringChannel.basicAck(AutorecoveringChannel.java:428)
    at com.redhat.jenkins.plugins.ci.messaging.RabbitMQMessagingWorker.receive(RabbitMQMessagingWorker.java:181)
    at com.redhat.jenkins.plugins.ci.threads.CITriggerThread.run(CITriggerThread.java:73)

Jul 24, 2020 8:26:52 AM WARNING com.redhat.jenkins.plugins.ci.messaging.RabbitMQMessagingWorker unsubscribe

Exception occurred when closing channel: channel is already closed due to channel error; protocol method: #method<channel.close>(reply-code=406, reply-text=PRECONDITION_FAILED - unknown delivery tag 147816, class-id=60, method-id=80)

Jul 24, 2020 8:26:52 AM INFO com.redhat.jenkins.plugins.ci.messaging.RabbitMQMessagingWorker subscribe

Subscribing job 'RabbitMQ_Copr_Heartbeat' to org.fedoraproject.prod.copr.build.start topic.

Jul 24, 2020 8:26:52 AM INFO com.redhat.jenkins.plugins.ci.messaging.RabbitMQMessagingWorker subscribe

Successfully subscribed job 'RabbitMQ_Copr_Heartbeat' to topic 'org.fedoraproject.prod.copr.build.start'.

This happens on 2 different Jenkins instances. The first is running in CentOS infrastructure, on OpenShift, and the second is deployed on AWS EKS.

Note Jenkins seems to be triggering builds on messages just fine. The problem is that messages that don't get acked will be re-delivered by broker after some time and Jenkins will thus trigger new builds again.

This problem was discovered by Fedora monitoring: https://pagure.io/fedora-ci/general/issue/125.

pypingou commented 3 years ago

Considering https://www.mariuszwojcik.com/how-to-choose-prefetch-count-value-for-rabbitmq/ , https://github.com/fedora-infra/fedora-messaging/pull/224 and the fact that the prefetch value is set to 0, I now wonder if the issue we've seen in the ticket above could not be linked to that configuration. Maybe setting the prefetch value to something other than 0 would mitigate this

jeremycline commented 3 years ago

Considering https://www.mariuszwojcik.com/how-to-choose-prefetch-count-value-for-rabbitmq/ , fedora-infra/fedora-messaging#224 and the fact that the prefetch value is set to 0, I know wonder if the issue we've seen in the ticket above could not be linked to that configuration. Maybe setting the prefetch value to something other than 0 would mitigate this

I don't see how they are related. The error is pretty clear, the channel is closed because the delivery tag (147816) is unknown. Delivery tags are how acks/nacks identify the message they are acking/nacking, are server-assigned, and are scoped to a channel. This, at first glance, looks to be a client bug. For example, if it's receiving the message on one channel and sending an ack for that message using a different channel, that would result in this error.