linagora / james-project

Mirror of Apache James Project
Apache License 2.0

RabbitMQ MailQueue does not survive restart #3485

Closed chibenwa closed 4 years ago

chibenwa commented 4 years ago

GITLAB-1802

Why

Incident thread: https://chat.linagora.com/openpaas/pl/fg4rz381ufgwtggjfs9qm4c19y

Over 37k messages are stuck in RMQ

Image_Pasted_at_2020-6-19_15-25.png.jpg

It seems the queue is not being dequeued. Is any node connected to this queue?

Indeed, there is no consumer.

Image_Pasted_at_2020-6-19_15-31.png.jpg

Restarting the node solves the issue

There is a consumer now

Image_Pasted_at_2020-6-19_15-36.png.jpg

Image_Pasted_at_2020-6-19_16-11.png.jpg

How?

https://chat.linagora.com/openpaas/pl/qesjymi5dfb9uby8nmda6jct7e

It seems that we need resilient channels, @Benoît TELLIER

Definition of done

A full RabbitMQ restart should not threaten MailQueue delivery. James is resilient to RabbitMQ errors.

blackheaven commented 4 years ago

JIRA:

chibenwa commented 4 years ago

https://github.com/linagora/james-project/pull/3532

I actually found out that we do not nack the message when we fail to load a mail.

When many such errors occur, the maximum number of unacknowledged messages is reached, and dequeuing stops altogether.
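To illustrate why missing nacks end up blocking the queue, here is a toy model of a prefetch window. It is only a sketch and does not use the RabbitMQ client API or any James code: the class `PrefetchWindowDemo`, the method `deliver`, and the prefetch value of 5 are all hypothetical names and numbers chosen for the illustration. It assumes a single consumer and a broker that stops pushing messages once the number of unsettled deliveries reaches the prefetch limit.

```java
import java.util.function.Predicate;

/** Toy model of a broker-side prefetch window (hypothetical, not the RabbitMQ client API). */
class PrefetchWindowDemo {

    /**
     * Simulates pushing `queued` messages to one consumer.
     * A delivery stays "unsettled" until it is acked or nacked.
     * Returns how many messages were delivered before the window filled up.
     */
    static int deliver(int queued, int prefetch, Predicate<Integer> loadsFine, boolean nackOnFailure) {
        int unsettled = 0;
        int delivered = 0;
        for (int id = 0; id < queued; id++) {
            if (unsettled >= prefetch) {
                break; // window full: the broker stops pushing to this consumer
            }
            delivered++;
            unsettled++;
            if (loadsFine.test(id)) {
                unsettled--; // mail loaded: ack frees a slot
            } else if (nackOnFailure) {
                unsettled--; // load failed: nack also frees a slot
            }
            // otherwise the failure is swallowed and the slot is never released
        }
        return delivered;
    }

    public static void main(String[] args) {
        // Assumption for the demo: every mail with an even id fails to load.
        Predicate<Integer> oddIdsLoad = id -> id % 2 != 0;
        System.out.println("without nack: " + deliver(100, 5, oddIdsLoad, false));
        System.out.println("with nack:    " + deliver(100, 5, oddIdsLoad, true));
    }
}
```

With these inputs, the run without nack stops after 9 deliveries, once 5 failed loads have filled the window, while the run with nack goes through all 100 messages. Whether a failed mail should be requeued or dropped on nack is a separate decision that this model does not cover.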

We need to:

See https://github.com/linagora/james-project/pull/3532

chibenwa commented 4 years ago

Related issue: https://github.com/linagora/james-project/pull/3534

JAMES-3291 Badly formatted mailqueue causes RabbitMQMailQueue to crash

No clear solution yet...