Closed seanh closed 3 years ago
On h-periodic prod I'm currently seeing:
celery.beat.SchedulingError: Couldn't apply scheduled task report-sync-annotations-queue-length: Exchange.declare: (406) PRECONDITION_FAILED - inequivalent arg 'durable' for exchange 'celery' in vhost 'h': received 'true' but current is 'false'
It's an exception raised by /usr/local/lib/python3.8/site-packages/amqp/channel.py.
The original exception raised by amqp/channel.py is amqp.exceptions.PreconditionFailed: Exchange.declare: (406) PRECONDITION_FAILED - inequivalent arg 'durable' for exchange 'celery' in vhost 'h': received 'true' but current is 'false'.
It seems to be happening very frequently.
It seems to be happening for all the periodic tasks (report-sync-annotations-queue-length, sync-annotations and various purge-* tasks).
I'm not seeing the exception in QA but that's to be expected: periodic tasks for h don't run on QA (only Checkmate ones do).
The report-sync-annotations-queue-length task does still seem to be working: I'm still seeing queue length stats on the job queue dashboard.
According to https://github.com/MassTransit/MassTransit/issues/370 this happens because the queue already exists as a non-durable queue (current is 'false') but h-periodic is trying to create it as a durable queue (received 'true'). That sounds right to me based on a reading of the error message.
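The mismatch described above can be illustrated with a small toy model. This is only a sketch of the broker-side equivalence check, not real AMQP client code: `ToyBroker`, `declare_exchange` and `PreconditionFailed` are hypothetical names made up for illustration, and the message format only loosely mirrors the one in the logs.

```python
class PreconditionFailed(Exception):
    """Hypothetical stand-in for the broker's 406 PRECONDITION_FAILED reply."""


class ToyBroker:
    """Toy model of a broker's exchange registry (not a real AMQP client)."""

    def __init__(self):
        # Maps exchange name -> the durable flag it was first declared with.
        self._exchanges = {}

    def declare_exchange(self, name, durable):
        current = self._exchanges.get(name)
        if current is None:
            # The first declaration creates the exchange with the given flag.
            self._exchanges[name] = durable
            return
        if current != durable:
            # Redeclaring with a different flag is rejected, not updated.
            raise PreconditionFailed(
                f"inequivalent arg 'durable' for exchange '{name}': "
                f"received '{str(durable).lower()}' "
                f"but current is '{str(current).lower()}'"
            )
        # A matching redeclaration is a no-op.


broker = ToyBroker()
# The exchange already exists on the broker as non-durable.
broker.declare_exchange("celery", durable=False)

try:
    # The client then tries to declare the same exchange as durable.
    broker.declare_exchange("celery", durable=True)
    error_message = None
except PreconditionFailed as exc:
    error_message = str(exc)

print(error_message)
```

In this model, declaring the exchange again with durable=False succeeds silently. The conflict can be resolved from either side: by changing what the client declares, or by recreating the exchange on the broker with the other setting.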
I think we probably want these queues to be durable but apparently they're currently not durable. No wait, these are periodic tasks so we actually want them to be non-durable. h-periodic is running a Celery beat process that emits an instance of each task into the Celery queue on a regular schedule. There's no need for the queue to be durable since the beat process just emits tasks regularly. In fact you wouldn't want it to be durable.
I don't see anything obvious about this in the Celery release notes from 5.0.0a1 onwards.
I don't know why h-periodic started trying to declare the non-durable "celery" exchange as durable when it apparently wasn't trying to declare it so before (or at least it wasn't triggering this error before). But the fix seems clear: we don't want this exchange to be durable, the exchange is correctly not durable in CloudAMQP, but h-periodic is trying to declare it as durable, so we just need to change h_beat.py to declare the exchange as durable=False.
The problem seems to be that the celery exchange in the Canadian RabbitMQ cluster is incorrectly set to non-durable (though no one has verified this in RabbitMQ yet). The logs from the Canadian h and h-periodic were being mixed in with the main h and h-periodic's logs. Now that we've separated the logs in Papertrail we're no longer seeing the error message in the main h or h-periodic's logs.
See: https://github.com/hypothesis/h-periodic/pull/132#issuecomment-920810821
@robertknight noticed this in the h-periodic logs:
Slack thread: https://hypothes-is.slack.com/archives/C074BUPEG/p1626098107084000