hypothesis / h-periodic

Periodic tasks for h
BSD 2-Clause "Simplified" License

PRECONDITION_FAILED - inequivalent arg 'durable' #118

Closed by seanh 3 years ago

seanh commented 3 years ago

@robertknight noticed this in the h-periodic logs:

Jul 12 02:08:00 h-periodic-prod_i-0dc52775870060687 eb-3394d7536e29-stdouterr.log h-beat (stderr)      | [2021-07-12 01:08:00,110: ERROR/MainProcess] Message Error: Couldn't apply scheduled task sync-annotations: Exchange.declare: (406) PRECONDITION_FAILED - inequivalent arg 'durable' for exchange 'celery' in vhost 'h': received 'true' but current is 'false'

Slack thread: https://hypothes-is.slack.com/archives/C074BUPEG/p1626098107084000

seanh commented 3 years ago

On h-periodic prod I'm currently seeing:

celery.beat.SchedulingError: Couldn't apply scheduled task report-sync-annotations-queue-length: Exchange.declare: (406) PRECONDITION_FAILED - inequivalent arg 'durable' for exchange 'celery' in vhost 'h': received 'true' but current is 'false'

It's an exception raised by /usr/local/lib/python3.8/site-packages/amqp/channel.py.

The original exception raised by amqp/channel.py is amqp.exceptions.PreconditionFailed: Exchange.declare: (406) PRECONDITION_FAILED - inequivalent arg 'durable' for exchange 'celery' in vhost 'h': received 'true' but current is 'false'.
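
For anyone grepping the logs, the broker's reply text can be split into its parts with a regular expression. This is only a rough sketch: the pattern is inferred from the message quoted above, and the helper name parse_precondition_failed is made up for illustration (it is not part of amqp or Celery).

```python
import re

# Pattern inferred from the message quoted in this thread; other broker
# versions may word the reply differently (assumption).
PRECONDITION_RE = re.compile(
    r"\((?P<code>\d+)\) PRECONDITION_FAILED - inequivalent arg '(?P<arg>[^']+)' "
    r"for (?P<kind>exchange|queue) '(?P<name>[^']+)' in vhost '(?P<vhost>[^']+)': "
    r"received '(?P<received>[^']+)' but current is '(?P<current>[^']+)'"
)


def parse_precondition_failed(message):
    """Return the fields of a PRECONDITION_FAILED reply, or None if no match."""
    match = PRECONDITION_RE.search(message)
    return match.groupdict() if match else None


line = (
    "celery.beat.SchedulingError: Couldn't apply scheduled task "
    "report-sync-annotations-queue-length: Exchange.declare: (406) "
    "PRECONDITION_FAILED - inequivalent arg 'durable' for exchange 'celery' "
    "in vhost 'h': received 'true' but current is 'false'"
)
fields = parse_precondition_failed(line)
print(fields)
```

Here "received" is what the client asked for and "current" is what the broker already has, so the fields make the direction of the mismatch explicit.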

It seems to be happening very frequently.

It seems to be happening for all the periodic tasks (report-sync-annotations-queue-length, sync-annotations and various purge-* tasks).

I'm not seeing the exception in QA but that's to be expected: periodic tasks for h don't run on QA (only Checkmate ones do).

The report-sync-annotations-queue-length task does still seem to be working: I'm still seeing queue length stats on the job queue dashboard.

seanh commented 3 years ago

According to https://github.com/MassTransit/MassTransit/issues/370 this happens because the thing being declared (in our case the "celery" exchange, going by the error message) already exists as non-durable (current is 'false') but h-periodic is trying to declare it as durable (received 'true'). That sounds right to me based on a reading of the error message.
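
To make the mechanism concrete, here is a toy model of the check the broker appears to be doing, using only the standard library. It is not RabbitMQ or amqp code: ToyBroker and its PreconditionFailed class are stand-ins invented for this sketch, and the message format is copied from the error above.

```python
class PreconditionFailed(Exception):
    """Stand-in for amqp.exceptions.PreconditionFailed (reply code 406)."""


class ToyBroker:
    """Hypothetical in-memory model of exchange declaration on one vhost."""

    def __init__(self, vhost):
        self.vhost = vhost
        self.exchanges = {}  # exchange name -> durable flag

    def exchange_declare(self, name, durable):
        current = self.exchanges.get(name)
        if current is None:
            # First declaration creates the exchange with the given flag.
            self.exchanges[name] = durable
            return
        if current != durable:
            # Redeclaring with a different 'durable' value is rejected.
            raise PreconditionFailed(
                "Exchange.declare: (406) PRECONDITION_FAILED - inequivalent arg "
                f"'durable' for exchange '{name}' in vhost '{self.vhost}': "
                f"received '{str(durable).lower()}' "
                f"but current is '{str(current).lower()}'"
            )
        # Same flag as before: redeclaring is a no-op.


broker = ToyBroker(vhost="h")
broker.exchange_declare("celery", durable=False)  # exchange exists, non-durable
broker.exchange_declare("celery", durable=False)  # identical redeclare is fine

try:
    broker.exchange_declare("celery", durable=True)  # what h-periodic sends
except PreconditionFailed as exc:
    error = str(exc)
else:
    error = None
print(error)
```

The point of the sketch is that the existing exchange is never modified by a mismatched declare; the declaring client just gets the error.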

I think we probably want these queues to be durable but apparently they're currently not durable. No wait, these are periodic tasks so we actually want them to be non-durable. h-periodic runs a Celery beat process that emits an instance of each task into the Celery queue on a regular schedule. There's no need for the queue to be durable since the beat process just emits tasks regularly. In fact you wouldn't want it to be durable.

seanh commented 3 years ago

I don't see anything obvious about this in the Celery release notes from 5.0.0a1 onwards.

seanh commented 3 years ago

I don't know why h-periodic started trying to declare the non-durable "celery" exchange as durable when it apparently wasn't doing so before (or at least it wasn't triggering this error before). But the fix seems clear. We don't want this exchange to be durable, and the exchange is correctly not durable in CloudAMQP, but h-periodic is trying to declare it as durable. So we just need to change h_beat.py to declare the exchange with durable=False.
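
As a rough idea of what that change might look like, here is a config fragment using kombu's Exchange and Queue classes. The contents of h_beat.py aren't shown in this issue, so the variable names, the "direct" exchange type and the routing key are all assumptions, not the actual h-periodic code.

```python
from kombu import Exchange, Queue

# Hypothetical names: the real h_beat.py may be structured differently.
# Declare the "celery" exchange as non-durable so that the declaration
# matches what already exists on the broker.
celery_exchange = Exchange("celery", type="direct", durable=False)

# The queue's own durability is left at kombu's default here. It is a
# separate flag and would also need to match whatever the broker has.
celery_queue = Queue("celery", exchange=celery_exchange, routing_key="celery")

# A Celery app could then pick this up through the task_queues setting:
#   app.conf.task_queues = [celery_queue]
```

Whichever side gets changed, the declared flag and the broker's existing flag have to agree, so it would be worth checking the exchange in the RabbitMQ management UI first.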

seanh commented 3 years ago

The problem seems to be that the celery exchange in the Canadian RabbitMQ cluster is incorrectly set to non-durable (though no one has verified this in RabbitMQ yet). The logs from the Canadian h and h-periodic were being mixed in with the main h and h-periodic's logs. Now that we've separated the logs in Papertrail we're no longer seeing the error message in the main h or h-periodic's logs.

See: https://github.com/hypothesis/h-periodic/pull/132#issuecomment-920810821