bug: consumer timeout on infrahub.rpcs queue prevents "large" Infrahub tasks from being completed

wvandeun commented 2 months ago

Component

API Server / GraphQL

Infrahub version

0.16.0.dev0

Current Behavior

When an Infrahub generator takes longer than 30s to execute, the message on the infrahub.rpcs queue is not acknowledged within the defined 30s consumer timeout.

2024-09-10 07:20:23.373834+00:00 [warning] <0.1240.0> Consumer 'ctag2.eabde5b3b02242c6a9f3274f00637d1a' on channel 2 and queue 'infrahub.rpcs' in vhost '/' has timed out waiting for a consumer acknowledgement of a delivery with delivery tag = 328. Timeout used: 30000 ms. This timeout value can be configured, see consumers doc guide to learn more
2024-09-10 07:20:23.374431+00:00 [error] <0.1240.0> Channel error on connection <0.1214.0> (172.18.0.6:59786 -> 172.18.0.2:5672, vhost: '/', user: 'infrahub'), channel 2:
2024-09-10 07:20:23.374431+00:00 [error] <0.1240.0> operation none caused a channel exception precondition_failed: delivery acknowledgement on channel 2 timed out. Timeout value used: 30000 ms. This timeout value can be configured, see consumers doc guide to learn more
2024-09-10 07:21:13.354275+00:00 [warning] <0.1210.0> Consumer 'ctag2.c220544239c646cbb699c3c25179de29' on channel 2 and queue 'infrahub.rpcs' in vhost '/' has timed out waiting for a consumer acknowledgement of a delivery with delivery tag = 568. Timeout used: 30000 ms. This timeout value can be configured, see consumers doc guide to learn more
2024-09-10 07:21:13.354566+00:00 [error] <0.1210.0> Channel error on connection <0.1184.0> (172.18.0.7:46758 -> 172.18.0.2:5672, vhost: '/', user: 'infrahub'), channel 2:
2024-09-10 07:21:13.354566+00:00 [error] <0.1210.0> operation none caused a channel exception precondition_failed: delivery acknowledgement on channel 2 timed out. Timeout value used: 30000 ms. This timeout value can be configured, see consumers doc guide to learn more
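To check how often this happens and on which queue, the warning lines can be extracted from the broker log. The following is a minimal sketch, assuming the log lines keep the exact wording shown above; the function name `parse_ack_timeouts` and the regular expression are illustrative and not part of Infrahub or RabbitMQ.

```python
import re

# Matches the RabbitMQ warning logged when a delivery is not acknowledged in time.
# The wording is copied from the log lines in this issue; other broker versions may differ.
TIMEOUT_RE = re.compile(
    r"Consumer '(?P<consumer>[^']+)' on channel (?P<channel>\d+) and queue "
    r"'(?P<queue>[^']+)' .*? delivery tag = (?P<tag>\d+)\. "
    r"Timeout used: (?P<timeout_ms>\d+) ms"
)


def parse_ack_timeouts(log_text):
    """Return one dict per consumer-acknowledgement timeout warning in log_text."""
    events = []
    for match in TIMEOUT_RE.finditer(log_text):
        events.append(
            {
                "consumer": match["consumer"],
                "queue": match["queue"],
                "delivery_tag": int(match["tag"]),
                "timeout_ms": int(match["timeout_ms"]),
            }
        )
    return events
```

Feeding it the output of the message-queue container's logs gives a quick count of affected deliveries per queue.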

The message will then be picked up by another worker, which will likely end in the same result. Eventually the generator is reported as completed, even though it might not have finished completely.

The same situation seems to happen when the synchronization of an external git repository takes longer than 30s.

Expected Behavior

The consumer timeout should be increased for generators and the repository sync process, or it should be a configurable setting.

Generators or repository synchronization might take longer than 30s to complete.
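As an illustration of the "configurable setting" option, here is a minimal sketch of how the queue arguments could be derived from the environment. The variable name `INFRAHUB_RPC_CONSUMER_TIMEOUT_MS` and the function `rpc_queue_arguments` are hypothetical and do not describe how Infrahub actually declares the queue; the `x-max-priority` value of 5 is taken from the workaround posted later in this thread, and the 30000 ms default mirrors the timeout reported in the broker logs.

```python
import os

# Default mirrors the "Timeout used: 30000 ms" value reported in the broker logs.
DEFAULT_CONSUMER_TIMEOUT_MS = 30_000


def rpc_queue_arguments(env=None):
    """Build queue arguments for infrahub.rpcs with an overridable consumer timeout."""
    env = os.environ if env is None else env
    # Hypothetical variable name, used here for illustration only.
    raw = env.get("INFRAHUB_RPC_CONSUMER_TIMEOUT_MS")
    timeout_ms = int(raw) if raw else DEFAULT_CONSUMER_TIMEOUT_MS
    if timeout_ms <= 0:
        raise ValueError("consumer timeout must be a positive number of milliseconds")
    return {"x-max-priority": 5, "x-consumer-timeout": timeout_ms}
```

The resulting dictionary is what would be passed as the `arguments` of the queue declaration, so long-running generators could be accommodated without code changes.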

Steps to Reproduce

Additional Information

No response

wvandeun commented 1 month ago

There is a workaround: redeclare the rpcs queue manually. First, stop the server and git-worker containers:

docker stop infrahub-infrahub-server-1 infrahub-infrahub-git-1 infrahub-infrahub-git-2

Then delete the existing rpcs queue and recreate it with a larger x-consumer-timeout value:

docker exec -it infrahub-message-queue-1 rabbitmqadmin --username infrahub --password infrahub delete queue name=infrahub.rpcs
docker exec -it infrahub-message-queue-1 rabbitmqadmin --username infrahub --password infrahub declare queue name=infrahub.rpcs durable=true arguments='{"x-max-priority": 5, "x-consumer-timeout": 120000}'

Finally, start the server and git-worker containers again:

docker start infrahub-infrahub-server-1 infrahub-infrahub-git-1 infrahub-infrahub-git-2
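If 120000 ms is still too short for a given generator, the same two rabbitmqadmin commands can be generated for another value. This is a rough sketch that only builds and prints the command strings; the helper name `redeclare_rpcs_commands` is made up for this example, the container name and credentials are copied from the commands above, and the 300000 ms value is an arbitrary example.

```python
import json
import shlex


def redeclare_rpcs_commands(timeout_ms, container="infrahub-message-queue-1"):
    """Return the delete and declare commands from the workaround as shell strings."""
    base = [
        "docker", "exec", "-it", container, "rabbitmqadmin",
        "--username", "infrahub", "--password", "infrahub",
    ]
    # Same queue arguments as in the workaround, with a caller-chosen timeout.
    arguments = json.dumps({"x-max-priority": 5, "x-consumer-timeout": timeout_ms})
    delete = base + ["delete", "queue", "name=infrahub.rpcs"]
    declare = base + [
        "declare", "queue", "name=infrahub.rpcs", "durable=true",
        f"arguments={arguments}",
    ]
    # shlex.join quotes the JSON argument so the strings can be pasted into a shell.
    return [shlex.join(delete), shlex.join(declare)]


# Example: print the commands for a 5-minute timeout (arbitrary example value).
for command in redeclare_rpcs_commands(300_000):
    print(command)
```

The containers still need to be stopped before and started after running the printed commands, as described in the steps above.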

ogenstad commented 1 month ago

Fixed in #4384.
