osism / issues

This repository is used for bug reports that are cross-project or not bound to a specific repository (or to an unknown repository).
https://www.osism.tech

use quorum queues or streams for fanout/transient queues upstream #1110

Open artificial-intelligence opened 2 months ago

artificial-intelligence commented 2 months ago

There are currently several competing approaches upstream on what to do, so let me try to document the current state here:

There is this bug: https://bugs.launchpad.net/kolla-ansible/+bug/2077448

There are different patchsets floating around. We basically have these mid-term options:

move everything to streams

move everything to quorum queues

make it configurable by the user

make only some of it configurable (e.g. heat seems to need it and I bet if we take a closer look, more services actually will lose messages without it)
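To make the first two options concrete, here is a minimal sketch of the `[oslo_messaging_rabbit]` settings each one would roughly translate to in a service config. The option names (`rabbit_quorum_queue`, `rabbit_transient_quorum_queue`, `rabbit_stream_fanout`, `use_queue_manager`) are the ones I understand oslo.messaging to provide; please verify them against the oslo.messaging release you deploy. The `MODES` mapping and the `render_rabbit_section` helper are hypothetical and not taken from any of the patchsets.

```python
import configparser
import io

# Assumed oslo.messaging option names; check your release's config reference.
# The mode names ("quorum", "streams") are made up for this illustration.
MODES = {
    # transient (reply/fanout) queues become quorum queues
    "quorum": {
        "rabbit_quorum_queue": "true",
        "rabbit_transient_quorum_queue": "true",
    },
    # fanout goes through stream queues instead
    "streams": {
        "rabbit_quorum_queue": "true",
        "rabbit_stream_fanout": "true",
    },
}


def render_rabbit_section(mode, use_queue_manager=True):
    """Return an ini snippet with the [oslo_messaging_rabbit] section."""
    if mode not in MODES:
        raise ValueError(f"unknown mode: {mode!r}")
    parser = configparser.ConfigParser()
    section = dict(MODES[mode])
    # The queue manager option is what makes queue naming consistent.
    section["use_queue_manager"] = str(use_queue_manager).lower()
    parser["oslo_messaging_rabbit"] = section
    buf = io.StringIO()
    parser.write(buf)
    return buf.getvalue()


if __name__ == "__main__":
    print(render_rabbit_section("quorum"))
```

The "configurable" options would then amount to exposing the mode (globally or per service) instead of hard-coding it.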

We currently have these patchsets:

https://review.opendev.org/c/openstack/kolla-ansible/+/927497 (by myself) uses quorum queues for all transient/fanout queues, passes basic CI, always on if quorum queues are configured

https://review.opendev.org/c/openstack/kolla-ansible/+/916911 (by mnasiadka) uses stream queues instead, always on if quorum queues are configured, currently doesn't pass CI

https://review.opendev.org/c/openstack/kolla-ansible/+/924615 (by kevko) only enables quorum queues for transient heat queues, it's always on, passes CI

https://review.opendev.org/c/openstack/kolla-ansible/+/924623 (by kevko) adds queue manager option to all services, which basically makes queue naming consistent, is depended on by the first and third patch, passes CI

This is also documented for the next two kolla upstream meetings here:

https://etherpad.opendev.org/p/KollaWhiteBoard#L72

But the Whiteboard upstream will get cleaned up so I wanted to have something more persistent to document the current state of the work and what decisions we need to make.

I'm not sure yet whether quorum queues or streams are the better choice; I need to research this topic a bit. In either case I think we want to use it for all queues, and maybe not even make it possible for users to disable this: as far as I know, in some failure scenarios we currently lose messages from OpenStack services, which makes the system as a whole less reliable than it could be.

artificial-intelligence commented 2 months ago

there's also this related patch:

https://review.opendev.org/c/openstack/kolla-ansible/+/907977 (by SvenKieske) only makes sure the precheck for all RabbitMQ queues also checks transient and fanout queues; it currently depends on patchset 927497 by SvenKieske
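For illustration only, a precheck along these lines could be sketched as below. This is not the code from the patch. It assumes the input is the JSON that `rabbitmqctl list_queues name type --formatter json` prints (a list of objects with `name` and `type` keys), and `find_classic_queues` is a hypothetical helper name.

```python
import json


def find_classic_queues(list_queues_json):
    """Return the sorted names of all queues that are still classic queues.

    Assumes input shaped like the output of
    `rabbitmqctl list_queues name type --formatter json`.
    """
    queues = json.loads(list_queues_json)
    return sorted(q["name"] for q in queues if q.get("type") == "classic")


if __name__ == "__main__":
    # Made-up sample data; the queue names are illustrative only.
    sample = json.dumps([
        {"name": "reply_abc", "type": "classic"},
        {"name": "conductor", "type": "quorum"},
        {"name": "scheduler_fanout_1", "type": "stream"},
    ])
    leftovers = find_classic_queues(sample)
    if leftovers:
        print("classic queues found:", ", ".join(leftovers))
```

A precheck that ignores transient and fanout queues would miss `reply_abc` in the sample above, which is the gap the patch is meant to close.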