rabbitmq / cluster-operator

RabbitMQ Cluster Kubernetes Operator
https://www.rabbitmq.com/kubernetes/operator/operator-overview.html
Mozilla Public License 2.0

Enable all feature flags on upgrade #1240

Open MirahImage opened 1 year ago

MirahImage commented 1 year ago

The default behavior of the cluster operator should be to enable all feature flags after an upgrade as an additional PostDeploy step.

Currently, all feature flags are enabled when a cluster is created, but they are never enabled again after that. Should a new feature flag be added in an upgrade, that feature flag will not currently be enabled automatically. This could cause future upgrades to fail without manual intervention to enable the feature flags.

This behavior could be disabled by disabling the PostDeploy steps, much like the queue rebalance.
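For context, the effect of such a PostDeploy step would roughly match what an administrator can already do by hand today. A minimal sketch, assuming a RabbitmqCluster named `my-cluster` in the `default` namespace (both names are placeholders):

```shell
# Enable all stable feature flags on one pod of the cluster's StatefulSet
# (rabbitmqctl enable_feature_flag all skips experimental flags).
kubectl -n default exec my-cluster-server-0 -- rabbitmqctl enable_feature_flag all

# Verify the resulting state
kubectl -n default exec my-cluster-server-0 -- rabbitmqctl list_feature_flags
```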

ansd commented 1 year ago

I think it's safer for a human operator to decide when to enable what kind of feature flag. Enabling a feature flag could - depending on the migration function of the feature flag - pause certain operations in RabbitMQ:

As an operator, the most important part of this procedure to remember is that if the migration takes time, some components and thus some operations in RabbitMQ might be blocked during the migration.

However, having an opt-in (or opt-out) option sounds reasonable.
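For anyone enabling flags by hand in the meantime, a sketch of the one-at-a-time workflow this implies (the flag name is just an example):

```shell
# Inspect which feature flags exist and their current state
rabbitmqctl list_feature_flags

# Enable one specific flag once you are ready to run its migration
rabbitmqctl enable_feature_flag quorum_queue
```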

github-actions[bot] commented 1 year ago

This issue has been marked as stale due to 60 days of inactivity. Stale issues will be closed after a further 30 days of inactivity; please remove the stale label in order to prevent this occurring.

ftdcn commented 1 year ago

Sometimes it's not as easy as "I think it's safer for a human operator to decide when to enable what kind of feature flag".

Current deployment scenarios might have one team take care of the operating system layer (including updates of installed packages) while another team is responsible for the application layer and service configurations. So running a "yum update" or "apt upgrade" should not break the application. Furthermore, I cannot see a way to fix the updated package / configuration once the update has caused the service to not start. One cannot simply update RabbitMQ and enable new feature flags afterwards, as the service might just not start after the update.

I understand feature flags should be enabled on purpose by someone who understands what's going on, but on the other hand the service should start even after the service binaries got updated. This is causing quite some hassle for us at the moment, as we either need to drop the whole complex configuration including users and passwords, or (the way we do this at the moment) roll back RabbitMQ to an earlier version, enable all feature flags and then update again. This of course causes quite some downtime which is just not right ... come on guys ... you can do better than that!
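A sketch of the rollback workaround described above on a Debian-based host; the package version string is a placeholder and depends on the repository in use:

```shell
# 1. Roll back to the previously installed version (version string is a placeholder)
apt-get install --allow-downgrades rabbitmq-server=3.11.28-1

# 2. With the old version running again, enable all stable feature flags
rabbitmqctl enable_feature_flag all

# 3. Upgrade again, now with the flags already enabled
apt-get install rabbitmq-server
```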

ansd commented 1 year ago

Auto enabling feature flags might also be better implemented in rabbitmq-server itself: https://github.com/rabbitmq/rabbitmq-server/issues/5212

mundus08 commented 1 year ago

I have been using RabbitMQ successfully for many years, but the behavior of the feature flags annoys me a lot. I updated my server without enabling the feature flags first, and the updated server no longer starts. `rabbitmqctl enable_feature_flag all` seems to work only when the server is running. And actually I just wanted to make a problem-free update (like in all the years before). Just my 2 cents from a very satisfied RabbitMQ user.

mkuratczyk commented 1 year ago

@mundus08 what would be your suggestion?

  1. if we don't have a mechanism like that, we can't change certain things (sometimes even fix bugs), because all nodes in the cluster need to behave the same way, so either no evolution or no rolling upgrades
  2. if we automatically enable all feature flags after an upgrade, downgrades will be impossible (they are not supported, but people do use them, especially when there are post upgrade issues)
  3. to never enforce feature flags, we would need to maintain backwards compatibility forever, which is super hard

Asking users to run one command once they are confident the upgrade succeeded would seem like a reasonable compromise...

We can consider options such as enabling all flags automatically on the next upgrade after the one in which an FF was introduced. Say you upgrade from x.y.z to x.y+1.0 and there's a new feature flag. If you then upgrade to x.y+1.1, we would automatically enable all FFs introduced in x.y+1.0. Basically assuming that since you upgraded again, the previous upgrade must have been successful. This still wouldn't solve all upgrade paths, but would make it easier for those who upgrade regularly. The drawback is that some FFs could have an expensive migration process, which would be automatically triggered and could surprise users in a different way...

mundus08 commented 1 year ago

@mkuratczyk Please excuse the late reply. I'm probably not a typical user as I'm only running a single-node installation, so I can't evaluate the different options. My expectation would be that after an unattended update (I use Ansible to update all my Debian servers) the RabbitMQ server would be in a stable state. If necessary, an update should not be carried out at all if the server could then no longer be started due to a missing feature flag. Since the upgrade was automated, I cannot verify whether there was a warning that Ansible ignored.

mkuratczyk commented 1 year ago

This can't be fully guaranteed for all cases in a stateful service. Imagine you have an old version running, with some data on disk. You decide to start managing your machine differently and keep it up to date regularly (say, with Ansible). Suddenly you upgrade from, say, 3.8 to 3.12. Some on-disk representation changed and 3.12 can no longer read data stored by 3.8. On a dev machine, the easiest solution is to delete the data, which is something you can totally add to your scripts if that's acceptable for you, but not something we can just make happen by default (obviously many users do care about their data).

We are looking at options to make upgrades simpler and have these kinds of issues occur less often, but you can't expect distributed stateful services to always upgrade successfully unattended. There's a reason so many people want to use cloud/managed data services - they effectively outsource such concerns to somebody else. :)

mkuratczyk commented 1 year ago

One clarification: I forgot this is in the context of the Operator. In this case, there are some additional/different considerations. Ideally, the Operator would indeed prevent such upgrades. The problem is that, at least with the current design, surprisingly, the Operator doesn't know what version it's upgrading to. The only source of such information is the image tag, which is reliable in most, but not all, cases. For example, some users relocate publicly available images to their local registries and change the tags in the process (the tag may not contain the version at all). Another example: when we perform tests as part of RabbitMQ development, we use images with branch names or commit SHAs instead of versions. People may also have floating tags (e.g. 3-management).

Having said that, I agree that it'd be nice to add such functionality to the Operator. It could behave the same way as it does currently when it can't find the exact version in the tag, and be smarter when it does (it'd have to assume the image tag doesn't lie).
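To illustrate why the image tag is an unreliable version source, a purely illustrative shell check against the kinds of tags mentioned above (the Operator's real detection logic lives in Go and may differ):

```shell
for tag in 3.12.1-management 3-management my-feature-branch 8f2c1ab; do
  version=$(echo "$tag" | grep -Eo '^[0-9]+\.[0-9]+\.[0-9]+')
  echo "tag=$tag -> detected version=${version:-unknown}"
done
# Only the first tag yields an exact version; the others would force the
# Operator back to its current "target version unknown" behaviour.
```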

chintal commented 1 year ago

@mkuratczyk

I see the need to force human intervention in the upgrade process as well as the fact that the agents (apt / ansible / operator / ...) don't have the needed visibility to enable the required flags or halt the upgrade. I also agree that the change, whatever the change might be, does need to happen on rabbitmq-server (https://github.com/rabbitmq/rabbitmq-server/issues/5212).

However, I still would like to pile on and say that a regular upgrade process should not irrecoverably break running services.

I got hit with this just now on my development machine because of a simple apt upgrade. The upgrade, entirely non-specific to rabbitmq, resulted in a non-functioning and not trivially recoverable single node cluster. This wasn't even an apt full-upgrade or dist-upgrade or whatever it's called lately - which typically is where possibly breaking changes should come from.

While I can nuke the stored data and start over here, or install the previous version using apt and enable the flags, this circumstance fills me with fear for how the in-kubernetes deployment will fare if the image were to be updated for any reason. The downtime associated with that is likely going to be significantly longer, and more importantly, exponentially more expensive.

To make matters worse, the first time I heard of rabbitmq feature flags is when the upgrade broke and I looked at the logs.

As an example, postgresql also has potentially incompatible data formats between versions. Despite this, a postgres cluster does not break during upgrade. Yes, it does require some manual work to upgrade the cluster afterwards and it probably has the wrong / multiple versions running until you do this, but in over a decade of running postgres, I have never had breaking failures when naively upgrading postgres along with the system it is in.

At the minimum, it should be made possible for feature_flags to be enabled on an offline cluster. Those admins who have kept up with the flags will get a seamless transition, and those who leave it to apt or the operator or similar will have a clean way to recover. If this results in data loss, a suitable warning can be provided at the time, with a suggestion to roll back the version first in case the data is important.

tspspi commented 1 year ago

Just happened to me during an upgrade on a machine using its packaging system - updated a single-node installation from 3.8 to 3.11 - unable to start. Also unable to downgrade, since this would break all other packages due to dependencies. There has to be a way to enable feature flags without being able to run the node, to get back into a running state? (If starting over - downgrading is not possible - is the only solution, I think it's time to look for alternative MQs that can be repaired in case of errors ...)

mkuratczyk commented 1 year ago
  1. This repo is about the Kubernetes Operator, which I don't think you are using.
  2. Upgrading directly from 3.8 to 3.11 is not supported in the first place.
  3. If it is a dev machine, just delete the data folder (/var/lib/rabbitmq/* or whatever it is for you) and that's it - you will start a fresh instance of the new version.
  4. If you can't do the above, you can try downgrading and manually editing the feature_flags file (in the RabbitMQ data directory) to see if you can start 3.8 this way (see the sketch below).
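A sketch of what that file looks like; the exact path and node name below are assumptions for a typical Debian install:

```shell
# The enabled flags are recorded in a plain-text file next to the node's data
# directory; path shown is an assumption for a Debian install and a node named rabbit@myhost.
cat /var/lib/rabbitmq/mnesia/rabbit@myhost-feature_flags
# Example output: an Erlang list of flag names terminated by a period, e.g.
# [implicit_default_bindings,quorum_queue,virtual_host_metadata].
```

Editing this list while the node is stopped is the manual, unsupported escape hatch referred to in point 4 above.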

If you want to spend the time looking for a different messaging system - go ahead, but you can also start contributing to the project you already rely on. That's what open source is about.

Zerpet commented 1 year ago

I have an idea that could be a middle ground between all the opinions expressed in this issue. The Cluster Operator already has the CONTROL_RABBITMQ_IMAGE env variable, which does the following (quoting the docs):

EXPERIMENTAL! When this is set to true, the operator will always automatically set the default image tags. This can be used to automate the upgrade of RabbitMQ clusters, when the Operator is upgraded. Note there are no safety checks performed, nor any compatibility checks between RabbitMQ versions.

We could extend the behaviour of this variable to also always enable feature flags. This behaviour would be considered experimental, like the existing behaviour of this env variable. My argument for this suggestion is that automatically enabling all feature flags after every upgrade is sort of "hands free" or "auto-pilot" management of RabbitMQ, which is in the same spirit as automatically changing the RabbitMQ image.
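For reference, the variable in question is set on the Operator deployment itself; a minimal sketch, assuming the default install namespace and deployment name (`rabbitmq-system` / `rabbitmq-cluster-operator`):

```shell
# Opt the Operator into the experimental "auto-pilot" behaviour described in the docs.
# Under this proposal, the same flag would also trigger enabling feature flags post-upgrade.
kubectl -n rabbitmq-system set env deployment/rabbitmq-cluster-operator \
  CONTROL_RABBITMQ_IMAGE=true
```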

alphamonkey79 commented 4 months ago

Hello,

Will `rabbitmqctl enable_feature_flag all` enable ALL feature flags, including features RabbitMQ has designated as experimental?

Update: per https://www.rabbitmq.com/docs/feature-flags#how-to-enable-feature-flags: "The rabbitmqctl enable_feature_flag all command enables stable feature flags only and not experimental ones."

MirahImage commented 4 months ago

No, it only enables stable feature flags and does not enable experimental feature flags. https://www.rabbitmq.com/docs/feature-flags#how-to-enable-feature-flags
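A quick way to see each flag's stability from the CLI; on reasonably recent versions `list_feature_flags` accepts extra columns, though treat the `stability` column name as an assumption for older releases:

```shell
# List flags together with their stability; experimental flags stay disabled
# even after `enable_feature_flag all` and must be enabled individually by name.
rabbitmqctl list_feature_flags name state stability
```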

yannic-hamann-abb commented 2 months ago

I created a RabbitMQ cluster on version 3.13.1 through the operator in March this year. When upgrading to 3.13.6 everything seemed to work normally: the upgrade process was smooth, there were no errors, and the new pods of the sts started without any issues.

When logging into the mgmt interface I was greeted with the warning "All stable feature flags must be enabled after completing an upgrade. Without enabling all flags, upgrading to future minor or major versions of RabbitMQ may not be possible." That's why I landed here. It seems that a new feature flag, message_containers_deaths_v2, was introduced after 3.13.1. I could manually enable the flag within the mgmt interface.
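The same thing can be done from the CLI instead of the management UI; a sketch against an operator-managed pod (cluster and namespace names are placeholders):

```shell
# Enable just the newly introduced flag from inside an operator-managed pod
kubectl -n default exec my-cluster-server-0 -- \
  rabbitmqctl enable_feature_flag message_containers_deaths_v2
```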

I think that's exactly the point of feature flags: incrementally enabling them, making sure that the environment isn't affected by newly introduced features. Automatically enabling all feature flags after an upgrade kind of defeats that point, right?

mkuratczyk commented 2 months ago

The main goal of feature flags (in the RabbitMQ context; the term is used to mean other things in other products) is to allow rolling upgrades between versions that introduce changes to some functionality that all cluster nodes need to agree on. Taking quorum queues as an example: since any queue operation is replicated to all quorum queue members, we can't allow a situation where, during the upgrade, the first QQ member that was upgraded sends something to the other two members that they can't yet understand, since they are still running the previous version. Feature flags allow all nodes to be upgraded first, and only once all nodes/members are on the new version is the new behaviour enabled.

Despite the name, feature flags don't necessarily control new features as such; sometimes they are needed to introduce a bugfix, like in the case of message_containers_deaths_v2, where all nodes need to behave the same way for the functionality to work.

Each RabbitMQ version introduces many changes that are not gated by feature flags (simply because it's ok for different nodes to behave differently during the upgrade), so most of the risk is in the upgrade itself, not in enabling the feature flags later. The changes gated by FFs are not necessarily "bigger" or more important, they just happen to be changes to a functionality that needs to be identical between all nodes.

If you upgrade to a version with many new FFs, it's ok to enable them incrementally. However, as clearly stated in https://www.rabbitmq.com/docs/feature-flags, they should not be treated like configuration: "I don't need feature X and therefore won't enable the feature flag related to X" or something like that. All non-experimental feature flags should be enabled after each upgrade.

The main reason the Operator doesn't simply enable all feature flags after each upgrade is that once the FFs are enabled, you can't downgrade, because the new behaviour is already on and a node running an older version wouldn't be able to handle this new functionality ("functionality" can mean some internal API that users are not really aware of). While downgrades are not officially supported/tested, we know some users perform downgrades if the new version introduced a regression or something. By enabling all FFs immediately after the upgrade, we would effectively take away this option. But again - as soon as you deem the upgrade successful, you should enable all FFs (one by one is fine if you want), otherwise you likely don't really have some of the fixes of the new version. Sooner or later, all feature flags will be required so you can only delay enabling them up to a point.
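Putting that recommendation into a concrete post-upgrade routine (a sketch; the health checks shown are just examples of "deeming the upgrade successful"):

```shell
# Basic post-upgrade sanity checks
rabbitmq-diagnostics check_running
rabbitmq-diagnostics check_local_alarms

# See what the upgrade left disabled, then enable all stable flags
rabbitmqctl list_feature_flags | grep disabled
rabbitmqctl enable_feature_flag all
```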