rabbitmq / cluster-operator

RabbitMQ Cluster Kubernetes Operator
https://www.rabbitmq.com/kubernetes/operator/operator-overview.html
Mozilla Public License 2.0
872 stars 272 forks source link

Feature flag disable feature is not working #688

Closed Arn-Arm closed 3 years ago

Arn-Arm commented 3 years ago

Describe the bug

In order to try mitigate issue described here we want disable quorum queues completely which are enabled by default. Documentation describes that it can be done using environment variable(inside envConfig and/or env) RABBITMQ_FEATURE_FLAGS or forced_feature_flags_on_init configuration parameter. After trying both methods i can see that quorum_queue feature is enabled and environment variable and/or configuration are present.

To make sure that feature was not disabled i have checked RabbitMQ logs and double checked with the rabbitmqcli tool.

Snippet from the logs

...
│ 2021-05-12 10:58:39.832 [info] <0.1082.0> Feature flag `quorum_queue`: supported, attempt to enable...                                                                       
│ 2021-05-12 10:58:39.832 [info] <0.1082.0> Feature flag `quorum_queue`: mark as enabled=state_changing                                                                        
│ 2021-05-12 10:58:39.847 [info] <0.1082.0> Feature flags: list of feature flags found:                                                                                        
│ 2021-05-12 10:58:39.847 [info] <0.1082.0> Feature flags:   [x] drop_unroutable_metric                                                                                        
│ 2021-05-12 10:58:39.847 [info] <0.1082.0> Feature flags:   [x] empty_basic_get_metric                                                                                        
│ 2021-05-12 10:58:39.847 [info] <0.1082.0> Feature flags:   [x] implicit_default_bindings                                                                                     
│ 2021-05-12 10:58:39.848 [info] <0.1082.0> Feature flags:   [~] quorum_queue                                                                                                  
│ 2021-05-12 10:58:39.848 [info] <0.1082.0> Feature flags:   [ ] virtual_host_metadata                                                                                         
│ 2021-05-12 10:58:39.848 [info] <0.1082.0> Feature flags: feature flag states written to disk: yes
...

Rabbitmq CLI output

...
rabbitmq@rabbitmq-cluster-server-0:/$ rabbitmqctl list_feature_flags
Listing feature flags ...
name    state
drop_unroutable_metric  enabled
empty_basic_get_metric  enabled
implicit_default_bindings   enabled
quorum_queue    enabled
virtual_host_metadata   enabled
...

To Reproduce

Steps to reproduce the behavior:

  1. Deploy following RabbitMQ yaml file
---
apiVersion: rabbitmq.com/v1beta1
kind: RabbitmqCluster
metadata:
  name: rabbitmq-cluster
spec:
  image: rabbitmq:3.8.5
  replicas: 3
  persistence:
    storage: 1Gi
  override:
    statefulSet:
      spec:
        template:
          spec:
            containers:
              - name: rabbitmq
                env:
                  - name: RABBITMQ_FEATURE_FLAGS
                    value: drop_unroutable_metric,empty_basic_get_metric,implicit_default_bindings,virtual_host_metadata
  rabbitmq:
    additionalPlugins:
      - rabbitmq_amqp1_0
    envConfig: |
     RABBITMQ_FEATURE_FLAGS=drop_unroutable_metric,empty_basic_get_metric,implicit_default_bindings,virtual_host_metadata
    advancedConfig: | 
      [
        {
        rabbit,[{
          forced_feature_flags_on_init, [drop_unroutable_metric, empty_basic_get_metric, implicit_default_bindings,virtual_host_metadata]
        }]
        }
      ].

Expected behavior quorum_queues feature flag is disabled.

Version and environment information

ChunyiLyu commented 3 years ago

Hi @Arn-Arm, it's not necessarily the use of quorum queues that's causing pod getting stuck on termination. On pod termination, cluster operator runs three commands in the preStop container lifecycle hooks: rabbitmq-upgrade await_online_quorum_plus_one, rabbitmq-upgrade await_online_synchronized_mirror, and rabbitmq-upgrade drain (see the logic here: https://github.com/rabbitmq/cluster-operator/blob/main/internal/resource/statefulset.go#L679). First two commands ensure that all quorum queues and mirrored queues on this particular rabbit node have quorum (or are synced). You can run these command on your stuck pod to understand which command is preventing termination. If it's indeed the quorum queue command that's causing issues, you can check quorum status for each queue to figure out which queue is the problem by running rabbitmq-queues quorum_status (https://www.rabbitmq.com/rabbitmq-queues.8.html#quorum_status)

A workaround to disable container preStop checks is to set a short timeout for the preStop container lifecycle hook (see: https://www.rabbitmq.com/kubernetes/operator/using-operator.html#TerminationGracePeriodSeconds). The default timeout is a week long (for data safety), you can configure the timeout to 0 second or a short time to get around it. However, this means that the checks might not be performed successfully at termination and you are at greater risk of data loss.

For your request to disable quorum queue feature flags, cluster operator does not support disabling any feature flag. I will update our documentation to make it clear. Let us know what you find. I think pods stuck at termination can and should be fixed without disabling any feature flag.

ChunyiLyu commented 3 years ago

Closing because this is not an issue/bug.