rabbitmq / cluster-operator

RabbitMQ Cluster Kubernetes Operator
https://www.rabbitmq.com/kubernetes/operator/operator-overview.html
Mozilla Public License 2.0

kubernetes operator not setting folder permissions before de-escalating permissions #1363

Open psarossy opened 1 year ago

psarossy commented 1 year ago

Describe the bug

This is similar to #1327 but with CephFS PVCs. Also at https://stackoverflow.com/questions/67771239/rabbitmq-fails-to-start-with-persistence-storage-on-kubernetes-permission-denie

The pod starts up but has no write access to the mnesia folder

I've deployed the standard example operator and test cluster from: https://rabbitmq.com/kubernetes/operator/quickstart-operator.html

The only modification I've added to the test cluster is that I set the storage-class.

To Reproduce

Steps to reproduce the behavior:

  1. kubectl apply -f https://github.com/rabbitmq/cluster-operator/releases/latest/download/cluster-operator.yml
  2. enable persistence with a storage class
    apiVersion: rabbitmq.com/v1beta1
    kind: RabbitmqCluster
    metadata:
      name: hello-world
    spec:
      persistence:
        storageClassName: nvme-pool-ec62
        storage: 20Gi
  3. kubectl apply -f rabbitmq.yaml

Expected behavior

  1. pod, PVC, PV is provisioned
  2. pod attaches PV
  3. pod starts

At step 3, the process fails because the binary does not have write access to the persistent volume:

Stream closed EOF for default/hello-world-server-0 (rabbitmq)                                                                                                                                            
rabbitmq 2023-05-23 19:30:58.187362+00:00 [warning] <0.132.0> Failed to write PID file "/var/lib/rabbitmq/mnesia/rabbit@hello-world-server-0.hello-world-nodes.default.pid": permission denied           
rabbitmq 2023-05-23 19:31:01.039621+00:00 [notice] <0.44.0> Application syslog exited with reason: stopped                                                                                               
rabbitmq 2023-05-23 19:31:01.039898+00:00 [notice] <0.230.0> Logging: switching to configured handler(s); following messages may not be visible in this log output                                       
rabbitmq 2023-05-23 19:31:01.066987+00:00 [notice] <0.230.0> Logging: configured log handlers are now ACTIVE                                                                                             
rabbitmq                                                                                                                                                                                                 
rabbitmq BOOT FAILED                                                                                                                                                                                     
rabbitmq ===========                                                                                                                                                                                     
rabbitmq Error during startup: {error,                                                                                                                                                                   
rabbitmq 2023-05-23 19:31:01.126017+00:00 [error] <0.230.0>                                                                                                                                              
rabbitmq 2023-05-23 19:31:01.126017+00:00 [error] <0.230.0> BOOT FAILED                                                                                                                                  
rabbitmq 2023-05-23 19:31:01.126017+00:00 [error] <0.230.0> ===========                                                                                                                                  
rabbitmq 2023-05-23 19:31:01.126017+00:00 [error] <0.230.0> Error during startup: {error,                                                                                                                
rabbitmq 2023-05-23 19:31:01.126017+00:00 [error] <0.230.0>                           {cannot_create_mnesia_dir,                                                                                         
rabbitmq 2023-05-23 19:31:01.126017+00:00 [error] <0.230.0>                               "/var/lib/rabbitmq/mnesia/rabbit@hello-world-server-0.hello-world-nodes.default/",                             
rabbitmq 2023-05-23 19:31:01.126017+00:00 [error] <0.230.0>                               eacces}}                                                                                                       
rabbitmq                           {cannot_create_mnesia_dir,                                                                                                                                            
rabbitmq 2023-05-23 19:31:01.126017+00:00 [error] <0.230.0>                                                                                                                                              
rabbitmq                               "/var/lib/rabbitmq/mnesia/rabbit@hello-world-server-0.hello-world-nodes.default/",                                                                                
rabbitmq                               eacces}}                                                                                                                                                          
rabbitmq                                                                                                                                                                                                 
rabbitmq 2023-05-23 19:31:02.127328+00:00 [error] <0.229.0>   crasher:                                                                                                                                   
rabbitmq 2023-05-23 19:31:02.127328+00:00 [error] <0.229.0>     initial call: application_master:init/4                                                                                                  
rabbitmq 2023-05-23 19:31:02.127328+00:00 [error] <0.229.0>     pid: <0.229.0>                                                                                                                           
rabbitmq 2023-05-23 19:31:02.127328+00:00 [error] <0.229.0>     registered_name: []                                                                                                                      
rabbitmq 2023-05-23 19:31:02.127328+00:00 [error] <0.229.0>     exception exit: {{cannot_create_mnesia_dir,                                                                                              
rabbitmq 2023-05-23 19:31:02.127328+00:00 [error] <0.229.0>                          "/var/lib/rabbitmq/mnesia/rabbit@hello-world-server-0.hello-world-nodes.default/",                                  
rabbitmq 2023-05-23 19:31:02.127328+00:00 [error] <0.229.0>                          eacces},                                                                                                            
rabbitmq 2023-05-23 19:31:02.127328+00:00 [error] <0.229.0>                      {rabbit,start,[normal,]]}}                                                                                              
rabbitmq 2023-05-23 19:31:02.127328+00:00 [error] <0.229.0>       in function  application_master:init/4 (application_master.erl, line 142)                                                              
rabbitmq 2023-05-23 19:31:02.127328+00:00 [error] <0.229.0>     ancestors: [<0.228.0>]                                                                                                                   
rabbitmq 2023-05-23 19:31:02.127328+00:00 [error] <0.229.0>     message_queue_len: 1                                                                                                                     
rabbitmq 2023-05-23 19:31:02.127328+00:00 [error] <0.229.0>     messages: [{'EXIT',<0.230.0>,normal}]                                                                                                    
rabbitmq 2023-05-23 19:31:02.127328+00:00 [error] <0.229.0>     links: [<0.228.0>,<0.44.0>]                                                                                                              
rabbitmq 2023-05-23 19:31:02.127328+00:00 [error] <0.229.0>     dictionary: []                                                                                                                           
rabbitmq 2023-05-23 19:31:02.127328+00:00 [error] <0.229.0>     trap_exit: true                                                                                                                          
rabbitmq 2023-05-23 19:31:02.127328+00:00 [error] <0.229.0>     status: running                                                                                                                          
rabbitmq 2023-05-23 19:31:02.127328+00:00 [error] <0.229.0>     heap_size: 610                                                                                                                           
rabbitmq 2023-05-23 19:31:02.127328+00:00 [error] <0.229.0>     stack_size: 28                                                                                                                           
rabbitmq 2023-05-23 19:31:02.127328+00:00 [error] <0.229.0>     reductions: 178                                                                                                                          
rabbitmq 2023-05-23 19:31:02.127328+00:00 [error] <0.229.0>   neighbours:                                                                                                                                
rabbitmq 2023-05-23 19:31:02.127328+00:00 [error] <0.229.0>                                                                                                                                              
rabbitmq 2023-05-23 19:31:02.143479+00:00 [notice] <0.44.0> Application rabbit exited with reason: {{cannot_create_mnesia_dir,"/var/lib/rabbitmq/mnesia/rabbit@hello-world-server-0.hello-world-nodes.default/",eacces},{rabbit,start,[normal,]]}}
rabbitmq {"Kernel pid terminated",application_controller,"{application_start_failure,rabbit,{{cannot_create_mnesia_dir,\"/var/lib/rabbitmq/mnesia/rabbit@hello-world-server-0.hello-world-nodes.default/\",eacces},{rabbit,start,[normal,]]}}}"} 
rabbitmq Kernel pid terminated (application_controller) ({application_start_failure,rabbit,{{cannot_create_mnesia_dir,"/var/lib/rabbitmq/mnesia/rabbit@hello-world-server-0.hello-world-nodes.default/", eacces},{rabbit,start,[normal,]]}}})                                                                                                                                                                     
rabbitmq                                                                                                                                                                                                 
rabbitmq Crash dump is being written to: /var/log/rabbitmq/erl_crash.dump...done                                                                                                                         
Stream closed EOF for default/hello-world-server-0 (setup-container)

The volume that gets created is owned by root by default as with all other PVCs:

psarossy@artemis: ~/ceph/volumes/csi/csi-vol-f6c8aaf0-8c7b-4bdb-a7dd-fb514c9d3639/26f680c7-a460-4789-bb9d-9b085672b406
$ ls -al
total 0
drwxr-xr-x 2 root root 0 May 23 11:29 .
drwxr-xr-x 3 root root 2 May 23 11:29 ..

If I change the folder ownership to UID/GID 999:999 (rabbitmq:rabbitmq), the pod starts up and works fine.

The statefulset is missing the command to claim the folder as part of init, before handing over to the non-privileged user to start the process. Unfortunately, this needs to be fixed in the operator itself: every pod hits the same issue whenever a new PVC is created, and the operator (rightfully) overwrites any manual changes to the configs.
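To make the missing step concrete, here is a minimal sketch of what such an init container could look like if added through the RabbitmqCluster `override` field. This is an untested assumption, not the operator's behavior: the init container name `fix-mnesia-permissions`, the `busybox` image, and the volume name `persistence` are illustrative and should be checked against the StatefulSet the operator actually generates. It also only helps if the storage backend lets root change ownership of the mount.

```yaml
apiVersion: rabbitmq.com/v1beta1
kind: RabbitmqCluster
metadata:
  name: hello-world
spec:
  persistence:
    storageClassName: nvme-pool-ec62
    storage: 20Gi
  override:
    statefulSet:
      spec:
        template:
          spec:
            # Left empty on the assumption that the override schema
            # expects this field to be present.
            containers: []
            initContainers:
              # Hypothetical init container: runs as root and hands the
              # data directory to UID/GID 999 (rabbitmq) before the
              # non-privileged RabbitMQ container starts.
              - name: fix-mnesia-permissions
                image: busybox
                command: ["sh", "-c", "chown -R 999:999 /var/lib/rabbitmq/mnesia"]
                securityContext:
                  runAsUser: 0
                volumeMounts:
                  # Assumed name of the operator-managed data volume.
                  - name: persistence
                    mountPath: /var/lib/rabbitmq/mnesia
```

Whether the operator merges this entry with its own init containers or replaces them may depend on the operator version, so inspect the resulting StatefulSet after applying it.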

Version and environment information

lukebakken commented 1 year ago

The statefulset is missing the command to claim the folder as part of init before handing over to the non-privileged user to start the process

It sounds like you understand the issue well. A pull request to fix it would be very welcome. Thanks.

psarossy commented 1 year ago

Did some more digging, the Helm recipe has a specific init container to fix this that can be enabled on demand, and with that the pods start up as expected, and work.

Tried to work on getting that option added, but I can't even get the code to build and pass tests without my modifications so gave up after like 2 hours...

github-actions[bot] commented 1 year ago

This issue has been marked as stale due to 60 days of inactivity. Stale issues will be closed after a further 30 days of inactivity; please remove the stale label in order to prevent this occurring.

Zerpet commented 1 year ago

Removing the stale label as this issue is legitimate. I recall some work around this in the last year or so; we'll have to dig through the history a bit to understand what changed and what our motivation for the change was.

jonathandavis805 commented 9 months ago

I'm running into this issue with cluster-operator v2.6.0 on an EKS cluster running version 1.28. I followed the docs and got this error: Failed to write PID file "/var/lib/rabbitmq/mnesia/rabbit@rabbitmq-dev-server-0.rabbitmq-dev-nodes.rabbitmq-cluster-dev.pid": permission denied

jonathandavis805 commented 8 months ago

What resolved this for me was the override described in the "Using OpenShift" section of the docs:

apiVersion: rabbitmq.com/v1beta1
kind: RabbitmqCluster
metadata:
  ...
spec:
  ...
  override:
    statefulSet:
      spec:
        template:
          spec:
            containers: []
            securityContext: {}

psarossy commented 6 months ago

RE @jonathandavis805 That worked for me as well, but it only works because it removes some of the StatefulSet's security settings :(

mkuratczyk commented 6 months ago

If you have this problem, please investigate what changes are necessary and share here. You can pause reconciliation and make modifications to the STS for example.
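For anyone trying this, a sketch of pausing reconciliation: to my understanding the operator stops reconciling a cluster that carries the `rabbitmq.com/pauseReconciliation` label, but please verify the label name against the operator docs for your version. The cluster name `hello-world` is taken from the original report.

```yaml
apiVersion: rabbitmq.com/v1beta1
kind: RabbitmqCluster
metadata:
  name: hello-world
  labels:
    # Assumed label; while it is set, manual edits to the generated
    # StatefulSet should not be reverted by the operator.
    rabbitmq.com/pauseReconciliation: "true"
spec:
  ...
```

With the label in place, the StatefulSet can be edited directly to experiment with security context or init container changes. Removing the label should resume reconciliation, at which point the operator will overwrite those edits.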

mlb5000 commented 2 months ago

I have this same problem. Does anyone have a solution? With @jonathandavis805's override it won't even try to start up, as it gets stuck at chown:

chown: changing ownership of '/var/lib/rabbitmq/mnesia': Operation not permitted