pulp / pulp-operator

Kubernetes Operator for Pulp 3. Under active development.
https://docs.pulpproject.org/pulp_operator/
GNU General Public License v2.0

pulp-worker CrashLoopBackOff #1338

Closed vkukk closed 1 month ago

vkukk commented 2 months ago

Version:

helm -n pulp list
NAME    NAMESPACE   REVISION    UPDATED                                     STATUS      CHART               APP VERSION 
pulp    pulp        1           2024-08-30 15:44:19.406654519 +0300 EEST    deployed    pulp-operator-0.1.0 1.0.1-beta.4

Describe the bug: Worker pods won't start.

kubectl -n pulp logs pod/pulp-worker-674b7c5b99-876n4
Waiting on postgresql to start...
Postgres started.
Checking for database migrations
SystemCheckError: System check identified some issues:

ERRORS:
?: (files.E001) The FILE_UPLOAD_TEMP_DIR setting refers to the nonexistent directory '/var/lib/pulp/tmp'.
Database migrated!
pulp [None]: pulpcore.tasking.entrypoint:INFO: Starting distributed type worker
Traceback (most recent call last):
  File "/usr/local/bin/pulpcore-worker", line 8, in <module>
    sys.exit(worker())
  File "/usr/local/lib/python3.9/site-packages/click/core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
  File "/usr/local/lib/python3.9/site-packages/click/core.py", line 1078, in main
    rv = self.invoke(ctx)
  File "/usr/local/lib/python3.9/site-packages/click/core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/usr/local/lib/python3.9/site-packages/click/core.py", line 783, in invoke
    return __callback(*args, **kwargs)
  File "/usr/local/lib/python3.9/site-packages/pulpcore/tasking/entrypoint.py", line 43, in worker
    PulpcoreWorker().run(burst=burst)
  File "/usr/local/lib/python3.9/site-packages/pulpcore/tasking/worker.py", line 462, in run
    with WorkerDirectory(self.name):
  File "/usr/local/lib/python3.9/site-packages/pulpcore/tasking/storage.py", line 83, in __enter__
    self.create()
  File "/usr/local/lib/python3.9/site-packages/pulpcore/tasking/storage.py", line 133, in create
    super().create()
  File "/usr/local/lib/python3.9/site-packages/pulpcore/tasking/storage.py", line 40, in create
    os.makedirs(self.path, mode=self.MODE)
  File "/usr/lib64/python3.9/os.py", line 215, in makedirs
    makedirs(head, exist_ok=exist_ok)
  File "/usr/lib64/python3.9/os.py", line 225, in makedirs
    mkdir(name, mode)
PermissionError: [Errno 13] Permission denied: '/var/lib/pulp/tmp'

To Reproduce: Install the latest version using Helm, then apply the following Pulp CR with kubectl:

apiVersion: repo-manager.pulpproject.org/v1beta2
kind: Pulp
metadata:
  name: pulp
  namespace: pulp
spec:
  file_storage_storage_class: csi-cinder-high-speed
  file_storage_size: 100Gi
  file_storage_access_mode: "ReadWriteMany"
  database:
    postgres_storage_class: csi-cinder-high-speed
  cache:
    enabled: true
    redis_storage_class: csi-cinder-high-speed
  api:
    replicas: 2
    resource_requirements:
      requests:
        cpu: 250m
        memory: 256Mi
      limits:
        cpu: 1
        memory: 512Mi
  content:
    replicas: 2
    resource_requirements:
      requests:
        cpu: 250m
        memory: 256Mi
      limits:
        cpu: 500m
        memory: 512Mi
  worker:
    replicas: 2
    resource_requirements:
      requests:
        cpu: 250m
        memory: 256Mi
      limits:
        cpu: 2
        memory: 5120Mi

Expected behavior: Workers should start.

Additional context: OVH Managed Kubernetes 1.30.2

vkukk commented 2 months ago

kubectl -n pulp describe pod pulp-worker-674b7c5b99-876n4

Controlled By:  ReplicaSet/pulp-worker-674b7c5b99
Init Containers:
  init-container:
    Container ID:    containerd://41bb16cc719bcbbf4acf78826012970532d035e01e7345f03a5b653816543bb0
    Image:           quay.io/pulp/pulp-minimal:stable
    Image ID:        quay.io/pulp/pulp-minimal@sha256:32a1ebbb9db57f71063f4b830aeca966c9a9995f889800bc95caf6e8eb95e2d3
    Port:            <none>
    Host Port:       <none>
    SeccompProfile:  RuntimeDefault
    Command:
      /bin/sh
    Args:
      -c
      /usr/bin/wait_on_postgres.py
      /usr/bin/wait_on_database_migrations.sh
    State:          Terminated
      Reason:       Completed
      Exit Code:    0
      Started:      Fri, 30 Aug 2024 15:44:58 +0300
      Finished:     Fri, 30 Aug 2024 15:45:44 +0300
    Ready:          True
    Restart Count:  0
    Environment:
      POSTGRES_SERVICE_HOST:  pulp-database-svc
      POSTGRES_SERVICE_PORT:  5432
    Mounts:
      /etc/pulp/keys/database_fields.symmetric.key from pulp-db-fields-encryption (ro,path="database_fields.symmetric.key")
      /etc/pulp/settings.py from pulp-server (ro,path="settings.py")
      /var/lib/pulp from file-storage (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-759n5 (ro)
Containers:
  worker:
    Container ID:    containerd://98d159179bb7cac84fde13cd2114c4c775a88575179d0cfccab4206246e88aff
    Image:           quay.io/pulp/pulp-minimal:stable
    Image ID:        quay.io/pulp/pulp-minimal@sha256:32a1ebbb9db57f71063f4b830aeca966c9a9995f889800bc95caf6e8eb95e2d3
    Port:            <none>
    Host Port:       <none>
    SeccompProfile:  RuntimeDefault
    Command:
      /usr/bin/pulp-worker
    State:          Waiting
      Reason:       CrashLoopBackOff
    Last State:     Terminated
      Reason:       Error
      Exit Code:    1
      Started:      Fri, 30 Aug 2024 15:52:38 +0300
      Finished:     Fri, 30 Aug 2024 15:52:47 +0300
    Ready:          False
    Restart Count:  6
    Limits:
      cpu:     2
      memory:  5Gi
    Requests:
      cpu:      250m
      memory:   256Mi
    Readiness:  exec [/usr/bin/wait_on_postgres.py] delay=3s timeout=10s period=10s #success=1 #failure=1
    Environment:
      POSTGRES_SERVICE_HOST:  pulp-database-svc
      POSTGRES_SERVICE_PORT:  5432
      REDIS_SERVICE_HOST:     pulp-redis-svc.pulp
      REDIS_SERVICE_PORT:     6379
    Mounts:
      /.ansible/tmp from pulp-ansible-tmp (rw)
      /etc/pulp/keys/database_fields.symmetric.key from pulp-db-fields-encryption (ro,path="database_fields.symmetric.key")
      /etc/pulp/settings.py from pulp-server (ro,path="settings.py")
      /var/lib/pulp from file-storage (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-759n5 (ro)
Conditions:
  Type                        Status
  PodReadyToStartContainers   True 
  Initialized                 True 
  Ready                       False 
  ContainersReady             False 
  PodScheduled                True 
Volumes:
  pulp-server:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  pulp-server
    Optional:    false
  pulp-db-fields-encryption:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  pulp-db-fields-encryption
    Optional:    false
  pulp-ansible-tmp:
    Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:     
    SizeLimit:  <unset>
  file-storage:
    Type:       PersistentVolumeClaim (a reference to a PersistentVolumeClaim in the same namespace)
    ClaimName:  pulp-file-storage
    ReadOnly:   false
  kube-api-access-759n5:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
git-hyagi commented 2 months ago

Hi @vkukk

Would you mind providing the output from the following commands?

kubectl exec -it deployment/pulp-api -- ls -la /var/lib/pulp
kubectl exec -it deployment/pulp-api -- id
kubectl exec -it deployment/pulp-worker -- ls -la /var/lib/pulp
kubectl exec -it deployment/pulp-worker -- id
vkukk commented 2 months ago

I've attempted to install Pulp several times. It now seems to get stuck, unable to attach the volume to the worker pod.

$ kubectl -n pulp exec -it deployment/pulp-api -- ls -la /var/lib/pulp
total 28
drwxr-xr-x 4 root root  4096 Sep  3 13:55 .
drwxr-xr-x 1 root root  4096 Aug 28 01:44 ..
drwxr-xr-x 3 root root  4096 Sep  3 13:55 .local
drwx------ 2 root root 16384 Sep  3 13:54 lost+found
$ kubectl -n pulp exec -it deployment/pulp-api -- id
uid=700(pulp) gid=700(pulp) groups=700(pulp)
$ kubectl -n pulp describe pod pulp-worker-6f6f7484c7-qklxk
...
  Warning  FailedAttachVolume  84s                    attachdetach-controller  AttachVolume.Attach failed for volume "ovh-managed-kubernetes-oys8ou-pvc-5d3d4a86-68f3-4f97-ab67-ee8c7e2c66e5" : rpc error: code = Internal desc = [ControllerPublishVolume] Attach Volume failed with error failed to attach 2eef19fa-ce32-4ea0-8cf4-7cf73b4526b1 volume to 1936c581-5e44-4036-b1e9-9c408b6048cb compute: Bad request with: [POST https://compute.gra5.cloud.ovh.net/v2.1/4058bdfd71674fa0afb69dbe0d63ae85/servers/1936c581-5e44-4036-b1e9-9c408b6048cb/os-volume_attachments], error message: {"badRequest": {"code": 400, "message": "Invalid input received: Invalid volume: Volume 2eef19fa-ce32-4ea0-8cf4-7cf73b4526b1 status must be available or downloading to reserve, but the current status is in-use. (HTTP 400) (Request-ID: req-57aa1c28-4b14-45c3-9426-d31395d1be34)"}}
$ kubectl -n pulp describe pvc pulp-file-storage 
Name:          pulp-file-storage
Namespace:     pulp
StorageClass:  csi-cinder-high-speed
Status:        Bound
Volume:        ovh-managed-kubernetes-oys8ou-pvc-5d3d4a86-68f3-4f97-ab67-ee8c7e2c66e5
Labels:        app.kubernetes.io/component=storage
               app.kubernetes.io/managed-by=pulp-operator
               app.kubernetes.io/part-of=pulp
               pulp_cr=pulp
Annotations:   pv.kubernetes.io/bind-completed: yes
               pv.kubernetes.io/bound-by-controller: yes
               volume.beta.kubernetes.io/storage-provisioner: cinder.csi.openstack.org
               volume.kubernetes.io/storage-provisioner: cinder.csi.openstack.org
Finalizers:    [kubernetes.io/pvc-protection]
Capacity:      100Gi
Access Modes:  RWX
VolumeMode:    Filesystem
Used By:       pulp-api-7d45f95595-th475
               pulp-content-67d4874c45-q6bcm
               pulp-worker-6f6f7484c7-qklxk
Events:
  Type    Reason                 Age    From                                                                                                       Message
  ----    ------                 ----   ----                                                                                                       -------
  Normal  Provisioning           8m30s  cinder.csi.openstack.org_csi-cinder-controllerplugin-8464d84df-qsdbj_9994f747-dbf3-4a0a-981e-377575d86266  External provisioner is provisioning volume for claim "pulp/pulp-file-storage"
  Normal  ExternalProvisioning   8m30s  persistentvolume-controller                                                                                Waiting for a volume to be created either by the external provisioner 'cinder.csi.openstack.org' or manually by the system administrator. If volume creation is delayed, please verify that the provisioner is running and correctly registered.
  Normal  ProvisioningSucceeded  8m30s  cinder.csi.openstack.org_csi-cinder-controllerplugin-8464d84df-qsdbj_9994f747-dbf3-4a0a-981e-377575d86266  Successfully provisioned volume ovh-managed-kubernetes-oys8ou-pvc-5d3d4a86-68f3-4f97-ab67-ee8c7e2c66e5

Persistent volume itself:

$ kubectl -n pulp describe pv ovh-managed-kubernetes-oys8ou-pvc-5d3d4a86-68f3-4f97-ab67-ee8c7e2c66e5
Name:              ovh-managed-kubernetes-oys8ou-pvc-5d3d4a86-68f3-4f97-ab67-ee8c7e2c66e5
Labels:            <none>
Annotations:       pv.kubernetes.io/provisioned-by: cinder.csi.openstack.org
                   volume.kubernetes.io/provisioner-deletion-secret-name: 
                   volume.kubernetes.io/provisioner-deletion-secret-namespace: 
Finalizers:        [kubernetes.io/pv-protection external-attacher/cinder-csi-openstack-org]
StorageClass:      csi-cinder-high-speed
Status:            Bound
Claim:             pulp/pulp-file-storage
Reclaim Policy:    Delete
Access Modes:      RWX
VolumeMode:        Filesystem
Capacity:          100Gi
Node Affinity:     
  Required Terms:  
    Term 0:        topology.cinder.csi.openstack.org/zone in [nova]
Message:           
Source:
    Type:              CSI (a Container Storage Interface (CSI) volume source)
    Driver:            cinder.csi.openstack.org
    FSType:            ext4
    VolumeHandle:      2eef19fa-ce32-4ea0-8cf4-7cf73b4526b1
    ReadOnly:          false
    VolumeAttributes:      storage.kubernetes.io/csiProvisionerIdentity=1724323204382-7413-cinder.csi.openstack.org
Events:                <none>
vkukk commented 2 months ago

After deleting pod pulp-worker-6f6f7484c7-qklxk, a new worker pod was created and succeeded in mounting the volume. I also managed to run the requested commands between crash loops.

$ kubectl -n pulp logs pulp-worker-6f6f7484c7-cncxc
Waiting on postgresql to start...
Postgres started.
Checking for database migrations
SystemCheckError: System check identified some issues:

ERRORS:
?: (files.E001) The FILE_UPLOAD_TEMP_DIR setting refers to the nonexistent directory '/var/lib/pulp/tmp'.
Database migrated!
pulp [None]: pulpcore.tasking.entrypoint:INFO: Starting distributed type worker
Traceback (most recent call last):
  File "/usr/local/bin/pulpcore-worker", line 8, in <module>
    sys.exit(worker())
  File "/usr/local/lib/python3.9/site-packages/click/core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
  File "/usr/local/lib/python3.9/site-packages/click/core.py", line 1078, in main
    rv = self.invoke(ctx)
  File "/usr/local/lib/python3.9/site-packages/click/core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/usr/local/lib/python3.9/site-packages/click/core.py", line 783, in invoke
    return __callback(*args, **kwargs)
  File "/usr/local/lib/python3.9/site-packages/pulpcore/tasking/entrypoint.py", line 43, in worker
    PulpcoreWorker().run(burst=burst)
  File "/usr/local/lib/python3.9/site-packages/pulpcore/tasking/worker.py", line 462, in run
    with WorkerDirectory(self.name):
  File "/usr/local/lib/python3.9/site-packages/pulpcore/tasking/storage.py", line 83, in __enter__
    self.create()
  File "/usr/local/lib/python3.9/site-packages/pulpcore/tasking/storage.py", line 133, in create
    super().create()
  File "/usr/local/lib/python3.9/site-packages/pulpcore/tasking/storage.py", line 40, in create
    os.makedirs(self.path, mode=self.MODE)
  File "/usr/lib64/python3.9/os.py", line 215, in makedirs
    makedirs(head, exist_ok=exist_ok)
  File "/usr/lib64/python3.9/os.py", line 225, in makedirs
    mkdir(name, mode)
PermissionError: [Errno 13] Permission denied: '/var/lib/pulp/tmp'
$ kubectl -n pulp exec -it deployment/pulp-worker -- ls -la /var/lib/pulp
total 28
drwxr-xr-x 4 root root  4096 Sep  3 13:55 .
drwxr-xr-x 1 root root  4096 Aug 28 01:44 ..
drwxr-xr-x 3 root root  4096 Sep  3 13:55 .local
drwx------ 2 root root 16384 Sep  3 13:54 lost+found
$ kubectl -n pulp exec -it deployment/pulp-worker -- id
uid=700(pulp) gid=700(pulp) groups=700(pulp)
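
The listings above show /var/lib/pulp owned by root:root while the container runs as uid/gid 700, which matches the mkdir failure in the traceback. As a generic Kubernetes mitigation sketch (not a documented Pulp CR field; whether the operator exposes a pod-level securityContext is an assumption, and the cluster's CSI driver must honor fsGroup for this to help), setting fsGroup makes the kubelet change group ownership of the mounted volume on attach:

```yaml
# Illustrative pod-spec fragment, not a Pulp CR field.
# With fsGroup set, the kubelet recursively chgrps the mounted volume to 700
# on attach, so the pulp user (uid/gid 700) can create /var/lib/pulp/tmp.
apiVersion: v1
kind: Pod
metadata:
  name: fsgroup-example
spec:
  securityContext:
    fsGroup: 700
    fsGroupChangePolicy: OnRootMismatch  # avoid re-chowning a large volume on every mount
  containers:
  - name: worker
    image: quay.io/pulp/pulp-minimal:stable
    volumeMounts:
    - name: file-storage
      mountPath: /var/lib/pulp
  volumes:
  - name: file-storage
    persistentVolumeClaim:
      claimName: pulp-file-storage
```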
git-hyagi commented 1 month ago

Hi @vkukk

Unfortunately, I don't have access to an OVH k8s cluster to run tests and investigate this further, but here are some things that I noticed:

1. from OVH docs, this storage class does not support RWX: https://help.ovhcloud.com/csm/en-ie-public-cloud-kubernetes-set-up-persistent-volume?id=kb_article_view&sysparm_article=KB0049964#access-modes "Our storage resource, Cinder, doesn't allow to mount a PV on several nodes at the same time, so you need to use the ReadWriteOnce access mode."

which probably explains the AttachVolume error:

  Warning  FailedAttachVolume  84s                    attachdetach-controller  AttachVolume.Attach failed for volume "ovh-managed-kubernetes-oys8ou-pvc-5d3d4a86-68f3-4f97-ab67-ee8c7e2c66e5" : rpc error: code = Internal desc = [ControllerPublishVolume] Attach Volume failed with error failed to attach 2eef19fa-ce32-4ea0-8cf4-7cf73b4526b1 volume to 1936c581-5e44-4036-b1e9-9c408b6048cb compute: Bad request with: [POST https://compute.gra5.cloud.ovh.net/v2.1/4058bdfd71674fa0afb69dbe0d63ae85/servers/1936c581-5e44-4036-b1e9-9c408b6048cb/os-volume_attachments], error message: {"badRequest": {"code": 400, "message": "Invalid input received: Invalid volume: Volume 2eef19fa-ce32-4ea0-8cf4-7cf73b4526b1 status must be available or downloading to reserve, but the current status is in-use. (HTTP 400) (Request-ID: req-57aa1c28-4b14-45c3-9426-d31395d1be34)"}}

2. the permission error seems to be a known issue on OVHcloud Managed Kubernetes: https://help.ovhcloud.com/csm/en-public-cloud-kubernetes-persistentvolumes-permission-errors?id=kb_article_view&sysparm_article=KB0049758

So, to work around problem 1, I can think of the following paths: schedule all the pods that share the volume onto a single node (for example with a node_selector) so that a ReadWriteOnce volume is enough, or switch from file storage to object storage (S3/Azure).

For problem 2, it seems like the OVH team is working on a patch for this error: https://help.ovhcloud.com/csm/en-public-cloud-kubernetes-persistentvolumes-permission-errors?id=kb_article_view&sysparm_article=KB0049758#provided-solutions
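
One path sometimes suggested (and declined by vkukk in a later comment) is pinning every component that mounts the volume to a single labeled node so a ReadWriteOnce volume suffices. A hypothetical sketch — the pulp-storage label is made up, and per-component node_selector fields should be verified against the operator's API:

```yaml
# Hypothetical: label one node (kubectl label node <node-name> pulp-storage=true)
# and pin all components that share the RWO volume to it.
spec:
  api:
    node_selector:
      pulp-storage: "true"
  content:
    node_selector:
      pulp-storage: "true"
  worker:
    node_selector:
      pulp-storage: "true"
```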

vkukk commented 1 month ago

Got pulp-operator working with S3.
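
For reference, a sketch of what an S3-backed setup can look like. The object_storage_s3_secret CR field and the secret key names below follow pulp-operator's storage documentation, but both should be verified against the operator version in use; all values are placeholders:

```yaml
# Secret holding the S3 credentials (key names per pulp-operator docs)
apiVersion: v1
kind: Secret
metadata:
  name: pulp-object-storage
  namespace: pulp
stringData:
  s3-access-key-id: "<access-key>"
  s3-secret-access-key: "<secret-key>"
  s3-bucket-name: "<bucket>"
  s3-region: "<region>"
---
# Pulp CR referencing the secret instead of file storage
apiVersion: repo-manager.pulpproject.org/v1beta2
kind: Pulp
metadata:
  name: pulp
  namespace: pulp
spec:
  object_storage_s3_secret: pulp-object-storage
```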

vkukk commented 1 month ago

I had to revert to using OVH Cinder volumes for file storage. Since OVH supports only ReadWriteOnce, not ReadWriteMany, all pulp-api, pulp-worker, and pulp-content pods must run on the single node that has the shared volume mounted.

The suggestion to use node_selector did not suit me well, because it would have meant actively managing node labels, leaving only a single node usable for those pods at a time. Not a very HA approach. I wanted those pods to land on the same node as the other pods sharing the volume: not on a specific labeled node, but on whichever node already runs any of these pods.

Using podAffinity works well for gathering all the pods onto a single node. If that node goes down, all the pods move to some other node, but they always stay together.

  worker:
    affinity:
      podAffinity:
        requiredDuringSchedulingIgnoredDuringExecution:
        - labelSelector:
            matchExpressions:
            - key: app.kubernetes.io/name
              operator: In
              values:
              - pulp-api
              - pulp-worker
              - pulp-content
          topologyKey: kubernetes.io/hostname

Pods whose app.kubernetes.io/name label matches one of the listed values are scheduled together on the same node, as grouped by the kubernetes.io/hostname topologyKey.
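
For the pods to actually be co-scheduled, the same affinity presumably has to be set on every component, not just the worker (the snippet above shows worker only). A sketch, assuming api and content accept the same affinity field in the CR:

```yaml
spec:
  api:
    affinity:
      podAffinity:
        requiredDuringSchedulingIgnoredDuringExecution:
        - labelSelector:
            matchExpressions:
            - key: app.kubernetes.io/name
              operator: In
              values:
              - pulp-api
              - pulp-worker
              - pulp-content
          topologyKey: kubernetes.io/hostname
```

with the same block repeated under content (and under worker, as shown in the comment above). Because the label selector also matches each pod's own component label, the first pod to schedule can still land somewhere, and the rest then follow it.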