vkukk closed this issue 1 month ago
kubectl -n pulp describe pod pulp-worker-674b7c5b99-876n4
Controlled By: ReplicaSet/pulp-worker-674b7c5b99
Init Containers:
init-container:
Container ID: containerd://41bb16cc719bcbbf4acf78826012970532d035e01e7345f03a5b653816543bb0
Image: quay.io/pulp/pulp-minimal:stable
Image ID: quay.io/pulp/pulp-minimal@sha256:32a1ebbb9db57f71063f4b830aeca966c9a9995f889800bc95caf6e8eb95e2d3
Port: <none>
Host Port: <none>
SeccompProfile: RuntimeDefault
Command:
/bin/sh
Args:
-c
/usr/bin/wait_on_postgres.py
/usr/bin/wait_on_database_migrations.sh
State: Terminated
Reason: Completed
Exit Code: 0
Started: Fri, 30 Aug 2024 15:44:58 +0300
Finished: Fri, 30 Aug 2024 15:45:44 +0300
Ready: True
Restart Count: 0
Environment:
POSTGRES_SERVICE_HOST: pulp-database-svc
POSTGRES_SERVICE_PORT: 5432
Mounts:
/etc/pulp/keys/database_fields.symmetric.key from pulp-db-fields-encryption (ro,path="database_fields.symmetric.key")
/etc/pulp/settings.py from pulp-server (ro,path="settings.py")
/var/lib/pulp from file-storage (rw)
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-759n5 (ro)
Containers:
worker:
Container ID: containerd://98d159179bb7cac84fde13cd2114c4c775a88575179d0cfccab4206246e88aff
Image: quay.io/pulp/pulp-minimal:stable
Image ID: quay.io/pulp/pulp-minimal@sha256:32a1ebbb9db57f71063f4b830aeca966c9a9995f889800bc95caf6e8eb95e2d3
Port: <none>
Host Port: <none>
SeccompProfile: RuntimeDefault
Command:
/usr/bin/pulp-worker
State: Waiting
Reason: CrashLoopBackOff
Last State: Terminated
Reason: Error
Exit Code: 1
Started: Fri, 30 Aug 2024 15:52:38 +0300
Finished: Fri, 30 Aug 2024 15:52:47 +0300
Ready: False
Restart Count: 6
Limits:
cpu: 2
memory: 5Gi
Requests:
cpu: 250m
memory: 256Mi
Readiness: exec [/usr/bin/wait_on_postgres.py] delay=3s timeout=10s period=10s #success=1 #failure=1
Environment:
POSTGRES_SERVICE_HOST: pulp-database-svc
POSTGRES_SERVICE_PORT: 5432
REDIS_SERVICE_HOST: pulp-redis-svc.pulp
REDIS_SERVICE_PORT: 6379
Mounts:
/.ansible/tmp from pulp-ansible-tmp (rw)
/etc/pulp/keys/database_fields.symmetric.key from pulp-db-fields-encryption (ro,path="database_fields.symmetric.key")
/etc/pulp/settings.py from pulp-server (ro,path="settings.py")
/var/lib/pulp from file-storage (rw)
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-759n5 (ro)
Conditions:
Type Status
PodReadyToStartContainers True
Initialized True
Ready False
ContainersReady False
PodScheduled True
Volumes:
pulp-server:
Type: Secret (a volume populated by a Secret)
SecretName: pulp-server
Optional: false
pulp-db-fields-encryption:
Type: Secret (a volume populated by a Secret)
SecretName: pulp-db-fields-encryption
Optional: false
pulp-ansible-tmp:
Type: EmptyDir (a temporary directory that shares a pod's lifetime)
Medium:
SizeLimit: <unset>
file-storage:
Type: PersistentVolumeClaim (a reference to a PersistentVolumeClaim in the same namespace)
ClaimName: pulp-file-storage
ReadOnly: false
kube-api-access-759n5:
Type: Projected (a volume that contains injected data from multiple sources)
TokenExpirationSeconds: 3607
ConfigMapName: kube-root-ca.crt
ConfigMapOptional: <nil>
DownwardAPI: true
Hi @vkukk
Would you mind providing the output from the following commands?
kubectl exec -it deployment/pulp-api -- ls -la /var/lib/pulp
kubectl exec -it deployment/pulp-api -- id
kubectl exec -it deployment/pulp-worker -- ls -la /var/lib/pulp
kubectl exec -it deployment/pulp-worker -- id
I've attempted to install Pulp several times. It now seems to get stuck because the volume cannot be attached to the worker pod.
$ kubectl -n pulp exec -it deployment/pulp-api -- ls -la /var/lib/pulp
total 28
drwxr-xr-x 4 root root 4096 Sep 3 13:55 .
drwxr-xr-x 1 root root 4096 Aug 28 01:44 ..
drwxr-xr-x 3 root root 4096 Sep 3 13:55 .local
drwx------ 2 root root 16384 Sep 3 13:54 lost+found
$ kubectl -n pulp exec -it deployment/pulp-api -- id
uid=700(pulp) gid=700(pulp) groups=700(pulp)
kubectl -n pulp describe pod pulp-worker-6f6f7484c7-qklxk
...
Warning FailedAttachVolume 84s attachdetach-controller AttachVolume.Attach failed for volume "ovh-managed-kubernetes-oys8ou-pvc-5d3d4a86-68f3-4f97-ab67-ee8c7e2c66e5" : rpc error: code = Internal desc = [ControllerPublishVolume] Attach Volume failed with error failed to attach 2eef19fa-ce32-4ea0-8cf4-7cf73b4526b1 volume to 1936c581-5e44-4036-b1e9-9c408b6048cb compute: Bad request with: [POST https://compute.gra5.cloud.ovh.net/v2.1/4058bdfd71674fa0afb69dbe0d63ae85/servers/1936c581-5e44-4036-b1e9-9c408b6048cb/os-volume_attachments], error message: {"badRequest": {"code": 400, "message": "Invalid input received: Invalid volume: Volume 2eef19fa-ce32-4ea0-8cf4-7cf73b4526b1 status must be available or downloading to reserve, but the current status is in-use. (HTTP 400) (Request-ID: req-57aa1c28-4b14-45c3-9426-d31395d1be34)"}}
$ kubectl -n pulp describe pvc pulp-file-storage
Name: pulp-file-storage
Namespace: pulp
StorageClass: csi-cinder-high-speed
Status: Bound
Volume: ovh-managed-kubernetes-oys8ou-pvc-5d3d4a86-68f3-4f97-ab67-ee8c7e2c66e5
Labels: app.kubernetes.io/component=storage
app.kubernetes.io/managed-by=pulp-operator
app.kubernetes.io/part-of=pulp
pulp_cr=pulp
Annotations: pv.kubernetes.io/bind-completed: yes
pv.kubernetes.io/bound-by-controller: yes
volume.beta.kubernetes.io/storage-provisioner: cinder.csi.openstack.org
volume.kubernetes.io/storage-provisioner: cinder.csi.openstack.org
Finalizers: [kubernetes.io/pvc-protection]
Capacity: 100Gi
Access Modes: RWX
VolumeMode: Filesystem
Used By: pulp-api-7d45f95595-th475
pulp-content-67d4874c45-q6bcm
pulp-worker-6f6f7484c7-qklxk
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Provisioning 8m30s cinder.csi.openstack.org_csi-cinder-controllerplugin-8464d84df-qsdbj_9994f747-dbf3-4a0a-981e-377575d86266 External provisioner is provisioning volume for claim "pulp/pulp-file-storage"
Normal ExternalProvisioning 8m30s persistentvolume-controller Waiting for a volume to be created either by the external provisioner 'cinder.csi.openstack.org' or manually by the system administrator. If volume creation is delayed, please verify that the provisioner is running and correctly registered.
Normal ProvisioningSucceeded 8m30s cinder.csi.openstack.org_csi-cinder-controllerplugin-8464d84df-qsdbj_9994f747-dbf3-4a0a-981e-377575d86266 Successfully provisioned volume ovh-managed-kubernetes-oys8ou-pvc-5d3d4a86-68f3-4f97-ab67-ee8c7e2c66e5
Persistent volume itself:
$ kubectl -n pulp describe pv ovh-managed-kubernetes-oys8ou-pvc-5d3d4a86-68f3-4f97-ab67-ee8c7e2c66e5
Name: ovh-managed-kubernetes-oys8ou-pvc-5d3d4a86-68f3-4f97-ab67-ee8c7e2c66e5
Labels: <none>
Annotations: pv.kubernetes.io/provisioned-by: cinder.csi.openstack.org
volume.kubernetes.io/provisioner-deletion-secret-name:
volume.kubernetes.io/provisioner-deletion-secret-namespace:
Finalizers: [kubernetes.io/pv-protection external-attacher/cinder-csi-openstack-org]
StorageClass: csi-cinder-high-speed
Status: Bound
Claim: pulp/pulp-file-storage
Reclaim Policy: Delete
Access Modes: RWX
VolumeMode: Filesystem
Capacity: 100Gi
Node Affinity:
Required Terms:
Term 0: topology.cinder.csi.openstack.org/zone in [nova]
Message:
Source:
Type: CSI (a Container Storage Interface (CSI) volume source)
Driver: cinder.csi.openstack.org
FSType: ext4
VolumeHandle: 2eef19fa-ce32-4ea0-8cf4-7cf73b4526b1
ReadOnly: false
VolumeAttributes: storage.kubernetes.io/csiProvisionerIdentity=1724323204382-7413-cinder.csi.openstack.org
Events: <none>
After deleting pulp-worker-6f6f7484c7-qklxk, a new worker pod was created and it succeeded in mounting the volume. I also managed to run the requested commands between crash loops.
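For reference, the delete was just a plain pod delete along these lines (the ReplicaSet then recreates the worker):
kubectl -n pulp delete pod pulp-worker-6f6f7484c7-qklxk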
$ kubectl -n pulp logs pulp-worker-6f6f7484c7-cncxc
Waiting on postgresql to start...
Postgres started.
Checking for database migrations
SystemCheckError: System check identified some issues:
ERRORS:
?: (files.E001) The FILE_UPLOAD_TEMP_DIR setting refers to the nonexistent directory '/var/lib/pulp/tmp'.
Database migrated!
pulp [None]: pulpcore.tasking.entrypoint:INFO: Starting distributed type worker
Traceback (most recent call last):
File "/usr/local/bin/pulpcore-worker", line 8, in <module>
sys.exit(worker())
File "/usr/local/lib/python3.9/site-packages/click/core.py", line 1157, in __call__
return self.main(*args, **kwargs)
File "/usr/local/lib/python3.9/site-packages/click/core.py", line 1078, in main
rv = self.invoke(ctx)
File "/usr/local/lib/python3.9/site-packages/click/core.py", line 1434, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/usr/local/lib/python3.9/site-packages/click/core.py", line 783, in invoke
return __callback(*args, **kwargs)
File "/usr/local/lib/python3.9/site-packages/pulpcore/tasking/entrypoint.py", line 43, in worker
PulpcoreWorker().run(burst=burst)
File "/usr/local/lib/python3.9/site-packages/pulpcore/tasking/worker.py", line 462, in run
with WorkerDirectory(self.name):
File "/usr/local/lib/python3.9/site-packages/pulpcore/tasking/storage.py", line 83, in __enter__
self.create()
File "/usr/local/lib/python3.9/site-packages/pulpcore/tasking/storage.py", line 133, in create
super().create()
File "/usr/local/lib/python3.9/site-packages/pulpcore/tasking/storage.py", line 40, in create
os.makedirs(self.path, mode=self.MODE)
File "/usr/lib64/python3.9/os.py", line 215, in makedirs
makedirs(head, exist_ok=exist_ok)
File "/usr/lib64/python3.9/os.py", line 225, in makedirs
mkdir(name, mode)
PermissionError: [Errno 13] Permission denied: '/var/lib/pulp/tmp'
$ kubectl -n pulp exec -it deployment/pulp-worker -- ls -la /var/lib/pulp
total 28
drwxr-xr-x 4 root root 4096 Sep 3 13:55 .
drwxr-xr-x 1 root root 4096 Aug 28 01:44 ..
drwxr-xr-x 3 root root 4096 Sep 3 13:55 .local
drwx------ 2 root root 16384 Sep 3 13:54 lost+found
$ kubectl -n pulp exec -it deployment/pulp-worker -- id
uid=700(pulp) gid=700(pulp) groups=700(pulp)
Hi @vkukk
Unfortunately, I don't have access to an OVH k8s cluster to run some tests and investigate this further, but here are some things that I noticed:
1. From the OVH docs, this storage class does not support RWX: https://help.ovhcloud.com/csm/en-ie-public-cloud-kubernetes-set-up-persistent-volume?id=kb_article_view&sysparm_article=KB0049964#access-modes "Our storage resource, Cinder, doesn't allow to mount a PV on several nodes at the same time, so you need to use the ReadWriteOnce access mode."
which probably explains the AttachVolume error:
Warning FailedAttachVolume 84s attachdetach-controller AttachVolume.Attach failed for volume "ovh-managed-kubernetes-oys8ou-pvc-5d3d4a86-68f3-4f97-ab67-ee8c7e2c66e5" : rpc error: code = Internal desc = [ControllerPublishVolume] Attach Volume failed with error failed to attach 2eef19fa-ce32-4ea0-8cf4-7cf73b4526b1 volume to 1936c581-5e44-4036-b1e9-9c408b6048cb compute: Bad request with: [POST https://compute.gra5.cloud.ovh.net/v2.1/4058bdfd71674fa0afb69dbe0d63ae85/servers/1936c581-5e44-4036-b1e9-9c408b6048cb/os-volume_attachments], error message: {"badRequest": {"code": 400, "message": "Invalid input received: Invalid volume: Volume 2eef19fa-ce32-4ea0-8cf4-7cf73b4526b1 status must be available or downloading to reserve, but the current status is in-use. (HTTP 400) (Request-ID: req-57aa1c28-4b14-45c3-9426-d31395d1be34)"}}
2. The permission error seems to be a known issue on OVHcloud Managed Kubernetes: https://help.ovhcloud.com/csm/en-public-cloud-kubernetes-persistentvolumes-permission-errors?id=kb_article_view&sysparm_article=KB0049758
So, to work around problem 1, the path I can think of is pinning the api, content, and worker pods to the same node with a node_selector:
spec:
  api:
    node_selector:
      <my node A label key>: <my node A label value>
  content:
    node_selector:
      <my node A label key>: <my node A label value>
  worker:
    node_selector:
      <my node A label key>: <my node A label value>
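For example, labelling one node and filling the placeholders could look like this (the label key/value here are hypothetical, pick whatever fits your cluster):
kubectl label node <node-A-name> pulp-storage-node=true
and then in the Pulp CR:
node_selector:
  pulp-storage-node: "true"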
For problem 2, it seems like the OVH team is working on a patch for this error: https://help.ovhcloud.com/csm/en-public-cloud-kubernetes-persistentvolumes-permission-errors?id=kb_article_view&sysparm_article=KB0049758#provided-solutions
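In the meantime, one possible stopgap is to pre-create /var/lib/pulp/tmp on the shared volume as root and hand ownership to the pulp user (uid/gid 700, as shown in the id output above). This is only a sketch, not a pulp-operator feature: the Job name and image are hypothetical, and with a ReadWriteOnce volume the Job pod has to land on the node where the volume can be attached.
apiVersion: batch/v1
kind: Job
metadata:
  name: pulp-fix-storage-perms          # hypothetical name
  namespace: pulp
spec:
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: fix-perms
          image: busybox:1.36            # assumption: any image with mkdir/chown will do
          securityContext:
            runAsUser: 0                 # run as root so chown on the volume is allowed
          command:
            - sh
            - -c
            - mkdir -p /var/lib/pulp/tmp && chown -R 700:700 /var/lib/pulp
          volumeMounts:
            - name: file-storage
              mountPath: /var/lib/pulp
      volumes:
        - name: file-storage
          persistentVolumeClaim:
            claimName: pulp-file-storage  # the PVC from this issue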
I got pulp-operator working with S3. However, I had to revert to using OVH Cinder volumes for file storage. Since OVH does not support ReadWriteMany, only ReadWriteOnce, using a shared volume means that all pulp-api, pulp-worker and pulp-content pods must be on the same node that has the volume mounted.
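For reference, a minimal sketch of the file-storage part of the Pulp CR under that constraint, assuming the operator exposes file_storage_access_mode, file_storage_size and file_storage_storage_class as described in its docs (field names may differ between operator versions):
spec:
  file_storage_access_mode: ReadWriteOnce
  file_storage_size: 100Gi
  file_storage_storage_class: csi-cinder-high-speed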
The suggestion to use node_selector did not suit me well, because it would mean actively managing node labels and would make only a single node usable for those pods at a time. Not a very HA approach. I wanted all those pods to be on the same node as the other pods that need to share the volume: not on a specific labeled node, but on whichever node already runs any of these pods.
Using podAffinity works well for gathering all the pods together on a single node. If that node goes down, the pods all move to some other node, but they always stay together.
worker:
  affinity:
    podAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        - labelSelector:
            matchExpressions:
              - key: app.kubernetes.io/name
                operator: In
                values:
                  - pulp-api
                  - pulp-worker
                  - pulp-content
          topologyKey: kubernetes.io/hostname
Pods whose app.kubernetes.io/name label matches one of those values are scheduled together based on the kubernetes.io/hostname topologyKey.
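To confirm they all landed on the same node, the NODE column of the wide pod listing is enough:
kubectl -n pulp get pods -o wide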
Version: Please provide the versions of the pulp-operator and pulp images in use.
Describe the bug: Worker pods won't start.
To Reproduce: Install the latest version using Helm. Using kubectl, install the following Pulp CR:
Expected behavior: Workers should start.
Additional context: OVH Managed Kubernetes 1.30.2