@pradumnapandit what is your `cephlet`'s command line configuration for `--populator-image`?
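For reference, a minimal sketch of checking which populator image the cephlet is configured with; the namespace and Deployment name below are assumptions and may differ in your setup:

```
# Hypothetical names: adjust the namespace/Deployment to match your cephlet installation.
# Prints the manager container args so the --populator-image value can be verified.
kubectl -n cephlet-system get deployment cephlet-controller-manager \
  -o jsonpath='{.spec.template.spec.containers[0].args}{"\n"}'
```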
We tried setting the image to `ghcr.io/onmetal/populator:sha-1c2bf11`, but after that the populator pod fails and the volume gets stuck in the Pending state.
@pradumnapandit does it fail with the error reported above or another error?
@adracus we are getting the error `populator pod failed`:
2023-02-21T09:30:30Z ERROR Reconciler error {"controller": "persistentvolumeclaim", "controllerGroup": "", "controllerKind": "PersistentVolumeClaim", "PersistentVolumeClaim": {"name":"ghcr.io-hardikdr-onmetal-image-gardenlinux-ign-test8","namespace":"rook-ceph"}, "namespace": "rook-ceph", "name": "ghcr.io-hardikdr-onmetal-image-gardenlinux-ign-test8", "reconcileID": "cfc3c9c6-ac6d-44e9-b0a3-cc506cbfcf63", "error": "populator pod populator-system/populate-86e960aa-a97d-4ce3-ab54-750a2bc1d33e failed"}
@aditya-dixit99 what is the output of the populator pod? What image are you trying to populate? Is that image well-formed (i.e. does it have a rootfs layer)?
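As a side note, a hedged way to capture the populator pod's output before it disappears (the namespace and pod name pattern are taken from the reconciler error above; the actual pod name will differ per run):

```
# List populator pods in the populator-system namespace seen in the reconciler error
kubectl -n populator-system get pods
# Follow the logs of a specific populator pod; the name below is only an example
kubectl -n populator-system logs populate-86e960aa-a97d-4ce3-ab54-750a2bc1d33e --follow
```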
@adracus Image we tried to populate: ghcr.io/hardikdr/onmetal-image/gardenlinux:ign-test8
The populator pod came up for a few seconds, and we were able to capture its logs:
2023-02-21T09:30:29Z INFO Starting image population
2023-02-21T09:30:36Z INFO Successfully populated device {"Device": "/dev/block"}
It seems the image has an invalid rootfs in its image configuration.
@adracus, can we get a sample image? We got the above image from Hardik.
I'm mostly using `ghcr.io/hardikdr/onmetal-image/gardenlinux:ign-test7` right now. I'm also inspecting the image you referenced. However, looking at the logs @aditya-dixit99 provided, it seems the population was successful. Is the state of the pod Succeeded? Do you have any other watchdogs in your cluster that might clean up the pod prematurely?
Just checked the image `ghcr.io/hardikdr/onmetal-image/gardenlinux:ign-test8`, and at least the metadata / layers look right.
@adracus `docker pull ghcr.io/hardikdr/onmetal-image/gardenlinux:ign-test8` shows the error below, which is why we are not able to inspect it.
root@master:~/cephlet# docker pull ghcr.io/hardikdr/onmetal-image/gardenlinux:ign-test8
ign-test8: Pulling from hardikdr/onmetal-image/gardenlinux
1ae2163ce3ce: Pulling fs layer
3fc8162696d1: Pulling fs layer
629ef186a225: Pulling fs layer
invalid rootfs in image configuration
The populator pod vanishes before we can see its state, but we did get `populator pod failed` in the cephlet logs.
Additionally, we don't have any watchdog in the cluster that could clean up the pod prematurely.
You can't `docker pull` an onmetal-image - those are two different formats. You need to use http://github.com/onmetal/onmetal-image for pulling & inspecting these images.
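For illustration only, a hedged sketch of pulling and inspecting the image with the onmetal-image CLI; the subcommand names are assumptions based on typical OCI tooling, so check `onmetal-image --help` for the exact commands in your build:

```
# Assumption: the CLI built from github.com/onmetal/onmetal-image exposes pull/inspect-style
# subcommands; verify the exact names with `onmetal-image --help`.
onmetal-image pull ghcr.io/hardikdr/onmetal-image/gardenlinux:ign-test8
onmetal-image inspect ghcr.io/hardikdr/onmetal-image/gardenlinux:ign-test8
```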
Regarding `populator pod failed`: this 'error' is produced if the population has not yet succeeded but is still ongoing - is the volume in the correct state after a while?
The volume is still in the Pending state.
root@master:/home/tux# kubectl get volume -A
NAMESPACE NAME VOLUMEPOOLREF IMAGE VOLUMECLASS STATE PHASE AGE
rook-ceph sample-volume-2 ceph ghcr.io/hardikdr/onmetal-image/gardenlinux:ign-test8 fast Pending Unbound 91m
Can you please `kubectl describe` the `Volume` which is in the Pending state?
Output of kubectl describe volume
root@master:/home/tux# kubectl describe volume sample-volume-2 -n rook-ceph
Name: sample-volume-2
Namespace: rook-ceph
Labels: <none>
Annotations: <none>
API Version: storage.api.onmetal.de/v1alpha1
Kind: Volume
Metadata:
Creation Timestamp: 2023-02-21T10:21:37Z
Managed Fields:
API Version: storage.api.onmetal.de/v1alpha1
Fields Type: FieldsV1
fieldsV1:
f:metadata:
f:annotations:
.:
f:kubectl.kubernetes.io/last-applied-configuration:
f:spec:
f:image:
f:resources:
.:
f:storage:
f:volumeClassRef:
f:volumePoolRef:
f:status:
f:phase:
f:state:
Manager: kubectl-client-side-apply
Operation: Update
Time: 2023-02-21T10:21:37Z
Resource Version: 93
UID: cffca656-932e-4318-bc5b-084349c6bd55
Spec:
Image: ghcr.io/hardikdr/onmetal-image/gardenlinux:ign-test8
Resources:
Storage: 15Gi
Volume Class Ref:
Name: fast
Volume Pool Ref:
Name: ceph
Status:
Phase: Unbound
State: Pending
Events: <none>
Okay. That looks fine so far.
Can you also check the corresponding PVCs & PVs from the `Volume` itself and from the `Snapshot`?
You might also find some insights by checking the `csi-rbdplugin` & `csi-rbdplugin-provisioner` logs.
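For example, a hedged way to pull those logs (the `rook-ceph` namespace is taken from this thread; the label selectors and container names are assumptions based on a typical Rook deployment and may differ in your release):

```
# Label selectors and container names may differ in your Rook release.
kubectl -n rook-ceph logs -l app=csi-rbdplugin-provisioner -c csi-provisioner --tail=100
kubectl -n rook-ceph logs -l app=csi-rbdplugin -c csi-rbdplugin --tail=100
```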
PVC is not getting created.
root@master:/home/tux# kubectl get pvc -A
NAMESPACE NAME STATUS VOLUME CAPACITY ACCESS MODES STORAGECLASS AGE
onmetal-system etcd-data-dir-onmetal-etcd-0 Bound pvc-cf59c4fe-82c7-4ccd-ab31-3a2dad8ea13d 1Gi RWO local-path 5h26m
rook-ceph ghcr.io-hardikdr-onmetal-image-gardenlinux-ign-test8 Pending volume-rook-ceph--ceph 108m
Additionally, we checked the logs of the `snapshot-controller` pod:
I0221 12:12:42.078454 1 reflector.go:243] Listing and watching *v1beta1.VolumeSnapshot from github.com/kubernetes-csi/external-snapshotter/client/v3/informers/externalversions/factory.go:117
E0221 12:12:42.079788 1 reflector.go:127] github.com/kubernetes-csi/external-snapshotter/client/v3/informers/externalversions/factory.go:117: Failed to watch *v1beta1.VolumeSnapshot: failed to list *v1beta1.VolumeSnapshot: the server could not find the requested resource (get volumesnapshots.snapshot.storage.k8s.io)
I0221 12:12:42.903894 1 reflector.go:369] github.com/kubernetes-csi/external-snapshotter/client/v3/informers/externalversions/factory.go:117: forcing resync
I0221 12:13:06.956191 1 reflector.go:243] Listing and watching *v1beta1.VolumeSnapshotClass from github.com/kubernetes-csi/external-snapshotter/client/v3/informers/externalversions/factory.go:117
E0221 12:13:06.957974 1 reflector.go:127] github.com/kubernetes-csi/external-snapshotter/client/v3/informers/externalversions/factory.go:117: Failed to watch *v1beta1.VolumeSnapshotClass: failed to list *v1beta1.VolumeSnapshotClass: the server could not find the requested resource (get volumesnapshotclasses.snapshot.storage.k8s.io)
Okay, we checked `csi-rbdplugin` and `csi-rbdplugin-provisioner` but didn't find anything specific.
At the moment the `cephlet` only creates a `PVC`, and the actual storage provisioning is done by `ceph-csi`. So if the `PVC` is pending, the `Volume` is pending as well.
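A quick way to see why such a PVC stays pending is to inspect its events; the namespace and PVC name below are taken from the listing above:

```
# Show events for the pending PVC created by the cephlet (names from the earlier kubectl output)
kubectl -n rook-ceph describe pvc ghcr.io-hardikdr-onmetal-image-gardenlinux-ign-test8
```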
Ah, just spotted this in your attached logs:
Failed to watch *v1beta1.VolumeSnapshotClass: failed to list *v1beta1.VolumeSnapshotClass: the server could not find the requested resource (get volumesnapshotclasses.snapshot.storage.k8s.io)
The reason for the problem is that `volumesnapshotclasses.snapshot.storage.k8s.io` is not present.
`volumesnapshotclasses.snapshot.storage.k8s.io` is present. The following is the way to check it, right?
root@master:/home/tux# kubectl get volumesnapshotclasses -A
NAME DRIVER DELETIONPOLICY AGE
volume-rook-ceph--ceph rook-ceph.rbd.csi.ceph.com Delete 5h9m
Can you please list the known `api-resources`? You need to have `volumesnapshotclasses.snapshot.storage.k8s.io/v1`.
@lukasfrank here is the output:
root@master:/home/tux# kubectl api-resources | grep "snapshot.storage.k8s.io"
volumesnapshotclasses vsclass,vsclasses snapshot.storage.k8s.io/v1 false VolumeSnapshotClass
volumesnapshotcontents snapshot.storage.k8s.io/v1beta1 false VolumeSnapshotContent
volumesnapshots vs snapshot.storage.k8s.io/v1 true VolumeSnapshot
Okay, as you can see there is a version mismatch:
CSI: VolumeSnapshotClass v1beta1
Installed: VolumeSnapshotClass v1
Make sure `volumesnapshotclasses`, `volumesnapshotcontents` and `volumesnapshots` are present in the cluster and have API version `v1`. Furthermore, the CSI and the snapshotter must match/reference this version.
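As an illustration, matching `v1` CRDs and a compatible snapshot-controller can be installed from the kubernetes-csi/external-snapshotter repository; the release tag below is an assumption, so pick the version that matches your CSI sidecars:

```
# Assumed release tag; choose the external-snapshotter version matching your CSI sidecars
SNAPSHOTTER_VERSION=v6.2.1
BASE=https://raw.githubusercontent.com/kubernetes-csi/external-snapshotter/${SNAPSHOTTER_VERSION}
# v1 snapshot CRDs
kubectl apply -f ${BASE}/client/config/crd/snapshot.storage.k8s.io_volumesnapshotclasses.yaml
kubectl apply -f ${BASE}/client/config/crd/snapshot.storage.k8s.io_volumesnapshotcontents.yaml
kubectl apply -f ${BASE}/client/config/crd/snapshot.storage.k8s.io_volumesnapshots.yaml
# snapshot-controller serving the v1 API
kubectl apply -f ${BASE}/deploy/kubernetes/snapshot-controller/rbac-snapshot-controller.yaml
kubectl apply -f ${BASE}/deploy/kubernetes/snapshot-controller/setup-snapshot-controller.yaml
```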
I will close the issue for now, since it's not `cephlet` related. Feel free to reopen if you can verify that we have an issue in the `cephlet` itself.
Describe the bug
The cephlet deployment has an issue creating the populator Pod.
To Reproduce
Deploy the cephlet.
Create a volume with the following manifest (a reconstruction is sketched below).
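The exact manifest was not captured in this thread; a hedged reconstruction based on the `kubectl describe volume` output above would look roughly like:

```yaml
# Reconstructed from the describe output in this thread; not the exact manifest from the report
apiVersion: storage.api.onmetal.de/v1alpha1
kind: Volume
metadata:
  name: sample-volume-2
  namespace: rook-ceph
spec:
  image: ghcr.io/hardikdr/onmetal-image/gardenlinux:ign-test8
  volumeClassRef:
    name: fast
  volumePoolRef:
    name: ceph
  resources:
    storage: 15Gi
```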
Error observed
Additional context
The volume is stuck in the Pending state.