ironcore-dev / ceph-provider

Ceph provider implementation of the IronCore storage interface
https://ironcore-dev.github.io/ceph-provider/
Apache License 2.0

Error creating Populator Pod #190

Closed pradumnapandit closed 1 year ago

pradumnapandit commented 1 year ago

Describe the bug
Deployment of cephlet fails to create the populator Pod.

To Reproduce

  1. Deploy cephlet

  2. Create a volume with the following manifest.

apiVersion: storage.api.onmetal.de/v1alpha1
kind: Volume
metadata:
  name: test-volume-2
  namespace: tsi
spec:
  volumePoolRef:
    name: ceph
  volumeClassRef:
    name: fast
  image: ghcr.io/hardikdr/onmetal-image/gardenlinux:ign-test8
  resources:
    storage: 15Gi

Error Observed

2023-02-21T10:22:13Z    ERROR   Reconciler error    {"controller": "persistentvolumeclaim", "controllerGroup": "", "controllerKind": "PersistentVolumeClaim", "PersistentVolumeClaim": {"name":"ghcr.io-hardikdr-onmetal-image-gardenlinux-ign-test8","namespace":"rook-ceph"}, "namespace": "rook-ceph", "name": "ghcr.io-hardikdr-onmetal-image-gardenlinux-ign-test8", "reconcileID": "122693de-b181-40f5-867d-8a141913ea33", "error": "could not create populator pod: Pod \"populate-17961ef1-9f2a-45b8-b49b-5e5791bc541d\" is invalid: spec.containers[0].image: Required value"}
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler
    /go/pkg/mod/sigs.k8s.io/controller-runtime@v0.14.4/pkg/internal/controller/controller.go:329
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
    /go/pkg/mod/sigs.k8s.io/controller-runtime@v0.14.4/pkg/internal/controller/controller.go:274
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2
    /go/pkg/mod/sigs.k8s.io/controller-runtime@v0.14.4/pkg/internal/controller/controller.go:235
2023-02-21T10:22:54Z    INFO    Reconciling PVC {"controller": "persistentvolumeclaim", "controllerGroup": "", "controllerKind": "PersistentVolumeClaim", "PersistentVolumeClaim": {"name":"ghcr.io-hardikdr-onmetal-image-gardenlinux-ign-test8","namespace":"rook-ceph"}, "namespace": "rook-ceph", "name": "ghcr.io-hardikdr-onmetal-image-gardenlinux-ign-test8", "reconcileID": "ab320c25-cac6-4e12-a9b8-309de2b2f9f3"}
2023-02-21T10:22:54Z    INFO    Found datasource ref for PVC    {"controller": "persistentvolumeclaim", "controllerGroup": "", "controllerKind": "PersistentVolumeClaim", "PersistentVolumeClaim": {"name":"ghcr.io-hardikdr-onmetal-image-gardenlinux-ign-test8","namespace":"rook-ceph"}, "namespace": "rook-ceph", "name": "ghcr.io-hardikdr-onmetal-image-gardenlinux-ign-test8", "reconcileID": "ab320c25-cac6-4e12-a9b8-309de2b2f9f3", "DataSourceRef": "&TypedObjectReference{APIGroup:*storage.api.onmetal.de/v1alpha1,Kind:Volume,Name:sample-volume-2,Namespace:nil,}"}
2023-02-21T10:22:54Z    INFO    Found volume as datasource ref for PVC  {"controller": "persistentvolumeclaim", "controllerGroup": "", "controllerKind": "PersistentVolumeClaim", "PersistentVolumeClaim": {"name":"ghcr.io-hardikdr-onmetal-image-gardenlinux-ign-test8","namespace":"rook-ceph"}, "namespace": "rook-ceph", "name": "ghcr.io-hardikdr-onmetal-image-gardenlinux-ign-test8", "reconcileID": "ab320c25-cac6-4e12-a9b8-309de2b2f9f3", "Volume": {"namespace": "rook-ceph", "name": "sample-volume-2"}}
2023-02-21T10:22:54Z    INFO    Found StorageClass for PVC  {"controller": "persistentvolumeclaim", "controllerGroup": "", "controllerKind": "PersistentVolumeClaim", "PersistentVolumeClaim": {"name":"ghcr.io-hardikdr-onmetal-image-gardenlinux-ign-test8","namespace":"rook-ceph"}, "namespace": "rook-ceph", "name": "ghcr.io-hardikdr-onmetal-image-gardenlinux-ign-test8", "reconcileID": "ab320c25-cac6-4e12-a9b8-309de2b2f9f3", "StorageClass": {"name": "volume-rook-ceph--ceph"}}
2023-02-21T10:22:54Z    ERROR   Reconciler error    {"controller": "persistentvolumeclaim", "controllerGroup": "", "controllerKind": "PersistentVolumeClaim", "PersistentVolumeClaim": {"name":"ghcr.io-hardikdr-onmetal-image-gardenlinux-ign-test8","namespace":"rook-ceph"}, "namespace": "rook-ceph", "name": "ghcr.io-hardikdr-onmetal-image-gardenlinux-ign-test8", "reconcileID": "ab320c25-cac6-4e12-a9b8-309de2b2f9f3", "error": "could not create populator pod: Pod \"populate-17961ef1-9f2a-45b8-b49b-5e5791bc541d\" is invalid: spec.containers[0].image: Required value"}
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler
    /go/pkg/mod/sigs.k8s.io/controller-runtime@v0.14.4/pkg/internal/controller/controller.go:329
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
    /go/pkg/mod/sigs.k8s.io/controller-runtime@v0.14.4/pkg/internal/controller/controller.go:274
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2
    /go/pkg/mod/sigs.k8s.io/controller-runtime@v0.14.4/pkg/internal/controller/controller.go:235
2023-02-21T10:24:16Z    INFO    Reconciling PVC {"controller": "persistentvolumeclaim", "controllerGroup": "", "controllerKind": "PersistentVolumeClaim", "PersistentVolumeClaim": {"name":"ghcr.io-hardikdr-onmetal-image-gardenlinux-ign-test8","namespace":"rook-ceph"}, "namespace": "rook-ceph", "name": "ghcr.io-hardikdr-onmetal-image-gardenlinux-ign-test8", "reconcileID": "943e4fb3-c071-49b0-9b15-c633a05cd5f9"}

Additional context

adracus commented 1 year ago

@pradumnapandit what is your cephlet's command line configuration for --populator-image?
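A minimal sketch of how to check this, assuming the cephlet runs as a Deployment named cephlet-controller-manager in a cephlet-system namespace (both names are assumptions, adjust them to your setup):

# Show the container args and look for the populator image flag (namespace/deployment names are assumptions).
kubectl -n cephlet-system get deployment cephlet-controller-manager \
  -o jsonpath='{.spec.template.spec.containers[*].args}' | tr ',' '\n' | grep populator-image

# If the flag is missing or empty, add it to the manager args, e.g. --populator-image=<image>:
kubectl -n cephlet-system edit deployment cephlet-controller-manager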

pradumnapandit commented 1 year ago

We tried setting the image to ghcr.io/onmetal/populator:sha-1c2bf11, but after that the populator pod fails and the volume gets stuck in the Pending state.

adracus commented 1 year ago

@pradumnapandit does it fail with the error reported above or another error?

aditya-dixit99 commented 1 year ago

@adracus We are getting the following error indicating that the populator pod failed:

2023-02-21T09:30:30Z ERROR Reconciler error {"controller": "persistentvolumeclaim", "controllerGroup": "", "controllerKind": "PersistentVolumeClaim", "PersistentVolumeClaim": {"name":"ghcr.io-hardikdr-onmetal-image-gardenlinux-ign-test8","namespace":"rook-ceph"}, "namespace": "rook-ceph", "name": "ghcr.io-hardikdr-onmetal-image-gardenlinux-ign-test8", "reconcileID": "cfc3c9c6-ac6d-44e9-b0a3-cc506cbfcf63", "error": "populator pod populator-system/populate-86e960aa-a97d-4ce3-ab54-750a2bc1d33e failed"}

adracus commented 1 year ago

@aditya-dixit99 what is the output of the populator pod? What image are you trying to populate? Is that image well-formed (i.e. has a rootfs layer)?

aditya-dixit99 commented 1 year ago

@adracus The image we tried to populate is ghcr.io/hardikdr/onmetal-image/gardenlinux:ign-test8. The populator pod ran for a few seconds, and we were able to capture its logs:


2023-02-21T09:30:29Z    INFO    Starting image population
2023-02-21T09:30:36Z    INFO    Successfully populated device   {"Device": "/dev/block"}

It seems the image has an invalid rootfs in its configuration.

pradumnapandit commented 1 year ago

@adracus, can we get a sample image? We got the above image from Hardik.

adracus commented 1 year ago

I'm mostly using ghcr.io/hardikdr/onmetal-image/gardenlinux:ign-test7 right now. I'm also inspecting the image you referenced. However, looking at the logs @aditya-dixit99 provided, it seems the population is successful - is the state of the pod Succeeded / do you have any other watchdogs in your cluster that might clean up the pod prematurely?
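A quick way to catch the pod state before it disappears (pod name and namespace are taken from the error log above; substitute the current populator pod if it differs):

# Watch the populator pods and read the phase of the one named in the error log.
kubectl -n populator-system get pods -w
kubectl -n populator-system get pod populate-86e960aa-a97d-4ce3-ab54-750a2bc1d33e -o jsonpath='{.status.phase}'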

adracus commented 1 year ago

Just checked the image ghcr.io/hardikdr/onmetal-image/gardenlinux:ign-test8 and at least the metadata / layers look right.

aditya-dixit99 commented 1 year ago

@adracus docker pull ghcr.io/hardikdr/onmetal-image/gardenlinux:ign-test8 shows the error below, which is why we are not able to inspect the image.

root@master:~/cephlet# docker pull ghcr.io/hardikdr/onmetal-image/gardenlinux:ign-test8
ign-test8: Pulling from hardikdr/onmetal-image/gardenlinux
1ae2163ce3ce: Pulling fs layer 
3fc8162696d1: Pulling fs layer 
629ef186a225: Pulling fs layer 
invalid rootfs in image configuration

The populator pod vanishes before we can see its state, but we did get populator pod failed in the cephlet logs. Additionally, we don't have any watchdog in the cluster that could clean up the pod prematurely.

adracus commented 1 year ago

You can't docker pull an onmetal-image - those are two different formats. You need to use http://github.com/onmetal/onmetal-image to pull and inspect these images. Regarding populator pod failed: this 'error' is produced if the population has not yet succeeded but is still ongoing - is the volume in the correct state after a while?
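A rough sketch of pulling and inspecting the image with that CLI; the subcommand names below are assumptions, so verify the exact syntax with onmetal-image --help first:

# Assumed subcommands - check `onmetal-image --help` before relying on them.
onmetal-image pull ghcr.io/hardikdr/onmetal-image/gardenlinux:ign-test8
onmetal-image inspect ghcr.io/hardikdr/onmetal-image/gardenlinux:ign-test8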

pradumnapandit commented 1 year ago

The volume is still in the Pending state.

root@master:/home/tux# kubectl get volume -A
NAMESPACE   NAME              VOLUMEPOOLREF   IMAGE                                                  VOLUMECLASS   STATE     PHASE     AGE
rook-ceph   sample-volume-2   ceph            ghcr.io/hardikdr/onmetal-image/gardenlinux:ign-test8   fast          Pending   Unbound   91m
lukasfrank commented 1 year ago

Can you please kubectl describe the Volume that is in the Pending state?

pradumnapandit commented 1 year ago

Output of kubectl describe volume

root@master:/home/tux# kubectl describe volume sample-volume-2 -n rook-ceph 
Name:         sample-volume-2
Namespace:    rook-ceph
Labels:       <none>
Annotations:  <none>
API Version:  storage.api.onmetal.de/v1alpha1
Kind:         Volume
Metadata:
  Creation Timestamp:  2023-02-21T10:21:37Z
  Managed Fields:
    API Version:  storage.api.onmetal.de/v1alpha1
    Fields Type:  FieldsV1
    fieldsV1:
      f:metadata:
        f:annotations:
          .:
          f:kubectl.kubernetes.io/last-applied-configuration:
      f:spec:
        f:image:
        f:resources:
          .:
          f:storage:
        f:volumeClassRef:
        f:volumePoolRef:
      f:status:
        f:phase:
        f:state:
    Manager:         kubectl-client-side-apply
    Operation:       Update
    Time:            2023-02-21T10:21:37Z
  Resource Version:  93
  UID:               cffca656-932e-4318-bc5b-084349c6bd55
Spec:
  Image:  ghcr.io/hardikdr/onmetal-image/gardenlinux:ign-test8
  Resources:
    Storage:  15Gi
  Volume Class Ref:
    Name:  fast
  Volume Pool Ref:
    Name:  ceph
Status:
  Phase:  Unbound
  State:  Pending
Events:   <none>
lukasfrank commented 1 year ago

Okay. That looks fine so far. Can you also check the corresponding PVCs & PVs from the Volume itself and from the Snapshot?

lukasfrank commented 1 year ago

You might also find some insights by checking the csi-rbdplugin & csi-rbdplugin-provisioner logs.
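For example (the label selectors and container names are assumptions based on a typical rook-ceph deployment; adjust them to your cluster):

# Tail the provisioner and node-plugin logs in the rook-ceph namespace.
kubectl -n rook-ceph logs -l app=csi-rbdplugin-provisioner -c csi-provisioner --tail=200
kubectl -n rook-ceph logs -l app=csi-rbdplugin -c csi-rbdplugin --tail=200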

pradumnapandit commented 1 year ago

The PVC is stuck in Pending and never gets bound:

root@master:/home/tux# kubectl get pvc -A
NAMESPACE        NAME                                                   STATUS    VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS             AGE
onmetal-system   etcd-data-dir-onmetal-etcd-0                           Bound     pvc-cf59c4fe-82c7-4ccd-ab31-3a2dad8ea13d   1Gi        RWO            local-path               5h26m
rook-ceph        ghcr.io-hardikdr-onmetal-image-gardenlinux-ign-test8   Pending                                                                        volume-rook-ceph--ceph   108m

Additionally, we checked the logs of the snapshot-controller pod:

I0221 12:12:42.078454       1 reflector.go:243] Listing and watching *v1beta1.VolumeSnapshot from github.com/kubernetes-csi/external-snapshotter/client/v3/informers/externalversions/factory.go:117
E0221 12:12:42.079788       1 reflector.go:127] github.com/kubernetes-csi/external-snapshotter/client/v3/informers/externalversions/factory.go:117: Failed to watch *v1beta1.VolumeSnapshot: failed to list *v1beta1.VolumeSnapshot: the server could not find the requested resource (get volumesnapshots.snapshot.storage.k8s.io)
I0221 12:12:42.903894       1 reflector.go:369] github.com/kubernetes-csi/external-snapshotter/client/v3/informers/externalversions/factory.go:117: forcing resync
I0221 12:13:06.956191       1 reflector.go:243] Listing and watching *v1beta1.VolumeSnapshotClass from github.com/kubernetes-csi/external-snapshotter/client/v3/informers/externalversions/factory.go:117
E0221 12:13:06.957974       1 reflector.go:127] github.com/kubernetes-csi/external-snapshotter/client/v3/informers/externalversions/factory.go:117: Failed to watch *v1beta1.VolumeSnapshotClass: failed to list *v1beta1.VolumeSnapshotClass: the server could not find the requested resource (get volumesnapshotclasses.snapshot.storage.k8s.io)

Okay, we checked the csi-rbdplugin and csi-rbdplugin-provisioner logs but didn't find anything specific.

lukasfrank commented 1 year ago

At the moment the cephlet only creates a PVC, and the actual storage provisioning is done by ceph-csi. So if the PVC is pending, the Volume is pending as well.
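So the next thing to look at is why the PVC itself is pending, e.g. (PVC name and namespace taken from the outputs above):

# The Events section of the describe output usually shows why ceph-csi has not provisioned a PV yet.
kubectl -n rook-ceph describe pvc ghcr.io-hardikdr-onmetal-image-gardenlinux-ign-test8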

lukasfrank commented 1 year ago

Ah, just spotted this in your attached logs:

Failed to watch *v1beta1.VolumeSnapshotClass: failed to list *v1beta1.VolumeSnapshotClass: the server could not find the requested resource (get volumesnapshotclasses.snapshot.storage.k8s.io)

The reason for the problem is that volumesnapshotclasses.snapshot.storage.k8s.io is not present.

pradumnapandit commented 1 year ago

volumesnapshotclasses.snapshot.storage.k8s.io is present. This is the way to check it, right?

root@master:/home/tux# kubectl get volumesnapshotclasses -A
NAME                     DRIVER                       DELETIONPOLICY   AGE
volume-rook-ceph--ceph   rook-ceph.rbd.csi.ceph.com   Delete           5h9m
lukasfrank commented 1 year ago

Can you please list the known api-resources? You need to have volumesnapshotclasses.snapshot.storage.k8s.io/v1

aditya-dixit99 commented 1 year ago

@lukasfrank here is the output:

root@master:/home/tux# kubectl api-resources  | grep "snapshot.storage.k8s.io"
volumesnapshotclasses             vsclass,vsclasses   snapshot.storage.k8s.io/v1             false        VolumeSnapshotClass
volumesnapshotcontents                                snapshot.storage.k8s.io/v1beta1        false        VolumeSnapshotContent
volumesnapshots                   vs                  snapshot.storage.k8s.io/v1             true         VolumeSnapshot
lukasfrank commented 1 year ago

Okay, as you can see there is a version mismatch:

CSI: VolumeSnapshotClass v1beta1
Installed: VolumeSnapshotClass v1

Make sure volumesnapshotclasses, volumesnapshotcontents and volumesnapshots are present in the cluster and have API version v1. Furthermore, the CSI driver and the snapshotter must match/reference this version.
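A sketch of how to check and fix this; the external-snapshotter tag below is only an example, pick the release that matches your CSI deployment:

# Check which API versions the snapshot CRDs currently serve.
kubectl get crd volumesnapshotclasses.snapshot.storage.k8s.io -o jsonpath='{.spec.versions[*].name}'
kubectl get crd volumesnapshotcontents.snapshot.storage.k8s.io -o jsonpath='{.spec.versions[*].name}'
kubectl get crd volumesnapshots.snapshot.storage.k8s.io -o jsonpath='{.spec.versions[*].name}'

# Install/upgrade the v1 snapshot CRDs from kubernetes-csi/external-snapshotter (the tag is an example).
kubectl apply -f https://raw.githubusercontent.com/kubernetes-csi/external-snapshotter/v6.2.1/client/config/crd/snapshot.storage.k8s.io_volumesnapshotclasses.yaml
kubectl apply -f https://raw.githubusercontent.com/kubernetes-csi/external-snapshotter/v6.2.1/client/config/crd/snapshot.storage.k8s.io_volumesnapshotcontents.yaml
kubectl apply -f https://raw.githubusercontent.com/kubernetes-csi/external-snapshotter/v6.2.1/client/config/crd/snapshot.storage.k8s.io_volumesnapshots.yaml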

I will close the issue for now, since it's not cephlet related. Feel free to reopen if you can verify that we have an issue in the cephlet itself.