Zalfsten opened this issue 3 months ago
From the log below, the timeout happened while creating a volume from the snapshot, which corresponds to the snapshot clone stage:
time="2024-08-13T13:10:17Z" level=info msg="Backup VSC is created from snapcontent-3082c491-ed58-4381-98fb-baeb8745d45b" logSource="pkg/exposer/csi_snapshot.go:170" owner=test5-8klq9 vsc name=test5-8klq9
time="2024-08-13T13:10:17Z" level=warning msg="The snapshot doesn't contain a valid restore size, use source volume's size {{5368709120 0} {<nil>} 5Gi BinarySI}" logSource="pkg/exposer/csi_snapshot.go:177" owner=test5-8klq9 vs name=velero-pvc-test-iscsi-fzn95
time="2024-08-13T13:10:17Z" level=info msg="Backup PVC is created" logSource="pkg/exposer/csi_snapshot.go:185" owner=test5-8klq9 pvc name=test5-8klq9
time="2024-08-13T13:10:17Z" level=info msg="Backup pod is created" logSource="pkg/exposer/csi_snapshot.go:197" owner=test5-8klq9 pod name=test5-8klq9
time="2024-08-13T13:10:17Z" level=info msg="Snapshot is exposed" controller=dataupload dataupload=velero/test5-8klq9
....
time="2024-08-13T13:40:33Z" level=info msg="Timeout happened for preparing dataupload" Dataupload=test5-8klq9 logSource="pkg/controller/data_upload_controller.go:741"
time="2024-08-13T13:40:33Z" level=info msg="Dataupload has been cleaned up" Dataupload=test5-8klq9 logSource="pkg/controller/data_upload_controller.go:770"
It is possible that the CSI driver doesn't support snapshot cloning. Please check your CSI driver's spec to see whether it is supported.
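One way to check this without Velero is to try the same operation by hand: take a VolumeSnapshot of an existing PVC and then create a new PVC that uses the snapshot as its dataSource, which is essentially what the exposer does at the snapshot clone stage. A rough sketch, assuming kubectl access; the namespace, PVC name, VolumeSnapshotClass and StorageClass below are placeholders to replace with values from your cluster:

# Take a snapshot of an existing PVC (names and classes are placeholders).
cat <<'EOF' | kubectl apply -f -
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshot
metadata:
  name: clone-test-snap
  namespace: default
spec:
  volumeSnapshotClassName: my-snapclass        # your VolumeSnapshotClass
  source:
    persistentVolumeClaimName: my-existing-pvc # any existing PVC
EOF
kubectl -n default wait volumesnapshot/clone-test-snap \
  --for=jsonpath='{.status.readyToUse}'=true --timeout=5m

# Create a new PVC from that snapshot -- roughly what Velero's backup PVC does.
cat <<'EOF' | kubectl apply -f -
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: clone-test-pvc
  namespace: default
spec:
  storageClassName: truenas-delete             # your StorageClass
  accessModes: ["ReadWriteOnce"]
  dataSource:
    apiGroup: snapshot.storage.k8s.io
    kind: VolumeSnapshot
    name: clone-test-snap
  resources:
    requests:
      storage: 5Gi
EOF
# If the driver supports provisioning from snapshots, the PVC should reach Bound.
kubectl -n default get pvc clone-test-pvc -w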
Thanks for the answer, @Lyndon-Li
It's hard to tell whether my CSI driver supports snapshot cloning. At the CSI level there is the capability "CLONE_VOLUME", and according to this list my driver supports clones. But are "snapshot cloning", "CLONE_VOLUME" and "clones" the same thing? Also, the only reliable way to be certain would be to query the controller itself, and I do not know how to do that.
But I will try out another CSI driver I'm using: csi-driver-nfs. Maybe this behaves differently...
I've also had a look at the CSI driver logs. For me it looks good so far -- although the volume expansion part worries me a bit:
I0814 16:42:52.720351 1 controller.go:1366] provision "velero/test1-rb6tl" class "truenas-delete": started
I0814 16:42:52.722153 1 event.go:364] Event(v1.ObjectReference{Kind:"PersistentVolumeClaim", Namespace:"velero", Name:"test1-rb6tl", UID:"27729f25-3337-4b97-b7bc-c99f413b0fc6", APIVersion:"v1", ResourceVersion:"17082142", FieldPath:""}): type: 'Normal' reason: 'Provisioning' External provisioner is provisioning volume for claim "velero/test1-rb6tl"
I0814 16:42:52.731537 1 controller.go:1150] VolumeSnapshot &{TypeMeta:{Kind: APIVersion:} ObjectMeta:{Name:test1-rb6tl GenerateName: Namespace:velero SelfLink: UID:9f62e55c-a1ae-439a-93ce-cde964158383 ResourceVersion:17082143 Generation:1 CreationTimestamp:2024-08-14 16:42:52 +0000 UTC DeletionTimestamp:<nil> DeletionGracePeriodSeconds:<nil> Labels:map[] Annotations:map[] OwnerReferences:[] Finalizers:[snapshot.storage.kubernetes.io/volumesnapshot-as-source-protection] ManagedFields:[{Manager:node-agent-server Operation:Update APIVersion:snapshot.storage.k8s.io/v1 Time:2024-08-14 16:42:52 +0000 UTC FieldsType:FieldsV1 FieldsV1:{"f:spec":{".":{},"f:source":{".":{},"f:volumeSnapshotContentName":{}},"f:volumeSnapshotClassName":{}}} Subresource:} {Manager:snapshot-controller Operation:Update APIVersion:snapshot.storage.k8s.io/v1 Time:2024-08-14 16:42:52 +0000 UTC FieldsType:FieldsV1 FieldsV1:{"f:metadata":{"f:finalizers":{".":{},"v:\"snapshot.storage.kubernetes.io/volumesnapshot-as-source-protection\"":{}}}} Subresource:} {Manager:snapshot-controller Operation:Update APIVersion:snapshot.storage.k8s.io/v1 Time:2024-08-14 16:42:52 +0000 UTC FieldsType:FieldsV1 FieldsV1:{"f:status":{".":{},"f:boundVolumeSnapshotContentName":{},"f:readyToUse":{}}} Subresource:status}]} Spec:{Source:{PersistentVolumeClaimName:<nil> VolumeSnapshotContentName:0xc000638590} VolumeSnapshotClassName:0xc0006385a0} Status:0xc0007b5620}
W0814 16:42:52.734771 1 controller.go:934] Retrying syncing claim "27729f25-3337-4b97-b7bc-c99f413b0fc6", failure 0
E0814 16:42:52.734814 1 controller.go:957] error syncing claim "27729f25-3337-4b97-b7bc-c99f413b0fc6": failed to provision volume with StorageClass "truenas-delete": error getting handle for DataSource Type VolumeSnapshot by Name test1-rb6tl: snapshot test1-rb6tl is not Ready
I0814 16:42:52.734891 1 event.go:364] Event(v1.ObjectReference{Kind:"PersistentVolumeClaim", Namespace:"velero", Name:"test1-rb6tl", UID:"27729f25-3337-4b97-b7bc-c99f413b0fc6", APIVersion:"v1", ResourceVersion:"17082142", FieldPath:""}): type: 'Warning' reason: 'ProvisioningFailed' failed to provision volume with StorageClass "truenas-delete": error getting handle for DataSource Type VolumeSnapshot by Name test1-rb6tl: snapshot test1-rb6tl is not Ready
I0814 16:42:53.735874 1 controller.go:1366] provision "velero/test1-rb6tl" class "truenas-delete": started
I0814 16:42:53.736046 1 event.go:364] Event(v1.ObjectReference{Kind:"PersistentVolumeClaim", Namespace:"velero", Name:"test1-rb6tl", UID:"27729f25-3337-4b97-b7bc-c99f413b0fc6", APIVersion:"v1", ResourceVersion:"17082142", FieldPath:""}): type: 'Normal' reason: 'Provisioning' External provisioner is provisioning volume for claim "velero/test1-rb6tl"
I0814 16:42:53.739340 1 controller.go:1150] VolumeSnapshot &{TypeMeta:{Kind: APIVersion:} ObjectMeta:{Name:test1-rb6tl GenerateName: Namespace:velero SelfLink: UID:9f62e55c-a1ae-439a-93ce-cde964158383 ResourceVersion:17082153 Generation:1 CreationTimestamp:2024-08-14 16:42:52 +0000 UTC DeletionTimestamp:<nil> DeletionGracePeriodSeconds:<nil> Labels:map[] Annotations:map[] OwnerReferences:[] Finalizers:[snapshot.storage.kubernetes.io/volumesnapshot-as-source-protection snapshot.storage.kubernetes.io/volumesnapshot-bound-protection] ManagedFields:[{Manager:node-agent-server Operation:Update APIVersion:snapshot.storage.k8s.io/v1 Time:2024-08-14 16:42:52 +0000 UTC FieldsType:FieldsV1 FieldsV1:{"f:spec":{".":{},"f:source":{".":{},"f:volumeSnapshotContentName":{}},"f:volumeSnapshotClassName":{}}} Subresource:} {Manager:snapshot-controller Operation:Update APIVersion:snapshot.storage.k8s.io/v1 Time:2024-08-14 16:42:52 +0000 UTC FieldsType:FieldsV1 FieldsV1:{"f:metadata":{"f:finalizers":{".":{},"v:\"snapshot.storage.kubernetes.io/volumesnapshot-as-source-protection\"":{},"v:\"snapshot.storage.kubernetes.io/volumesnapshot-bound-protection\"":{}}}} Subresource:} {Manager:snapshot-controller Operation:Update APIVersion:snapshot.storage.k8s.io/v1 Time:2024-08-14 16:42:52 +0000 UTC FieldsType:FieldsV1 FieldsV1:{"f:status":{".":{},"f:boundVolumeSnapshotContentName":{},"f:creationTime":{},"f:readyToUse":{},"f:restoreSize":{}}} Subresource:status}]} Spec:{Source:{PersistentVolumeClaimName:<nil> VolumeSnapshotContentName:0xc0007ae730} VolumeSnapshotClassName:0xc0007ae740} Status:0xc0000e3dd0}
I0814 16:42:53.743356 1 controller.go:1176] VolumeSnapshotContent &{TypeMeta:{Kind: APIVersion:} ObjectMeta:{Name:test1-rb6tl GenerateName: Namespace: SelfLink: UID:7b49d6f9-579f-4352-83d7-c9befd117633 ResourceVersion:17082152 Generation:1 CreationTimestamp:2024-08-14 16:42:52 +0000 UTC DeletionTimestamp:<nil> DeletionGracePeriodSeconds:<nil> Labels:map[] Annotations:map[snapshot.storage.kubernetes.io/deletion-secret-name:truenas-secret snapshot.storage.kubernetes.io/deletion-secret-namespace:kube-system] OwnerReferences:[] Finalizers:[snapshot.storage.kubernetes.io/volumesnapshotcontent-bound-protection] ManagedFields:[{Manager:csi-snapshotter Operation:Update APIVersion:snapshot.storage.k8s.io/v1 Time:2024-08-14 16:42:52 +0000 UTC FieldsType:FieldsV1 FieldsV1:{"f:status":{".":{},"f:creationTime":{},"f:readyToUse":{},"f:restoreSize":{},"f:snapshotHandle":{}}} Subresource:status} {Manager:node-agent-server Operation:Update APIVersion:snapshot.storage.k8s.io/v1 Time:2024-08-14 16:42:52 +0000 UTC FieldsType:FieldsV1 FieldsV1:{"f:metadata":{"f:annotations":{".":{},"f:snapshot.storage.kubernetes.io/deletion-secret-name":{},"f:snapshot.storage.kubernetes.io/deletion-secret-namespace":{}}},"f:spec":{".":{},"f:deletionPolicy":{},"f:driver":{},"f:source":{".":{},"f:snapshotHandle":{}},"f:volumeSnapshotClassName":{},"f:volumeSnapshotRef":{".":{},"f:name":{},"f:namespace":{},"f:resourceVersion":{},"f:uid":{}}}} Subresource:} {Manager:snapshot-controller Operation:Update APIVersion:snapshot.storage.k8s.io/v1 Time:2024-08-14 16:42:52 +0000 UTC FieldsType:FieldsV1 FieldsV1:{"f:metadata":{"f:finalizers":{".":{},"v:\"snapshot.storage.kubernetes.io/volumesnapshotcontent-bound-protection\"":{}}}} Subresource:}]} Spec:{VolumeSnapshotRef:{Kind: Namespace:velero Name:test1-rb6tl UID:9f62e55c-a1ae-439a-93ce-cde964158383 APIVersion: ResourceVersion:17082136 FieldPath:} DeletionPolicy:Delete Driver:csi.hpe.com VolumeSnapshotClassName:0xc0007aedc0 Source:{VolumeHandle:<nil> SnapshotHandle:0xc0007aedb0} SourceVolumeMode:<nil>} Status:0xc0005f7290}
I0814 16:42:53.743434 1 controller.go:1187] VolumeContentSource_Snapshot {Snapshot:snapshot_id:"Kubernetes_pvc-76ffe0d8-c7d0-4e6e-aea3-8adeece42916@snapshot-a81f7e24-0f58-4611-98bb-20168d53acb2" }
I0814 16:42:53.743865 1 controller.go:1195] Requested volume size is 5368709120 and snapshot size is 0 for the source snapshot test1-rb6tl
W0814 16:42:53.743899 1 controller.go:1202] requested volume size 5368709120 is greater than the size 0 for the source snapshot test1-rb6tl. Volume plugin needs to handle volume expansion.
I0814 16:42:53.747651 1 connection.go:244] GRPC call: /csi.v1.Controller/CreateVolume
I0814 16:42:53.747670 1 connection.go:245] GRPC request: {"capacity_range":{"required_bytes":5368709120},"name":"pvc-27729f25-3337-4b97-b7bc-c99f413b0fc6","parameters":{"allowOverrides":"sparse,compression,deduplication,volblocksize,sync,description","csi.storage.k8s.io/pv/name":"pvc-27729f25-3337-4b97-b7bc-c99f413b0fc6","csi.storage.k8s.io/pvc/name":"test1-rb6tl","csi.storage.k8s.io/pvc/namespace":"velero","root":"Kubernetes"},"secrets":"***stripped***","volume_capabilities":[{"AccessType":{"Mount":{"fs_type":"xfs"}},"access_mode":{"mode":1}}],"volume_content_source":{"Type":{"Snapshot":{"snapshot_id":"Kubernetes_pvc-76ffe0d8-c7d0-4e6e-aea3-8adeece42916@snapshot-a81f7e24-0f58-4611-98bb-20168d53acb2"}}}}
I0814 16:42:56.382780 1 connection.go:251] GRPC response: {"volume":{"capacity_bytes":5368709120,"content_source":{"Type":{"Snapshot":{"snapshot_id":"Kubernetes_pvc-76ffe0d8-c7d0-4e6e-aea3-8adeece42916@snapshot-a81f7e24-0f58-4611-98bb-20168d53acb2"}}},"volume_context":{"allowOverrides":"sparse,compression,deduplication,volblocksize,sync,description","compression":"LZ4","csi.storage.k8s.io/pv/name":"pvc-27729f25-3337-4b97-b7bc-c99f413b0fc6","csi.storage.k8s.io/pvc/name":"test1-rb6tl","csi.storage.k8s.io/pvc/namespace":"velero","deduplication":"OFF","fsType":"xfs","hostEncryption":"false","root":"Kubernetes","sync":"STANDARD","targetScope":"volume","volblocksize":"8K","volumeAccessMode":"mount"},"volume_id":"Kubernetes_pvc-27729f25-3337-4b97-b7bc-c99f413b0fc6"}}
I0814 16:42:56.382800 1 connection.go:252] GRPC error: <nil>
I0814 16:42:56.382813 1 controller.go:858] create volume rep: {CapacityBytes:5368709120 VolumeId:Kubernetes_pvc-27729f25-3337-4b97-b7bc-c99f413b0fc6 VolumeContext:map[allowOverrides:sparse,compression,deduplication,volblocksize,sync,description compression:LZ4 csi.storage.k8s.io/pv/name:pvc-27729f25-3337-4b97-b7bc-c99f413b0fc6 csi.storage.k8s.io/pvc/name:test1-rb6tl csi.storage.k8s.io/pvc/namespace:velero deduplication:OFF fsType:xfs hostEncryption:false root:Kubernetes sync:STANDARD targetScope:volume volblocksize:8K volumeAccessMode:mount] ContentSource:snapshot:<snapshot_id:"Kubernetes_pvc-76ffe0d8-c7d0-4e6e-aea3-8adeece42916@snapshot-a81f7e24-0f58-4611-98bb-20168d53acb2" > AccessibleTopology:[] XXX_NoUnkeyedLiteral:{} XXX_unrecognized:[] XXX_sizecache:0}
I0814 16:42:56.382891 1 controller.go:930] createVolumeOperation: set annotation [volume.kubernetes.io/provisioner-deletion-secret-namespace/volume.kubernetes.io/provisioner-deletion-secret-name] on pv [pvc-27729f25-3337-4b97-b7bc-c99f413b0fc6].
I0814 16:42:56.382921 1 controller.go:960] successfully created PV pvc-27729f25-3337-4b97-b7bc-c99f413b0fc6 for PVC test1-rb6tl and csi volume name Kubernetes_pvc-27729f25-3337-4b97-b7bc-c99f413b0fc6
I0814 16:42:56.382939 1 controller.go:976] successfully created PV {GCEPersistentDisk:nil AWSElasticBlockStore:nil HostPath:nil Glusterfs:nil NFS:nil RBD:nil ISCSI:nil Cinder:nil CephFS:nil FC:nil Flocker:nil FlexVolume:nil AzureFile:nil VsphereVolume:nil Quobyte:nil AzureDisk:nil PhotonPersistentDisk:nil PortworxVolume:nil ScaleIO:nil Local:nil StorageOS:nil CSI:&CSIPersistentVolumeSource{Driver:csi.hpe.com,VolumeHandle:Kubernetes_pvc-27729f25-3337-4b97-b7bc-c99f413b0fc6,ReadOnly:false,FSType:xfs,VolumeAttributes:map[string]string{allowOverrides: sparse,compression,deduplication,volblocksize,sync,description,compression: LZ4,csi.storage.k8s.io/pv/name: pvc-27729f25-3337-4b97-b7bc-c99f413b0fc6,csi.storage.k8s.io/pvc/name: test1-rb6tl,csi.storage.k8s.io/pvc/namespace: velero,deduplication: OFF,fsType: xfs,hostEncryption: false,root: Kubernetes,storage.kubernetes.io/csiProvisionerIdentity: 1723652614992-6440-csi.hpe.com,sync: STANDARD,targetScope: volume,volblocksize: 8K,volumeAccessMode: mount,},ControllerPublishSecretRef:&SecretReference{Name:truenas-secret,Namespace:kube-system,},NodeStageSecretRef:&SecretReference{Name:truenas-secret,Namespace:kube-system,},NodePublishSecretRef:&SecretReference{Name:truenas-secret,Namespace:kube-system,},ControllerExpandSecretRef:&SecretReference{Name:truenas-secret,Namespace:kube-system,},NodeExpandSecretRef:nil,}}
I0814 16:42:56.383025 1 controller.go:1449] provision "velero/test1-rb6tl" class "truenas-delete": volume "pvc-27729f25-3337-4b97-b7bc-c99f413b0fc6" provisioned
I0814 16:42:56.383038 1 controller.go:1462] provision "velero/test1-rb6tl" class "truenas-delete": succeeded
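As for the restore-size warnings above: they just mean the snapshot objects did not report a restoreSize, so the provisioner falls back to the source volume's 5Gi, and the CreateVolume call in the log still succeeds. If you want to see what the snapshot objects report while a backup is in flight (the temporary objects are cleaned up afterwards), a quick check might look like this sketch, with the object names taken from the log above:

# Does the snapshot report a restore size? The VolumeSnapshot lives in the
# velero namespace; the VolumeSnapshotContent is cluster-scoped.
kubectl -n velero get volumesnapshot test1-rb6tl \
  -o jsonpath='{.status.readyToUse}{"  "}{.status.restoreSize}{"\n"}'
kubectl get volumesnapshotcontent test1-rb6tl \
  -o jsonpath='{.status.readyToUse}{"  "}{.status.restoreSize}{"\n"}'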
@Zalfsten
During data movement velero will create a temporary pod ("Backup pod" in the log) and mount a PVC created via the snapshot.
Based on the log messages
time="2024-08-13T13:10:17Z" level=info msg="Backup PVC is created" logSource="pkg/exposer/csi_snapshot.go:185" owner=test5-8klq9 pvc name=test5-8klq9
time="2024-08-13T13:10:17Z" level=info msg="Backup pod is created" logSource="pkg/exposer/csi_snapshot.go:197" owner=test5-8klq9 pod name=test5-8klq9
time="2024-08-13T13:10:17Z" level=info msg="Snapshot is exposed" controller=dataupload dataupload=velero/test5-8klq9
time="2024-08-13T13:10:17Z" level=info msg="Reconcile test5-8klq9" controller=dataupload dataupload=velero/test5-8klq9 logSource="pkg/controller/data_upload_controller.go:112"
time="2024-08-13T13:10:33Z" level=info msg="Reconcile test5-8klq9" controller=dataupload dataupload=velero/test5-8klq9 logSource="pkg/controller/data_upload_controller.go:112"
time="2024-08-13T13:11:33Z" level=info msg="Reconcile test5-8klq9" controller=dataupload dataupload=velero/test5-8klq9 logSource="pkg/controller/data_upload_controller.go:112"
....
time="2024-08-13T13:40:33Z" level=info msg="Timeout happened for preparing dataupload" Dataupload=test5-8klq9 logSource="pkg/controller/data_upload_controller.go:741"
The pod did not transition to the Running state within 30 minutes and was eventually deleted, so in the log bundle there's no clear message to help us understand why the pod was stuck in Pending.
Have you tested that you can manually take a snapshot and restore from it using the CSI driver and snapshotter in your environment? If that works, you may need to reproduce the issue, grab the temporary pod and PVC, and check their status to see why they were stuck.
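Since the temporary pod and PVC carry the same name as the DataUpload and live in the velero namespace (per the log above), capturing their state before the 30-minute timeout cleans them up might look like this sketch; the DataUpload name is the one from this thread, substitute yours:

DU=test5-8klq9   # DataUpload name, see 'kubectl -n velero get datauploads'
kubectl -n velero get dataupload "$DU" -o yaml
kubectl -n velero describe pod "$DU"     # temporary backup pod
kubectl -n velero describe pvc "$DU"     # temporary backup PVC
kubectl -n velero get events --sort-by=.lastTimestamp | grep "$DU"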
cc @Lyndon-Li I think we should dump more messages in the node-agent before the pod and PVC are cleaned up in this timeout situation, to help us debug.
@reasonerjt
I think we should dump more messages in the node-agent before the pod and PVC are cleaned up in this timeout situation, to help us debug
Yes, I opened an issue for this, see #8125
@reasonerjt
Sorry, I've had little time to work on this issue. I can manually restore from a snapshot with my CSI driver (i.e. I create a PVC, take a snapshot of the PVC, and then create a new PVC with the snapshot as the source -- please give me a hint if this is not the correct test). I will check the backup pod and PVC tomorrow. I will post any new findings in issue #8125.
Oops... I really do not know what happened, but the backups now show "completed".
The only thing I can tell is that I've recreated the StorageClasses -- without changing anything, as far as I can tell.
Now I have the same issue again, this time using a different CSI driver.
The temporary pod shows this error message:
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning FailedScheduling 10m (x2 over 10m) default-scheduler 0/2 nodes are available: pod has unbound immediate PersistentVolumeClaims. preemption: 0/2 nodes are available: 2 Preemption is not helpful for scheduling..
Normal Scheduled 10m default-scheduler Successfully assigned velero/test3-j8bf6 to elise-01
Warning FailedMount 24s (x13 over 10m) kubelet MountVolume.MountDevice failed for volume "pvc-52178ff2-b3be-495a-a28e-a98016f6a337" : kubernetes.io/csi: attacher.MountDevice failed to create newCsiDriverClient: driver name nfs.csi.k8s.io not found in the list of registered CSI drivers
In contrast to the error message, I can find the CSI driver:
$ kubectl describe csidrivers.storage.k8s.io nfs.csi.k8s.io
Name: nfs.csi.k8s.io
Namespace:
Labels: argocd.argoproj.io/instance=storage
Annotations: argocd.argoproj.io/tracking-id: storage:storage.k8s.io/CSIDriver:storage/nfs.csi.k8s.io
API Version: storage.k8s.io/v1
Kind: CSIDriver
Metadata:
Creation Timestamp: 2024-08-20T17:20:14Z
Resource Version: 20159506
UID: 1a8ffe86-c836-4084-8825-4cc0cf3550df
Spec:
Attach Required: false
Fs Group Policy: File
Pod Info On Mount: false
Requires Republish: false
Se Linux Mount: false
Storage Capacity: false
Volume Lifecycle Modes:
Persistent
Events: <none>
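Note that the "not found in the list of registered CSI drivers" error comes from the kubelet on the node and refers to the per-node plugin registration (what the node-driver-registrar reports and what shows up in the CSINode object), not to the cluster-wide CSIDriver object shown above, so the CSIDriver existing does not rule this out. A sketch of how to check the node side, assuming the default csi-driver-nfs manifests; the namespace, label and container name may differ in your install:

# Is nfs.csi.k8s.io registered with the kubelet on the node the pod landed on?
kubectl get csinode elise-01 -o jsonpath='{range .spec.drivers[*]}{.name}{"\n"}{end}'

# Find the csi-driver-nfs node-plugin pod on that node and check its registrar sidecar.
POD=$(kubectl -n kube-system get pods -l app=csi-nfs-node \
        --field-selector spec.nodeName=elise-01 -o name)
kubectl -n kube-system logs "$POD" -c node-driver-registrar --tail=50

If the registration only completed after the backup pod was scheduled, that could also explain the non-deterministic behaviour mentioned below.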
Now it works again. Seems to be non-deterministic...
What steps did you take and what happened:
I'm trying to create a backup from a sample nginx deployment (including a PVC) in the namespace default. These are my steps:
Then, after half an hour:
In the node agent log, I observe the following error message:
What did you expect to happen:
I expected to see a big file in my S3 storage, containing the backup of the persistent volume (though I do not know what exactly to look out for).
The following information will help us better understand what's going on:
bundle-2024-08-13-15-53-39.tar.gz
Anything else you would like to add:
I'm unsure whether I need to add a VolumeSnapshotLocation (using the provider csi). But if I do, I run into another error about velero not finding a matching VolumeSnapshotter.
I've installed the velero server using the helm chart version 7.1.2. This is my values file:
I suppose the VolumeSnapshotClass is also of interest:
Environment:
Velero version (use velero version): (I've tried v1.14.0 before, but also without success)
Velero features (use velero client config get features):
Kubernetes version (use kubectl version): on-premise
Kubernetes installer & version: kubespray 2.24.0
Cloud provider or hardware configuration: 2-node cluster on VMware, 4 CPUs and 8GB RAM per node.
OS (e.g. from /etc/os-release):