ambiknai opened this issue 2 years ago (status: Open)
@ggriffiths Can you take a look? These are the "update" calls that have not yet been replaced with "patch" in snapshot-controller.
Yes, there are many spots where we still use "update" instead of "patch":
pkg/common-controller/snapshot_controller.go:1383: newSnapshot, err := ctrl.clientset.SnapshotV1().VolumeSnapshots(snapshotClone.Namespace).Update(context.TODO(), snapshotClone, metav1.UpdateOptions{})
pkg/common-controller/snapshot_controller.go:1454: updatedSnapshot, err = ctrl.clientset.SnapshotV1().VolumeSnapshots(snapshotClone.Namespace).Update(context.TODO(), snapshotClone, metav1.UpdateOptions{})
pkg/common-controller/snapshot_controller.go:1523: newSnapshot, err := ctrl.clientset.SnapshotV1().VolumeSnapshots(snapshotClone.Namespace).Update(context.TODO(), snapshotClone, metav1.UpdateOptions{})
pkg/common-controller/snapshot_controller.go:1618: updatedContent, err := ctrl.clientset.SnapshotV1().VolumeSnapshotContents().Update(context.TODO(), contentClone, metav1.UpdateOptions{})
pkg/common-controller/snapshot_controller.go:1660: updatedSnapshot, err := ctrl.clientset.SnapshotV1().VolumeSnapshots(snapshot.Namespace).Update(context.TODO(), snapshotClone, metav1.UpdateOptions{})
pkg/sidecar-controller/snapshot_controller.go:547: updatedContent, err := ctrl.clientset.SnapshotV1().VolumeSnapshotContents().Update(context.TODO(), contentClone, metav1.UpdateOptions{})
pkg/sidecar-controller/snapshot_controller.go:642: updatedContent, err := ctrl.clientset.SnapshotV1().VolumeSnapshotContents().Update(context.TODO(), contentClone, metav1.UpdateOptions{})
pkg/common-controller/snapshot_controller.go:813: newSnapshot, err := ctrl.clientset.SnapshotV1().VolumeSnapshots(snapshotClone.Namespace).UpdateStatus(context.TODO(), snapshotClone, metav1.UpdateOptions{})
pkg/common-controller/snapshot_controller.go:1218: newSnapshotObj, err := ctrl.clientset.SnapshotV1().VolumeSnapshots(snapshotClone.Namespace).UpdateStatus(context.TODO(), snapshotClone, metav1.UpdateOptions{})
pkg/sidecar-controller/snapshot_controller.go:403: newContent, err := ctrl.clientset.SnapshotV1().VolumeSnapshotContents().UpdateStatus(context.TODO(), content, metav1.UpdateOptions{})
pkg/sidecar-controller/snapshot_controller.go:459: newContent, err := ctrl.clientset.SnapshotV1().VolumeSnapshotContents().UpdateStatus(context.TODO(), contentClone, metav1.UpdateOptions{})
This error will still be hit in these scenarios. We reduced the major scenarios in https://github.com/kubernetes-csi/external-snapshotter/pull/526, but there is more work to be done. I'm happy to review a PR for this work.
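For readers picking this up, here is a minimal sketch of the Update-to-Patch shape being discussed, assuming the generated snapshot clientset's standard client-gen Patch signature; the helper name and the patched field are illustrative, not the project's actual implementation (the repo has its own patch utilities):

```go
// Sketch only: a targeted JSON merge patch on metadata.finalizers instead of a
// whole-object Update, so a stale resourceVersion on the cached copy no longer
// triggers "the object has been modified" conflicts.
package sketch

import (
	"context"
	"encoding/json"

	snapshotclient "github.com/kubernetes-csi/external-snapshotter/client/v6/clientset/versioned"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/types"
)

// patchSnapshotFinalizers is a hypothetical helper for illustration.
func patchSnapshotFinalizers(ctx context.Context, cs snapshotclient.Interface,
	namespace, name string, finalizers []string) error {

	payload, err := json.Marshal(map[string]interface{}{
		"metadata": map[string]interface{}{
			"finalizers": finalizers,
		},
	})
	if err != nil {
		return err
	}

	// Patch writes only the fields carried in the payload, so it does not
	// require the caller to hold the latest resourceVersion.
	_, err = cs.SnapshotV1().VolumeSnapshots(namespace).Patch(
		ctx, name, types.MergePatchType, payload, metav1.PatchOptions{})
	return err
}
```

Note that a merge patch still replaces the whole finalizers list, so it narrows but does not eliminate races on that particular field.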
/help
@ggriffiths: This request has been marked as needing help from a contributor.
Please ensure that the issue body includes answers to the following questions:
For more details on the requirements of such an issue, please see here and ensure that they are met.
If this request no longer meets these requirements, the label can be removed by commenting with the /remove-help command.
I've sent PR #757.
/assign @camartinez04
Also hit this issue.
E1011 03:12:02.763356 1 snapshot_controller.go:993] checkandRemovePVCFinalizer [velero-kibishii-data-kibishii-deployment-0-tsdgw]: removePVCFinalizer failed to remove finalizer snapshot controller failed to update kibishii-data-kibishii-deployment-0 on API server: Operation cannot be fulfilled on persistentvolumeclaims "kibishii-data-kibishii-deployment-0": the object has been modified; please apply your changes to the latest version and try again
E1011 03:12:02.763396 1 snapshot_controller.go:189] error check and remove PVC finalizer for snapshot [velero-kibishii-data-kibishii-deployment-0-tsdgw]: snapshot controller failed to update kibishii-data-kibishii-deployment-0 on API server: Operation cannot be fulfilled on persistentvolumeclaims "kibishii-data-kibishii-deployment-0": the object has been modified; please apply your changes to the latest version and try again
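As a general client-go mitigation for conflicts like the one in these logs (separate from the Update-to-Patch rewrite this issue tracks), the write can be wrapped in retry.RetryOnConflict so it is retried against a freshly fetched object. A minimal sketch; the function and wiring are illustrative, not the controller's actual code:

```go
// Sketch of the generic client-go conflict-retry pattern for removing a PVC
// finalizer; the finalizer name and surrounding wiring are assumed.
package sketch

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	corev1client "k8s.io/client-go/kubernetes/typed/core/v1"
	"k8s.io/client-go/util/retry"
)

func removePVCFinalizerWithRetry(ctx context.Context,
	pvcs corev1client.PersistentVolumeClaimInterface, name, finalizer string) error {

	return retry.RetryOnConflict(retry.DefaultRetry, func() error {
		// Re-read the latest PVC on every attempt so the Update carries the
		// current resourceVersion instead of a stale cached copy.
		pvc, err := pvcs.Get(ctx, name, metav1.GetOptions{})
		if err != nil {
			return err
		}
		var kept []string
		for _, f := range pvc.Finalizers {
			if f != finalizer {
				kept = append(kept, f)
			}
		}
		pvc.Finalizers = kept
		_, err = pvcs.Update(ctx, pvc, metav1.UpdateOptions{})
		return err
	})
}
```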
/unassign @camartinez04
The Kubernetes project currently lacks enough contributors to adequately respond to all PRs.
This bot triages PRs according to the following rules:
- After a period of inactivity, lifecycle/stale is applied
- After further inactivity once lifecycle/stale was applied, lifecycle/rotten is applied
- After further inactivity once lifecycle/rotten was applied, the PR is closed
You can:
- Mark this PR as fresh with /remove-lifecycle stale
- Close this PR with /close
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
@ggriffiths @xing-yang If no one is actively working on this issue, I would like to work on it.
Hi @shubham-pampattiwar, that's great. Can you please assign this issue to yourself? Thanks.
/remove-lifecycle stale
/assign
/unassign @shubham-pampattiwar
Spotted this issue when using Velero 1.12.1 and CSI plugin 0.6.1 together with snapshot-controller 6.3.1:
E1106 14:02:37.353831 1 snapshot_controller.go:107] createSnapshot for content [snapcontent-3cce7669-3854-4ee3-a157-d3abb5ff2dcd]: error occurred in createSnapshotWrapper: failed to remove VolumeSnapshotBeingCreated annotation on the content snapcontent-3cce7669-3854-4ee3-a157-d3abb5ff2dcd: "snapshot controller failed to update snapcontent-3cce7669-3854-4ee3-a157-d3abb5ff2dcd on API server: Operation cannot be fulfilled on volumesnapshotcontents.snapshot.storage.k8s.io \"snapcontent-3cce7669-3854-4ee3-a157-d3abb5ff2dcd\": the object has been modified; please apply your changes to the latest version and try again"
E1106 14:02:37.353865 1 snapshot_controller_base.go:351] could not sync content "snapcontent-3cce7669-3854-4ee3-a157-d3abb5ff2dcd": failed to remove VolumeSnapshotBeingCreated annotation on the content snapcontent-3cce7669-3854-4ee3-a157-d3abb5ff2dcd: "snapshot controller failed to update snapcontent-3cce7669-3854-4ee3-a157-d3abb5ff2dcd on API server: Operation cannot be fulfilled on volumesnapshotcontents.snapshot.storage.k8s.io \"snapcontent-3cce7669-3854-4ee3-a157-d3abb5ff2dcd\": the object has been modified; please apply your changes to the latest version and try again"
I1106 14:02:37.353904 1 snapshot_controller.go:291] createSnapshotWrapper: Creating snapshot for content snapcontent-3cce7669-3854-4ee3-a157-d3abb5ff2dcd through the plugin ...
I1106 14:02:37.353931 1 event.go:298] Event(v1.ObjectReference{Kind:"VolumeSnapshotContent", Namespace:"", Name:"snapcontent-3cce7669-3854-4ee3-a157-d3abb5ff2dcd", UID:"194a4dc1-c11c-44d4-acb4-504a0fed9314", APIVersion:"snapshot.storage.k8s.io/v1", ResourceVersion:"934203821", FieldPath:""}): type: 'Warning' reason: 'SnapshotCreationFailed' Failed to create snapshot: failed to remove VolumeSnapshotBeingCreated annotation on the content snapcontent-3cce7669-3854-4ee3-a157-d3abb5ff2dcd: "snapshot controller failed to update snapcontent-3cce7669-3854-4ee3-a157-d3abb5ff2dcd on API server: Operation cannot be fulfilled on volumesnapshotcontents.snapshot.storage.k8s.io \"snapcontent-3cce7669-3854-4ee3-a157-d3abb5ff2dcd\": the object has been modified; please apply your changes to the latest version and try again"
As the logs indicate, the patch rewrites are not in the 6.3.1 release yet. Please release them ASAP to close this issue.
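For reference, the kind of rewrite being asked for can be sketched as a JSON patch that removes just the one annotation, rather than an Update of the whole VolumeSnapshotContent; the helper is illustrative and the annotation key is the one referenced in the logs above (escaped per JSON Pointer rules), not necessarily the exact cherry-picked change:

```go
// Illustrative only: remove a single annotation with a JSON patch so the write
// does not depend on the cached object's resourceVersion.
package sketch

import (
	"context"
	"encoding/json"

	snapshotclient "github.com/kubernetes-csi/external-snapshotter/client/v6/clientset/versioned"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/types"
)

func removeBeingCreatedAnnotation(ctx context.Context, cs snapshotclient.Interface, contentName string) error {
	// JSON Pointer escapes "/" in the annotation key as "~1". Note that a
	// JSON-patch "remove" fails if the key is absent, so real code would guard
	// on the annotation being present first.
	patch := []map[string]interface{}{{
		"op":   "remove",
		"path": "/metadata/annotations/snapshot.storage.kubernetes.io~1volumesnapshot-being-created",
	}}
	data, err := json.Marshal(patch)
	if err != nil {
		return err
	}
	_, err = cs.SnapshotV1().VolumeSnapshotContents().Patch(
		ctx, contentName, types.JSONPatchType, data, metav1.PatchOptions{})
	return err
}
```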
I've cherry-picked commit https://github.com/kubernetes-csi/external-snapshotter/commit/ff71329d8cc08dca53194f8d4568ed53e69ecb82 from main to the release-6.3 branch and created new images:
I can confirm the patch rewrites solve this issue. IMHO the patch could be safely merged to the release-6.3 branch.
We have also seen similar errors about snapshot status updates, most likely coming from the UpdateStatus calls mentioned above. I don't see them addressed in the PR, but I believe those are also important.
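Status writes can be expressed the same way: a patch against the status subresource instead of an UpdateStatus call on a possibly stale object. A minimal sketch, assuming the generated clientset's standard client-gen Patch signature with subresource support; the field chosen is just an example:

```go
// Sketch only: merge-patch readyToUse on the VolumeSnapshot status subresource.
package sketch

import (
	"context"
	"encoding/json"

	snapshotclient "github.com/kubernetes-csi/external-snapshotter/client/v6/clientset/versioned"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/types"
)

func patchSnapshotReadyToUse(ctx context.Context, cs snapshotclient.Interface,
	namespace, name string, ready bool) error {

	payload, err := json.Marshal(map[string]interface{}{
		"status": map[string]interface{}{
			"readyToUse": ready,
		},
	})
	if err != nil {
		return err
	}

	// The trailing "status" argument targets the status subresource, which is
	// required because the VolumeSnapshot CRD enables the status subresource.
	_, err = cs.SnapshotV1().VolumeSnapshots(namespace).Patch(
		ctx, name, types.MergePatchType, payload, metav1.PatchOptions{}, "status")
	return err
}
```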
Any update on this? This happens pretty consistently for me. Running v6.3.3.
@julienvincent can you provide logs to understand the context? How do you create snapshots? Often two or more processes interact with the VolumeSnapshot objects.
@phoenix-bjoern Sure! What kind of logs would you like? Happy to provide.
Similar to other users in this thread, I am using Velero@1.12.2 with the CSI plugin at 0.6.2 and Longhorn@1.5.3. Velero is creating the VolumeSnapshots and I am getting the same error as reported in this issue.
For example:
Error:
  Message: Failed to check and update snapshot content: failed to remove VolumeSnapshotBeingCreated annotation on the content snapcontent-48e5a79d-6d41-4b28-9d17-24cfaa920cad: "snapshot controller failed to update snapcontent-48e5a79d-6d41-4b28-9d17-24cfaa920cad on API server: Operation cannot be fulfilled on volumesnapshotcontents.snapshot.storage.k8s.io \"snapcontent-48e5a79d-6d41-4b28-9d17-24cfaa920cad\": the object has been modified; please apply your changes to the latest version and try again"
  Time: 2023-12-28T00:01:34Z
Ready To Use: false
Happy to provide any other information you need.
@julienvincent the snapshot controller log should have a trace for the error. Can you share it to identify the exact code lines?
@phoenix-bjoern sure, here is an example:
E1230 12:33:25.731663 1 snapshot_controller_base.go:470] could not sync snapshot "default/velero-registry-data-7737fc94-tn4m2": snapshot controller failed to update velero-registry-data-7737fc94-tn4m2 on API server: Operation cannot be fulfilled on volumesnapshots.snapshot.storage.k8s.io "velero-registry-data-7737fc94-tn4m2": StorageError: invalid object, Code: 4, Key: /registry/snapshot.storage.k8s.io/volumesnapshots/default/velero-registry-data-7737fc94-tn4m2, ResourceVersion: 0, AdditionalErrorMsg: Precondition failed: UID in precondition: 84d606d2-68a0-4ea7-b670-ea3068ce4975, UID in object meta:
I1230 12:33:25.731729 1 snapshot_controller_base.go:288] Failed to sync snapshot "default/velero-registry-data-7737fc94-tn4m2", will retry again: snapshot controller failed to update velero-registry-data-7737fc94-tn4m2 on API server: Operation cannot be fulfilled on volumesnapshots.snapshot.storage.k8s.io "velero-registry-data-7737fc94-tn4m2": StorageError: invalid object, Code: 4, Key: /registry/snapshot.storage.k8s.io/volumesnapshots/default/velero-registry-data-7737fc94-tn4m2, ResourceVersion: 0, AdditionalErrorMsg: Precondition failed: UID in precondition: 84d606d2-68a0-4ea7-b670-ea3068ce4975, UID in object meta:
Thanks @julienvincent for the additional context. The trace shows a storage error: StorageError: invalid object, Code: 4. It seems the snapshot controller doesn't have an issue itself; it only surfaces the error returned by Velero and/or the CSI driver. Maybe check https://github.com/vmware-tanzu/velero/issues/6179 for more context.
@phoenix-bjoern I might be misunderstanding something, but the linked issue doesn't seem directly related to this one. That issue is about Longhorn not deleting its internal snapshots when CSI volume snapshots are deleted. AFAIU Velero is not actually involved in the CSI snapshot process other than initially creating the VolumeSnapshot resource. I can reproduce this issue independently of Velero by manually creating a VolumeSnapshot, which I think rules out Velero's involvement.
In this case the snapshots themselves are not being reported as successful (but the underlying driver is successfully performing a snapshot).
But if I understand what you are saying, this error message is set on the VolumeSnapshot resource by the driver (Longhorn), and external-snapshotter is just reporting/relaying it?
Would you recommend opening an issue with longhorn?
@julienvincent The snapshot controller only triggers a process which the storage driver then executes. Since the error seems to occur in the storage driver, there is nothing you can do in the snapshot controller. So yes, it makes sense to continue investigating the issue in the Longhorn components.
Hi. We are running into the same issues. Have the UpdateStatus calls already been refactored to use patch instead of update? If not, I would like to help.
I refactored this last month. However, some tests have been failing since then, so I have spent the last several days fixing them.
@hoyho did you have time to finish? Do you know an ETA? Thanks.
No problem. I will probably finish it next week.
What happened:
In our csi-driver, the snapshot controller logs show the error below. There is no functional impact, but there are too many of these errors.
What you expected to happen:
How to reproduce it:
Anything else we need to know?:
Environment: I tested in IKS 1.22, 1.23, 1.24