snapshot-controller logs report failure frequently

ambiknai commented 2 years ago

What happened:

In our csi-driver snapshot controller logs shows below error. There is no impact on functionality. But there are too many of these errors.

snapshot controller failed to update ...xxxxx
the object has been modified; please apply your changes to the latest version and try again

What you expected to happen:

How to reproduce it:

Anything else we need to know?:

Environment: I tested in IKs 1.22,1.23,1.24

Driver version: Our csi driver is not GA
Kubernetes version (use kubectl version):
OS (e.g. from /etc/os-release):
Kernel (e.g. uname -a):
Install tools:
Others: gcr.io/k8s-staging-sig-storage/csi-snapshotter:v6.0.1 gcr.io/k8s-staging-sig-storage/snapshot-controller:v6.0.1

xing-yang commented 2 years ago

@ggriffiths Can you take a look? These are from the "update" that are not replaced with "patch" in snapshot-controller.

ggriffiths commented 2 years ago

Yes, there are many spots where we still use "update" instead of "patch":

[ ] Update:

pkg/common-controller/snapshot_controller.go:1383:  newSnapshot, err := ctrl.clientset.SnapshotV1().VolumeSnapshots(snapshotClone.Namespace).Update(context.TODO(), snapshotClone, metav1.UpdateOptions{})
pkg/common-controller/snapshot_controller.go:1454:      updatedSnapshot, err = ctrl.clientset.SnapshotV1().VolumeSnapshots(snapshotClone.Namespace).Update(context.TODO(), snapshotClone, metav1.UpdateOptions{})
pkg/common-controller/snapshot_controller.go:1523:  newSnapshot, err := ctrl.clientset.SnapshotV1().VolumeSnapshots(snapshotClone.Namespace).Update(context.TODO(), snapshotClone, metav1.UpdateOptions{})
pkg/common-controller/snapshot_controller.go:1618:  updatedContent, err := ctrl.clientset.SnapshotV1().VolumeSnapshotContents().Update(context.TODO(), contentClone, metav1.UpdateOptions{})
pkg/common-controller/snapshot_controller.go:1660:  updatedSnapshot, err := ctrl.clientset.SnapshotV1().VolumeSnapshots(snapshot.Namespace).Update(context.TODO(), snapshotClone, metav1.UpdateOptions{})
pkg/sidecar-controller/snapshot_controller.go:547:  updatedContent, err := ctrl.clientset.SnapshotV1().VolumeSnapshotContents().Update(context.TODO(), contentClone, metav1.UpdateOptions{})
pkg/sidecar-controller/snapshot_controller.go:642:  updatedContent, err := ctrl.clientset.SnapshotV1().VolumeSnapshotContents().Update(context.TODO(), contentClone, metav1.UpdateOptions{})

[ ] UpdateStatus:

pkg/common-controller/snapshot_controller.go:813:   newSnapshot, err := ctrl.clientset.SnapshotV1().VolumeSnapshots(snapshotClone.Namespace).UpdateStatus(context.TODO(), snapshotClone, metav1.UpdateOptions{})
pkg/common-controller/snapshot_controller.go:1218:      newSnapshotObj, err := ctrl.clientset.SnapshotV1().VolumeSnapshots(snapshotClone.Namespace).UpdateStatus(context.TODO(), snapshotClone, metav1.UpdateOptions{})
pkg/sidecar-controller/snapshot_controller.go:403:  newContent, err := ctrl.clientset.SnapshotV1().VolumeSnapshotContents().UpdateStatus(context.TODO(), content, metav1.UpdateOptions{})
pkg/sidecar-controller/snapshot_controller.go:459:      newContent, err := ctrl.clientset.SnapshotV1().VolumeSnapshotContents().UpdateStatus(context.TODO(), contentClone, metav1.UpdateOptions{})

This error will still be hit in these scenarios. We reduced the major scenarios in https://github.com/kubernetes-csi/external-snapshotter/pull/526, but there is more work to be done. I'm happy to review a PR for this work.

ggriffiths commented 2 years ago

/help

k8s-ci-robot commented 2 years ago

@ggriffiths: This request has been marked as needing help from a contributor.

Guidelines

Please ensure that the issue body includes answers to the following questions:

Why are we solving this issue?
To address this issue, are there any code changes? If there are code changes, what needs to be done in the code and what places can the assignee treat as reference points?
Does this issue have zero to low barrier of entry?
How can the assignee reach out to you for help?

For more details on the requirements of such an issue, please see here and ensure that they are met.

If this request no longer meets these requirements, the label can be removed by commenting with the /remove-help command.

In response to [this](https://github.com/kubernetes-csi/external-snapshotter/issues/748): >/help Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes/test-infra](https://github.com/kubernetes/test-infra/issues/new?title=Prow%20issue:) repository.

camartinez04 commented 2 years ago

I've sent a PR-757

ggriffiths commented 2 years ago

/assign @camartinez04

blackpiglet commented 2 years ago

Also hit this issue.

E1011 03:12:02.763356       1 snapshot_controller.go:993] checkandRemovePVCFinalizer [velero-kibishii-data-kibishii-deployment-0-tsdgw]: removePVCFinalizer failed to remove finalizer snapshot controller failed to update kibishii-data-kibishii-deployment-0 on API server: Operation cannot be fulfilled on persistentvolumeclaims "kibishii-data-kibishii-deployment-0": the object has been modified; please apply your changes to the latest version and try again
E1011 03:12:02.763396       1 snapshot_controller.go:189] error check and remove PVC finalizer for snapshot [velero-kibishii-data-kibishii-deployment-0-tsdgw]: snapshot controller failed to update kibishii-data-kibishii-deployment-0 on API server: Operation cannot be fulfilled on persistentvolumeclaims "kibishii-data-kibishii-deployment-0": the object has been modified; please apply your changes to the latest version and try again

ggriffiths commented 2 years ago

/unassign @camartinez04

k8s-triage-robot commented 1 year ago

The Kubernetes project currently lacks enough contributors to adequately respond to all PRs.

This bot triages PRs according to the following rules:

After 90d of inactivity, lifecycle/stale is applied
After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
After 30d of inactivity since lifecycle/rotten was applied, the PR is closed

You can:

Mark this PR as fresh with /remove-lifecycle stale
Close this PR with /close
Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

shubham-pampattiwar commented 1 year ago

@ggriffiths @xing-yang If anyone is not actively working on this issue, I would like to work on it.

xing-yang commented 1 year ago

Hi @shubham-pampattiwar, that's great. Can you please assign this issue to yourself? Thanks.

xing-yang commented 1 year ago

/remove-lifecycle stale

shubham-pampattiwar commented 1 year ago

/assign

ggriffiths commented 1 year ago

/unassign @shubham-pampattiwar

phoenix-bjoern commented 1 year ago

Spottet this issue when using Velero 1.12.1 and CSI plugin 0.6.1 together with SC 6.3.1:

E1106 14:02:37.353831       1 snapshot_controller.go:107] createSnapshot for content [snapcontent-3cce7669-3854-4ee3-a157-d3abb5ff2dcd]: error occurred in createSnapshotWrapper: failed to remove VolumeSnapshotBeingCreated annotation on the content snapcontent-3cce7669-3854-4ee3-a157-d3abb5ff2dcd: "snapshot controller failed to update snapcontent-3cce7669-3854-4ee3-a157-d3abb5ff2dcd on API server: Operation cannot be fulfilled on volumesnapshotcontents.snapshot.storage.k8s.io \"snapcontent-3cce7669-3854-4ee3-a157-d3abb5ff2dcd\": the object has been modified; please apply your changes to the latest version and try again"
E1106 14:02:37.353865       1 snapshot_controller_base.go:351] could not sync content "snapcontent-3cce7669-3854-4ee3-a157-d3abb5ff2dcd": failed to remove VolumeSnapshotBeingCreated annotation on the content snapcontent-3cce7669-3854-4ee3-a157-d3abb5ff2dcd: "snapshot controller failed to update snapcontent-3cce7669-3854-4ee3-a157-d3abb5ff2dcd on API server: Operation cannot be fulfilled on volumesnapshotcontents.snapshot.storage.k8s.io \"snapcontent-3cce7669-3854-4ee3-a157-d3abb5ff2dcd\": the object has been modified; please apply your changes to the latest version and try again"
I1106 14:02:37.353904       1 snapshot_controller.go:291] createSnapshotWrapper: Creating snapshot for content snapcontent-3cce7669-3854-4ee3-a157-d3abb5ff2dcd through the plugin ...
I1106 14:02:37.353931       1 event.go:298] Event(v1.ObjectReference{Kind:"VolumeSnapshotContent", Namespace:"", Name:"snapcontent-3cce7669-3854-4ee3-a157-d3abb5ff2dcd", UID:"194a4dc1-c11c-44d4-acb4-504a0fed9314", APIVersion:"snapshot.storage.k8s.io/v1", ResourceVersion:"934203821", FieldPath:""}): type: 'Warning' reason: 'SnapshotCreationFailed' Failed to create snapshot: failed to remove VolumeSnapshotBeingCreated annotation on the content snapcontent-3cce7669-3854-4ee3-a157-d3abb5ff2dcd: "snapshot controller failed to update snapcontent-3cce7669-3854-4ee3-a157-d3abb5ff2dcd on API server: Operation cannot be fulfilled on volumesnapshotcontents.snapshot.storage.k8s.io \"snapcontent-3cce7669-3854-4ee3-a157-d3abb5ff2dcd\": the object has been modified; please apply your changes to the latest version and try again"

As the logs indicate the patch rewrites are not in the 6.3.1 release yet. So please release them ASAP to close this issue.

phoenix-bjoern commented 1 year ago

I've cherry-picked commit https://github.com/kubernetes-csi/external-snapshotter/commit/ff71329d8cc08dca53194f8d4568ed53e69ecb82 from main to the release-6.3 branch and created new images:

phoenixmedia/k8s-snapshot-controller:v6.3.1-patch1
phoenixmedia/k8s-csi-snapshotter:v6.3.1-patch1

I can confirm the patch rewrites solve this issue. IMHO the patch could be safely merged to the release-6.3 branch.

phoenix-bjoern commented 1 year ago

See https://github.com/kubernetes-csi/external-snapshotter/pull/876#issuecomment-1797945863

hairyhum commented 1 year ago

We have also seen similar errors about snapshot status update, most likely coming from UpdateStatus mentioned above. I don't see them addressed in the PR, but I believe those are also important.

julienvincent commented 11 months ago

Any update on this? This happens pretty consistently for me. Running v6.3.3.

phoenix-bjoern commented 11 months ago

@julienvincent can you provide logs to understand the context? How do you create snapshots? Often two or more processes interact with the VolumeSnapshot objects.

julienvincent commented 11 months ago

@phoenix-bjoern Sure! What kind of logs would you like? Happy to provide.

Similar to other users in this thread, I am using Velero@1.12.2 with the csi plugin at 0.6.2 and Longhorn@1.5.3. Velero is creating the VolumeSnapshots and I am getting the same error as reported in this issue.

For example:

  Error:
    Message:     Failed to check and update snapshot content: failed to remove VolumeSnapshotBeingCreated annotation on the content snapcontent-48e5a79d-6d41-4b28-9d17-24cfaa920cad: "snapshot controller failed to update snapcontent-48e5a79d-6d41-4b28-9d17-24cfaa920cad on API server: Operation cannot be fulfilled on volumesnapshotcontents.snapshot.storage.k8s.io \"snapcontent-48e5a79d-6d41-4b28-9d17-24cfaa920cad\": the object has been modified; please apply your changes to the latest version and try again"
    Time:        2023-12-28T00:01:34Z
  Ready To Use:  false

Happy to provide any other information you need.

phoenix-bjoern commented 11 months ago

@julienvincent the snapshot controller log should have a trace for the error. Can you share it to identify the exact code lines?

julienvincent commented 11 months ago

@phoenix-bjoern sure, here is an example:

E1230 12:33:25.731663       1 snapshot_controller_base.go:470] could not sync snapshot "default/velero-registry-data-7737fc94-tn4m2": snapshot controller failed to update velero-registry-data-7737fc94-tn4m2 on API server: Operation cannot be fulfilled on volumesnapshots.snapshot.storage.k8s.io "velero-registry-data-7737fc94-tn4m2": StorageError: invalid object, Code: 4, Key: /registry/snapshot.storage.k8s.io/volumesnapshots/default/velero-registry-data-7737fc94-tn4m2, ResourceVersion: 0, AdditionalErrorMsg: Precondition failed: UID in precondition: 84d606d2-68a0-4ea7-b670-ea3068ce4975, UID in object meta:
I1230 12:33:25.731729       1 snapshot_controller_base.go:288] Failed to sync snapshot "default/velero-registry-data-7737fc94-tn4m2", will retry again: snapshot controller failed to update velero-registry-data-7737fc94-tn4m2 on API server: Operation cannot be fulfilled on volumesnapshots.snapshot.storage.k8s.io "velero-registry-data-7737fc94-tn4m2": StorageError: invalid object, Code: 4, Key: /registry/snapshot.storage.k8s.io/volumesnapshots/default/velero-registry-data-7737fc94-tn4m2, ResourceVersion: 0, AdditionalErrorMsg: Precondition failed: UID in precondition: 84d606d2-68a0-4ea7-b670-ea3068ce4975, UID in object meta:

phoenix-bjoern commented 11 months ago

Thanks @julienvincent for the additional context. The trace shows a storage error: StorageError: invalid object, Code: 4. It seems the snapshot controller doesn't have an issue. It only catches the exception thrown by Velero and/or the CSI driver. Maybe check https://github.com/vmware-tanzu/velero/issues/6179 for more context.

julienvincent commented 11 months ago

@phoenix-bjoern I might be misunderstanding something but the linked issue doesn't seem directly related to this issue.

That issue is more about longhorn behaviour of not deleting it's internal snapshots when CSI volume snapshots are deleted. AFAIU Velero is not actually involved in the CSI snapshot process other than initially creating the VolumeSnapshot resource. I can reproduce this issue independently of Velero by manually creating a VolumeSnapshot which I think excludes Velero's involvement in this.

In this case the snapshots themselves are not being reported as successful (but the underlying driver is successfully performing a snapshot).

But if I understand what you are saying this error message is set on the VolumeSnapshot resource by the driver (longhorn) and external-snapshotter is just reporting/relaying it?

Would you recommend opening an issue with longhorn?

phoenix-bjoern commented 11 months ago

@julienvincent The snapshot controller is only triggering a process which the storage driver then executes. Since the error seems to occur in the storage driver there is nothing you can in the snapshot controller. So yes, it makes sense to continue investigating the issue in the Longhorn components.

kain88-de commented 6 months ago

Hi. We are running into the same issues. Have the UpdateStatus calls already been refactored to use apply instead of patch? If not I would like to help.

hoyho commented 5 months ago

Hi. We are running into the same issues. Have the UpdateStatus calls already been refactored to use apply instead of patch? If not I would like to help.

Having refactored last month. However, some of tests failed since then. So I took several days to fix them until now.

SCLogo commented 4 months ago

@hoyho did you have time to finish ? do you know any ETA? thanks

hoyho commented 4 months ago

@hoyho did you have time to finish ? do you know any ETA? thanks

No problems. Probably will do it next week

kubernetes-csi / external-snapshotter

snapshot-controller logs report failure frequently #748

Guidelines