longhorn / longhorn

Cloud-Native distributed storage built on and for Kubernetes
https://longhorn.io
Apache License 2.0

[BUG] Unable to restore backup (unsupported value) #7602

Open withinboredom opened 10 months ago

withinboredom commented 10 months ago

Describe the bug

When restoring a backup (migrating a workload to a new cluster; both clusters run Longhorn 1.5.3 with exactly the same settings), the backup fails to import. Most other volumes were migrated/restored just fine.

The error is as follows (formatting mine):

failed to create volume: unable to create volume pvc-309f5eb1-dff1-409f-9806-a43f8505b046: Volume.longhorn.io "pvc-309f5eb1-dff1-409f-9806-a43f8505b046" is invalid: 

[spec.snapshotDataIntegrity: Unsupported value: "": 
supported values: "ignored", "disabled", "enabled", "fast-check",

spec.backupCompressionMethod: Unsupported value: "": 
supported values: "none", "lz4", "gzip", 

spec.replicaAutoBalance: Unsupported value: "": 
supported values: "ignored", "disabled", "least-effort", "best-effort", 

spec.replicaSoftAntiAffinity: Unsupported value: "": 
supported values: "ignored", "enabled", "disabled", 

spec.unmapMarkSnapChainRemoved: Unsupported value: "": 
supported values: "ignored", "disabled", "enabled", 

spec.backendStoreDriver: Unsupported value: "": 
supported values: "v1", "v2", 

spec.offlineReplicaRebuilding: Unsupported value: "": 
supported values: "ignored", "disabled", "enabled", 

spec.replicaZoneSoftAntiAffinity: Unsupported value: "": 
supported values: "ignored", "enabled", "disabled", 

spec.dataLocality: Unsupported value: "": 
supported values: "disabled", "best-effort", "strict-local"]

I've checked the workarounds (thinking it was like #6582), but they do not apply.

To Reproduce

Not sure.

Expected behavior

To be able to restore a backup even if an unsupported value is present. IMHO, it should just use the default and show a warning; backups should always be restorable.

Support bundle for troubleshooting

Support bundles:

Too big to upload (147 MB), but available upon request.

Environment

Additional context

james-munson commented 10 months ago

We will likely need a support bundle (maybe one from each cluster). Is it possible to email it to longhorn-support-bundle@Suse.com?

derekbit commented 10 months ago

It looks like the mutating webhook somehow isn't working. @james-munson Can you help check this part? Thank you.

https://github.com/longhorn/longhorn-manager/blob/v1.5.3/webhook/resources/volume/mutator.go#L59-L61

james-munson commented 10 months ago

Agreed. @ejweber notes (in Slack):

Ran some quick tests. It is fine to create a volume with those fields set to empty. Our mutating webhook mutates them BEFORE (I think) Kubernetes does any validation. However, deleting the mutatingwebhookconfiguration and then creating a volume CR with snapshotDataIntegrity: "" yields:

k apply -f volume.yaml 
The Volume "test" is invalid: spec.snapshotDataIntegrity: Unsupported value: "": supported values: "ignored", "disabled", "enabled", "fast-check"

I think the user's mutating webhook is broken.
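The ordering described above (mutation first, schema validation second) can be modeled with a small sketch. This is only an illustration, not Longhorn or Kubernetes code: the enum values are copied from the error in this issue, while the defaults and the function names `mutate`, `validate`, and `admit` are assumptions made up for the example.

```python
# Illustrative model only; not Longhorn or Kubernetes code.
# Enum values are copied from the error message in this issue.
SUPPORTED = {
    "snapshotDataIntegrity": ["ignored", "disabled", "enabled", "fast-check"],
    "replicaAutoBalance": ["ignored", "disabled", "least-effort", "best-effort"],
    "dataLocality": ["disabled", "best-effort", "strict-local"],
}

# Assumed defaults, chosen for illustration only.
ASSUMED_DEFAULTS = {
    "snapshotDataIntegrity": "ignored",
    "replicaAutoBalance": "ignored",
    "dataLocality": "disabled",
}


def mutate(spec):
    """Return a copy of spec with empty or missing enum fields defaulted."""
    patched = dict(spec)
    for field, default in ASSUMED_DEFAULTS.items():
        if not patched.get(field):
            patched[field] = default
    return patched


def validate(spec):
    """Return one error string per field whose value is not in its enum."""
    errors = []
    for field, allowed in SUPPORTED.items():
        value = spec.get(field, "")
        if value not in allowed:
            errors.append(f'spec.{field}: Unsupported value: "{value}"')
    return errors


def admit(spec, mutator_registered):
    """Run the optional mutation step, then schema validation."""
    if mutator_registered:
        spec = mutate(spec)
    return validate(spec)


restored_spec = {"snapshotDataIntegrity": "", "replicaAutoBalance": "", "dataLocality": ""}
print(admit(restored_spec, mutator_registered=True))   # empty fields are defaulted first
print(admit(restored_spec, mutator_registered=False))  # empty fields reach validation
```

With the mutation step in place the empty fields never reach validation; without it, every empty field is reported, which matches the shape of the error in the original report.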

and @PhanLe1010:

Currently the webhook has failurePolicy: Fail, so if the request failed at the mutating webhook level I would expect a different error, such as failing to reach the webhook or connection refused. In other words, I agree that the manager and webhook are functional. That turns attention to whether the MutatingWebhookConfiguration exists and whether it has the correct config.
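The reasoning about failurePolicy can be sketched the same way. This is a simplified model under stated assumptions, not the API server's implementation: `call_webhook`, `admit_with_policy`, the error strings, and the "ignored" default are all hypothetical and exist only to show why a validation error points at a missing or misconfigured MutatingWebhookConfiguration rather than an unreachable webhook.

```python
# Illustrative model only; not the Kubernetes API server implementation.
ALLOWED = ("ignored", "disabled", "enabled", "fast-check")


class WebhookUnreachable(Exception):
    """Stands in for a connection-refused or timeout error when calling a webhook."""


def call_webhook(spec, reachable):
    """Hypothetical webhook call: defaults one empty field, or raises if unreachable."""
    if not reachable:
        raise WebhookUnreachable("connection refused")
    patched = dict(spec)
    if not patched.get("snapshotDataIntegrity"):
        patched["snapshotDataIntegrity"] = "ignored"  # assumed default
    return patched


def admit_with_policy(spec, config_exists, reachable, failure_policy="Fail"):
    """Return a short description of the outcome of a create request."""
    if config_exists:
        try:
            spec = call_webhook(spec, reachable)
        except WebhookUnreachable as err:
            if failure_policy == "Fail":
                return f"rejected: failed calling webhook: {err}"
            # failure_policy == "Ignore": continue with the unmutated object
    value = spec.get("snapshotDataIntegrity", "")
    if value not in ALLOWED:
        return f'rejected: spec.snapshotDataIntegrity: Unsupported value: "{value}"'
    return "admitted"


spec = {"snapshotDataIntegrity": ""}
print(admit_with_policy(spec, config_exists=True, reachable=True))
print(admit_with_policy(spec, config_exists=True, reachable=False))
print(admit_with_policy(spec, config_exists=False, reachable=True))
```

In this model, a registered but unreachable webhook with failurePolicy: Fail produces a call failure, while the "Unsupported value" error only appears when the webhook is never invoked (or its failure is ignored).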

withinboredom commented 10 months ago

I worked around the issue by deleting ALL backups (simply taking another backup wouldn't resolve the problem) and then taking a new backup. I just wanted to let you know that this worked for me.

However, it does concern me that this can happen in disaster recovery scenarios. Could this backup series be corrupted at some point and never resolved by taking more backups?

I was unable to send the support bundle to the email address. I'll upload it to S3 before this weekend.

github-actions[bot] commented 2 hours ago

This issue is stale because it has been open for 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.

derekbit commented 1 hour ago

I remember we've improved the webhook in v1.6.0. @james-munson Do you remember which one?