Open ywk253100 opened 11 months ago
It occurs during the Velero deployment's termination grace period: the terminating Velero server pod is still running the data upload flow while the newly started Velero server pod is simultaneously marking the DataUpload CR as "Failed".
This corner case wouldn't happen if the Velero pod is OOM-killed, so we decided to postpone the fix as low priority.
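The failure reason reported later in this thread ("found a backup with status "InProgress" during the server starting, mark it as "Failed"") comes from the same restart path. Purely as an illustration, and not Velero's actual code, a startup pass of that shape could look roughly like the sketch below; it assumes a controller-runtime client and the Velero v1 API types, and the helper name `failInProgressBackups` is made up for this example:

```go
package main

import (
	"context"
	"fmt"

	velerov1 "github.com/vmware-tanzu/velero/pkg/apis/velero/v1"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// failInProgressBackups is a hypothetical helper (not Velero code) that mimics
// the startup behavior described above: any Backup still marked InProgress is
// assumed to be orphaned by a previous server instance and is marked Failed.
// Nothing here cancels the corresponding DataUpload, which is why a terminating
// pod that is still uploading can race with the newly started pod.
func failInProgressBackups(ctx context.Context, c client.Client, namespace string) error {
	backups := &velerov1.BackupList{}
	if err := c.List(ctx, backups, client.InNamespace(namespace)); err != nil {
		return fmt.Errorf("listing backups: %w", err)
	}

	for i := range backups.Items {
		backup := &backups.Items[i]
		if backup.Status.Phase != velerov1.BackupPhaseInProgress {
			continue
		}
		backup.Status.Phase = velerov1.BackupPhaseFailed
		backup.Status.FailureReason = `found a backup with status "InProgress" during the server starting, mark it as "Failed"`
		if err := c.Status().Update(ctx, backup); err != nil {
			return fmt.Errorf("updating backup %s: %w", backup.Name, err)
		}
	}
	return nil
}
```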
As @qiuming-best explained in this comment, this is not likely to happen in real usage scenarios. A possible solution to this problem is to introduce a leader election mechanism so that two Velero servers can never be running at the same time. This may be put into the backlog, but it is not very important for v1.14.
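As a rough sketch of that idea, assuming the server is built on controller-runtime (the lease name and namespace below are made up; this is not an existing Velero option), leader election could look something like this:

```go
package main

import (
	"log"

	ctrl "sigs.k8s.io/controller-runtime"
)

func main() {
	// With leader election enabled, only one velero server instance reconciles
	// at a time: the newly started pod does not begin working until the old
	// pod releases the lease, so the two cannot race on the same CRs.
	mgr, err := ctrl.NewManager(ctrl.GetConfigOrDie(), ctrl.Options{
		LeaderElection:          true,
		LeaderElectionID:        "velero-server-lock", // hypothetical lease name
		LeaderElectionNamespace: "velero",
	})
	if err != nil {
		log.Fatalf("unable to create manager: %v", err)
	}

	// Controllers registered with mgr only start running once this instance
	// becomes the leader.
	if err := mgr.Start(ctrl.SetupSignalHandler()); err != nil {
		log.Fatalf("manager exited: %v", err)
	}
}
```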
This issue is stale because it has been open 60 days with no activity. Remove stale label or comment or this will be closed in 14 days. If a Velero team member has requested log or more information, please provide the output of the shared commands.
unstale
I have the same issue; restarting doesn't help, it fails every time with the same error:
Name:         velero-daily-20241009130824
Namespace:    velero
Labels:       app.kubernetes.io/instance=velero
              app.kubernetes.io/managed-by=Helm
              app.kubernetes.io/name=velero
              helm.sh/chart=velero-5.4.1
              helm.toolkit.fluxcd.io/name=velero
              helm.toolkit.fluxcd.io/namespace=velero
              velero.io/schedule-name=velero-daily
              velero.io/storage-location=default
Annotations:  meta.helm.sh/release-name: velero
              meta.helm.sh/release-namespace: velero
              velero.io/resource-timeout: 10m0s
              velero.io/source-cluster-k8s-gitversion: v1.29.8-eks-a737599
              velero.io/source-cluster-k8s-major-version: 1
              velero.io/source-cluster-k8s-minor-version: 29+
API Version:  velero.io/v1
Kind:         Backup
Metadata:
  Creation Timestamp:  2024-10-09T13:08:24Z
  Generation:          3
  Resource Version:    264000873
  UID:                 f8aeda37-a39d-42ad-848e-7752c2667927
Spec:
  Csi Snapshot Timeout:          10m0s
  Default Volumes To Fs Backup:  false
  Hooks:
  Include Cluster Resources:     true
  Item Operation Timeout:        4h0m0s
  Metadata:
  Snapshot Move Data:            false
  Snapshot Volumes:              true
  Storage Location:              default
  Ttl:                           168h0m0s
  Volume Snapshot Locations:
    default
Status:
  Completion Timestamp:  2024-10-09T13:09:23Z
  Expiration:            2024-10-16T13:08:24Z
  Failure Reason:        found a backup with status "InProgress" during the server starting, mark it as "Failed"
  Format Version:        1.1.0
  Phase:                 Failed
  Start Timestamp:       2024-10-09T13:08:24Z
  Version:               1
Events:                  <none>
@rkashasl I think your error is not related to this issue. Please try increasing the Velero deployment's resource settings to resolve it.
Resources are fine.
However, when I completely removed Velero from the cluster, including all CRDs, and then reconciled Flux to bring it back, all backups after provisioning completed successfully. But then I ran the command
backup create --from-schedule velero-daily
and checked the backup status: it went from InProgress to Failed with the same error as before.
Name:         velero-daily-20241010070123
Namespace:    velero
Labels:       app.kubernetes.io/instance=velero
              app.kubernetes.io/managed-by=Helm
              app.kubernetes.io/name=velero
              helm.sh/chart=velero-5.4.1
              helm.toolkit.fluxcd.io/name=velero
              helm.toolkit.fluxcd.io/namespace=velero
              velero.io/schedule-name=velero-daily
              velero.io/storage-location=default
Annotations:  meta.helm.sh/release-name: velero
              meta.helm.sh/release-namespace: velero
              velero.io/resource-timeout: 10m0s
              velero.io/source-cluster-k8s-gitversion: v1.29.8-eks-a737599
              velero.io/source-cluster-k8s-major-version: 1
              velero.io/source-cluster-k8s-minor-version: 29+
API Version:  velero.io/v1
Kind:         Backup
Metadata:
  Creation Timestamp:  2024-10-10T07:01:23Z
  Generation:          3
  Resource Version:    264911127
  UID:                 273a47b5-9b1a-4d01-a9c5-bc510e3b5c47
Spec:
  Csi Snapshot Timeout:          10m0s
  Default Volumes To Fs Backup:  false
  Hooks:
  Include Cluster Resources:     true
  Item Operation Timeout:        4h0m0s
  Metadata:
  Snapshot Move Data:            false
  Snapshot Volumes:              true
  Storage Location:              default
  Ttl:                           168h0m0s
  Volume Snapshot Locations:
    default
Status:
  Completion Timestamp:  2024-10-10T07:02:04Z
  Expiration:            2024-10-17T07:01:23Z
  Failure Reason:        found a backup with status "InProgress" during the server starting, mark it as "Failed"
  Format Version:        1.1.0
  Phase:                 Failed
  Start Timestamp:       2024-10-10T07:01:24Z
  Version:               1
Events:                  <none>
I increased memory requests to 1Gi and limits to 2Gi, also adjusted the CPU request to 250m, and everything started working as it should. Btw, before I did that I noticed in the pod logs that the Velero server restarted during the backup procedure. However, I don't think this gives a clear understanding of what the problem is, since in Grafana I don't see CPU or memory utilization exceeding 30% of the requests; maybe you can take this into account.
Restart the Velero pod when the `Backup` CR is in `InProgress` status (the `DataUpload` CR is in `Accepted` status): the `Backup` CR is marked as `Failed` when the Velero pod starts up again, but the `DataUpload` CR isn't canceled, and after a while the `Backup` CR is marked as `WaitingForPluginOperations` and then `Completed`.

And here is the final status of the backup, with status `Completed` but failureReason `found a backup with status "InProgress" during the server starting, mark it as "Failed"`: