vmware-tanzu / velero

Backup and migrate Kubernetes applications and their persistent volumes
https://velero.io
Apache License 2.0

DataUpload isn't canceled even though the Backup is marked as "Failed" when the Velero pod restarts #7230

Open ywk253100 opened 11 months ago

ywk253100 commented 11 months ago

Restart the Velero pod while the Backup CR is in the InProgress status (and the DataUpload CR is in the Accepted status). When the Velero pod starts up again, the Backup CR is marked as Failed, but the DataUpload CR isn't canceled, and after a while the Backup CR moves to WaitingForPluginOperations and then Completed.

And here is the final status of the backup, with phase Completed but failureReason still set to found a backup with status "InProgress" during the server starting, mark it as "Failed":

status:
  backupItemOperationsAttempted: 1
  backupItemOperationsCompleted: 1
  completionTimestamp: "2023-12-19T07:50:44Z"
  expiration: "2024-01-18T07:49:14Z"
  failureReason: found a backup with status "InProgress" during the server starting,
    mark it as "Failed"
  formatVersion: 1.1.0
  hookStatus:
    hooksAttempted: 1
  phase: Completed
  progress:
    itemsBackedUp: 32
    totalItems: 32
  startTimestamp: "2023-12-19T07:49:15Z"
  version: 1
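
The DataUpload CR exposes a cancel flag (spec.cancel) that the node-agent controllers honor, and nothing sets it when the backup is failed at startup. A minimal sketch of the cancellation step the reporter expected, assuming controller-runtime's client and the velero.io/backup-name label; the helper name is hypothetical and this is not Velero's actual server code:

package sketch

import (
    "context"

    velerov2alpha1 "github.com/vmware-tanzu/velero/pkg/apis/velero/v2alpha1"
    "sigs.k8s.io/controller-runtime/pkg/client"
)

// cancelDataUploads flags every DataUpload belonging to a backup for cancellation.
// Sketch only: the label selector and helper name are assumptions, not the actual
// Velero implementation.
func cancelDataUploads(ctx context.Context, c client.Client, ns, backupName string) error {
    duList := &velerov2alpha1.DataUploadList{}
    if err := c.List(ctx, duList, client.InNamespace(ns),
        client.MatchingLabels{"velero.io/backup-name": backupName}); err != nil {
        return err
    }
    for i := range duList.Items {
        du := &duList.Items[i]
        if du.Spec.Cancel {
            continue // already being canceled
        }
        du.Spec.Cancel = true // node-agent controllers react to this flag and abort the upload
        if err := c.Update(ctx, du); err != nil {
            return err
        }
    }
    return nil
}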

Vote on this issue!

This is an invitation to the Velero community to vote on issues; you can see the project's top-voted issues listed here.
Use the "reaction smiley face" at the upper right of this comment to vote.

qiuming-best commented 10 months ago

It occurs during the Velero deployment's termination grace period:

the terminating Velero server pod is still running the data upload flow while the newly started Velero server pod is simultaneously marking the DataUpload CR as "Failed".

This corner case wouldn't happen if the Velero pod is OOM-killed, so we decided to postpone the fix as low priority.
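To make the race concrete: the new pod's startup pass fails any backup it finds InProgress, while the old pod's controllers are still driving that same backup. A simplified sketch of that startup pass, which is what produces the failureReason message shown above (the helper name is hypothetical; the real logic lives in the Velero server):

package sketch

import (
    "context"

    velerov1 "github.com/vmware-tanzu/velero/pkg/apis/velero/v1"
    "sigs.k8s.io/controller-runtime/pkg/client"
)

// failInProgressBackups marks any backup still InProgress at server startup as Failed.
// Simplified sketch of the behavior described in this issue, not the actual code.
func failInProgressBackups(ctx context.Context, c client.Client, ns string) error {
    backups := &velerov1.BackupList{}
    if err := c.List(ctx, backups, client.InNamespace(ns)); err != nil {
        return err
    }
    for i := range backups.Items {
        b := &backups.Items[i]
        if b.Status.Phase != velerov1.BackupPhaseInProgress {
            continue
        }
        b.Status.Phase = velerov1.BackupPhaseFailed
        b.Status.FailureReason = `found a backup with status "InProgress" during the server starting, mark it as "Failed"`
        // The gap discussed in this issue: the backup's DataUploads are not
        // canceled here, so they keep progressing in the background.
        if err := c.Update(ctx, b); err != nil {
            return err
        }
    }
    return nil
}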

reasonerjt commented 9 months ago

As @qiuming-best explained in this comment, this is not likely to happen in a real usage scenario. A possible solution to this problem is to introduce a leader election mechanism so that there won't be two velero servers working at the same time. This may be put into the backlog, but it is not very important for v1.14.
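For reference, a leader election mechanism of this kind is what client-go's leaderelection package provides. A minimal, self-contained sketch; the lock name, namespace, and POD_NAME environment variable are illustrative assumptions, not Velero's actual configuration:

package main

import (
    "context"
    "os"
    "time"

    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    "k8s.io/client-go/kubernetes"
    "k8s.io/client-go/rest"
    "k8s.io/client-go/tools/leaderelection"
    "k8s.io/client-go/tools/leaderelection/resourcelock"
)

func main() {
    cfg, err := rest.InClusterConfig()
    if err != nil {
        panic(err)
    }
    clientset := kubernetes.NewForConfigOrDie(cfg)

    // Lease-based lock; only one pod holds it at a time.
    lock := &resourcelock.LeaseLock{
        LeaseMeta:  metav1.ObjectMeta{Name: "velero-server-lock", Namespace: "velero"},
        Client:     clientset.CoordinationV1(),
        LockConfig: resourcelock.ResourceLockConfig{Identity: os.Getenv("POD_NAME")},
    }

    leaderelection.RunOrDie(context.Background(), leaderelection.LeaderElectionConfig{
        Lock:            lock,
        ReleaseOnCancel: true,
        LeaseDuration:   15 * time.Second,
        RenewDeadline:   10 * time.Second,
        RetryPeriod:     2 * time.Second,
        Callbacks: leaderelection.LeaderCallbacks{
            OnStartedLeading: func(ctx context.Context) {
                // Only the leader would run the backup controllers and the
                // "mark stale InProgress backups as Failed" startup pass.
            },
            OnStoppedLeading: func() {
                // A terminating pod stops its controllers here instead of
                // racing with the replacement pod.
                os.Exit(0)
            },
        },
    })
}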

github-actions[bot] commented 6 months ago

This issue is stale because it has been open 60 days with no activity. Remove stale label or comment or this will be closed in 14 days. If a Velero team member has requested log or more information, please provide the output of the shared commands.

kaovilai commented 6 months ago

unstale

github-actions[bot] commented 4 months ago

This issue is stale because it has been open 60 days with no activity. Remove stale label or comment or this will be closed in 14 days. If a Velero team member has requested log or more information, please provide the output of the shared commands.

kaovilai commented 4 months ago

unstale

github-actions[bot] commented 2 months ago

This issue is stale because it has been open 60 days with no activity. Remove stale label or comment or this will be closed in 14 days. If a Velero team member has requested log or more information, please provide the output of the shared commands.

blackpiglet commented 1 month ago

unstale

rkashasl commented 1 month ago

I have the same issue; restarting doesn't help, and it fails every time with the same error:

Name:         velero-daily-20241009130824
Namespace:    velero
Labels:       app.kubernetes.io/instance=velero
              app.kubernetes.io/managed-by=Helm
              app.kubernetes.io/name=velero
              helm.sh/chart=velero-5.4.1
              helm.toolkit.fluxcd.io/name=velero
              helm.toolkit.fluxcd.io/namespace=velero
              velero.io/schedule-name=velero-daily
              velero.io/storage-location=default
Annotations:  meta.helm.sh/release-name: velero
              meta.helm.sh/release-namespace: velero
              velero.io/resource-timeout: 10m0s
              velero.io/source-cluster-k8s-gitversion: v1.29.8-eks-a737599
              velero.io/source-cluster-k8s-major-version: 1
              velero.io/source-cluster-k8s-minor-version: 29+
API Version:  velero.io/v1
Kind:         Backup
Metadata:
  Creation Timestamp:  2024-10-09T13:08:24Z
  Generation:          3
  Resource Version:    264000873
  UID:                 f8aeda37-a39d-42ad-848e-7752c2667927
Spec:
  Csi Snapshot Timeout:          10m0s
  Default Volumes To Fs Backup:  false
  Hooks:
  Include Cluster Resources:  true
  Item Operation Timeout:     4h0m0s
  Metadata:
  Snapshot Move Data:  false
  Snapshot Volumes:    true
  Storage Location:    default
  Ttl:                 168h0m0s
  Volume Snapshot Locations:
    default
Status:
  Completion Timestamp:  2024-10-09T13:09:23Z
  Expiration:            2024-10-16T13:08:24Z
  Failure Reason:        found a backup with status "InProgress" during the server starting, mark it as "Failed"
  Format Version:        1.1.0
  Phase:                 Failed
  Start Timestamp:       2024-10-09T13:08:24Z
  Version:               1
Events:                  <none>

blackpiglet commented 1 month ago

@rkashasl I think your error is not related to this issue. Please try increasing the Velero deployment's resource settings to resolve your issue.

rkashasl commented 1 month ago

@rkashasl I think your error is not related to this issue. Please try increasing the Velero deployment's resource settings to resolve your issue.

Resources are fine: [image]

However, when I completely removed Velero from the cluster, including all CRDs, and then reconciled Flux to get it back, all backups created after provisioning completed successfully. But when I then ran backup create --from-schedule velero-daily and checked the backup status, it went from InProgress to Failed with the same error as before:

Name:         velero-daily-20241010070123
Namespace:    velero
Labels:       app.kubernetes.io/instance=velero
              app.kubernetes.io/managed-by=Helm
              app.kubernetes.io/name=velero
              helm.sh/chart=velero-5.4.1
              helm.toolkit.fluxcd.io/name=velero
              helm.toolkit.fluxcd.io/namespace=velero
              velero.io/schedule-name=velero-daily
              velero.io/storage-location=default
Annotations:  meta.helm.sh/release-name: velero
              meta.helm.sh/release-namespace: velero
              velero.io/resource-timeout: 10m0s
              velero.io/source-cluster-k8s-gitversion: v1.29.8-eks-a737599
              velero.io/source-cluster-k8s-major-version: 1
              velero.io/source-cluster-k8s-minor-version: 29+
API Version:  velero.io/v1
Kind:         Backup
Metadata:
  Creation Timestamp:  2024-10-10T07:01:23Z
  Generation:          3
  Resource Version:    264911127
  UID:                 273a47b5-9b1a-4d01-a9c5-bc510e3b5c47
Spec:
  Csi Snapshot Timeout:          10m0s
  Default Volumes To Fs Backup:  false
  Hooks:
  Include Cluster Resources:  true
  Item Operation Timeout:     4h0m0s
  Metadata:
  Snapshot Move Data:  false
  Snapshot Volumes:    true
  Storage Location:    default
  Ttl:                 168h0m0s
  Volume Snapshot Locations:
    default
Status:
  Completion Timestamp:  2024-10-10T07:02:04Z
  Expiration:            2024-10-17T07:01:23Z
  Failure Reason:        found a backup with status "InProgress" during the server starting, mark it as "Failed"
  Format Version:        1.1.0
  Phase:                 Failed
  Start Timestamp:       2024-10-10T07:01:24Z
  Version:               1
Events:                  <none>

rkashasl commented 1 month ago

I increased the memory requests to 1Gi and the limits to 2Gi, and also adjusted the CPU requests to 250m, and everything started to work as it should. Btw, before I did that I noticed a Velero server restart during the backup procedure in the pod logs. However, I don't think this gives a clear picture of what the problem is, as in Grafana I don't see CPU or memory utilization above 30% of the requests; maybe you can take this into account.
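For anyone hitting the same restart-driven failures, this is roughly what that change looks like as Helm chart values (the top-level resources key is assumed from the velero chart's documented defaults; adjust for your own Flux HelmRelease):

# Sketch of velero Helm chart values matching the settings described above;
# key names are assumed from the chart's defaults, not taken from this cluster.
resources:
  requests:
    cpu: 250m
    memory: 1Gi
  limits:
    memory: 2Gi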