vmware-tanzu / velero

Backup and migrate Kubernetes applications and their persistent volumes
https://velero.io
Apache License 2.0

Micro-service datamover backup recovery issue #8231

Open blackpiglet opened 2 months ago

blackpiglet commented 2 months ago

What steps did you take and what happened:

Create a backup for a workload with volumes, and the volume data is backed up by the CSI snapshot data mover.

Restart the Velero server pod when the backup is in the InProgress state.

The Backup ended in PartiallyFailed state, and some of the DataUploads ended as Completed.

What did you expect to happen: The backup should end in Failed state, and the DataUploads should be Canceled.

The following information will help us better understand what's going on: It is also worth noting that the Backup state changed from Failed to PartiallyFailed.

jxun@DH7PKQMYXW:s001-> /Users » jxun » go » src » github.com » vmware-tanzu » velero (0)
> _output/bin/darwin/arm64/velero backup create --include-namespaces upgrade --snapshot-move-data 1.13-12
Backup request "1.13-12" submitted successfully.
Run `velero backup describe 1.13-12` or `velero backup logs 1.13-12` for more details.
git❨ tags/v1.14.1 ❩ 

jxun@DH7PKQMYXW: /Users/jxun/go/src/github.com/vmware-tanzu/velero git:(v1.14.1)
➜    _output/bin/darwin/arm64/velero backup  get
NAME      STATUS       ERRORS   WARNINGS   CREATED                         EXPIRES   STORAGE LOCATION   SELECTOR
1.13-10   Completed    0        0          2024-09-19 21:51:11 +0800 CST   29d       default            <none>
1.13-12   InProgress   0        0          2024-09-19 21:58:27 +0800 CST   29d       default            <none>
jxun@DH7PKQMYXW: /Users/jxun/go/src/github.com/vmware-tanzu/velero git:(v1.14.1)
➜    _output/bin/darwin/arm64/velero backup  get
jxun@DH7PKQMYXW: /Users/jxun/go/src/github.com/vmware-tanzu/velero git:(v1.14.1)
➜   k -n velero rollout restart deployment/velero
deployment.apps/velero restarted

jxun@DH7PKQMYXW: /Users/jxun/go/src/github.com/vmware-tanzu/velero git:(v1.14.1)
➜   _output/bin/darwin/arm64/velero backup  get
NAME      STATUS      ERRORS   WARNINGS   CREATED                         EXPIRES   STORAGE LOCATION   SELECTOR
1.13-10   Completed   0        0          2024-09-19 21:51:11 +0800 CST   29d       default            <none>
1.13-12   Failed      0        0          2024-09-19 21:58:27 +0800 CST   29d       default            <none>
jxun@DH7PKQMYXW: /Users/jxun/go/src/github.com/vmware-tanzu/velero git:(v1.14.1)
➜   k -n velero get dataupload
NAME            STATUS      STARTED   BYTES DONE   TOTAL BYTES   STORAGE LOCATION   AGE     NODE
1.13-10-c7flp   Completed   6m14s     8589934592   8589934592    default            8m24s   gke-jxun-2-default-pool-0ec9c04d-mcjg
1.13-10-hw54t   Completed   4m19s     8589934592   8589934592    default            7m48s   gke-jxun-2-default-pool-0ec9c04d-mcjg
1.13-10-j7h2q   Completed   5m53s     8589934592   8589934592    default            8m19s   gke-jxun-2-default-pool-ae0c2179-kk61
1.13-10-xvn9s   Completed   6m4s      8589934592   8589934592    default            7m59s   gke-jxun-2-default-pool-ae0c2179-kk61
1.13-10-z4rb7   Completed   4m29s     8589934592   8589934592    default            7m33s   gke-jxun-2-default-pool-ae0c2179-kk61
1.13-10-zhzgb   Completed   5m16s     8589934592   8589934592    default            8m9s    gke-jxun-2-default-pool-0ec9c04d-mcjg
1.13-12-52q7d                                                    default            53s
1.13-12-6csmq                                                    default            32s
1.13-12-94kjp   Accepted                                         default            63s
1.13-12-ccfps   Accepted                                         default            68s
1.13-12-mv797                                                    default            42s
1.13-12-sr2lf                                                    default            22s

jxun@DH7PKQMYXW: /Users/jxun/go/src/github.com/vmware-tanzu/velero git:(v1.14.1)
➜   k -n velero get dataupload
NAME            STATUS      STARTED   BYTES DONE   TOTAL BYTES   STORAGE LOCATION   AGE     NODE
1.13-10-c7flp   Completed   7m1s      8589934592   8589934592    default            9m11s   gke-jxun-2-default-pool-0ec9c04d-mcjg
1.13-10-hw54t   Completed   5m6s      8589934592   8589934592    default            8m35s   gke-jxun-2-default-pool-0ec9c04d-mcjg
1.13-10-j7h2q   Completed   6m40s     8589934592   8589934592    default            9m6s    gke-jxun-2-default-pool-ae0c2179-kk61
1.13-10-xvn9s   Completed   6m51s     8589934592   8589934592    default            8m46s   gke-jxun-2-default-pool-ae0c2179-kk61
1.13-10-z4rb7   Completed   5m16s     8589934592   8589934592    default            8m20s   gke-jxun-2-default-pool-ae0c2179-kk61
1.13-10-zhzgb   Completed   6m3s      8589934592   8589934592    default            8m56s   gke-jxun-2-default-pool-0ec9c04d-mcjg
1.13-12-52q7d   Canceled    34s                                  default            100s
1.13-12-6csmq   Accepted                                         default            79s
1.13-12-94kjp   Canceled    35s                                  default            110s
1.13-12-ccfps   Canceled    29s                                  default            115s
1.13-12-mv797   Accepted                                         default            89s
1.13-12-sr2lf   Accepted                                         default            69s
jxun@DH7PKQMYXW: /Users/jxun/go/src/github.com/vmware-tanzu/velero git:(v1.14.1)
➜   k -n velero get dataupload -w
NAME            STATUS      STARTED   BYTES DONE   TOTAL BYTES   STORAGE LOCATION   AGE     NODE
1.13-10-c7flp   Completed   7m6s      8589934592   8589934592    default            9m16s   gke-jxun-2-default-pool-0ec9c04d-mcjg
1.13-10-hw54t   Completed   5m11s     8589934592   8589934592    default            8m40s   gke-jxun-2-default-pool-0ec9c04d-mcjg
1.13-10-j7h2q   Completed   6m45s     8589934592   8589934592    default            9m11s   gke-jxun-2-default-pool-ae0c2179-kk61
1.13-10-xvn9s   Completed   6m56s     8589934592   8589934592    default            8m51s   gke-jxun-2-default-pool-ae0c2179-kk61
1.13-10-z4rb7   Completed   5m21s     8589934592   8589934592    default            8m25s   gke-jxun-2-default-pool-ae0c2179-kk61
1.13-10-zhzgb   Completed   6m8s      8589934592   8589934592    default            9m1s    gke-jxun-2-default-pool-0ec9c04d-mcjg
1.13-12-52q7d   Canceled    39s                                  default            105s
1.13-12-6csmq   Accepted                                         default            84s
1.13-12-94kjp   Canceled    40s                                  default            115s
1.13-12-ccfps   Canceled    34s                                  default            2m
1.13-12-mv797   Accepted                                         default            94s
1.13-12-sr2lf   Accepted                                         default            74s

1.13-12-mv797   Prepared                                         default            117s    gke-jxun-2-default-pool-ae0c2179-kk61
1.13-12-mv797   InProgress   0s                                   default            117s    gke-jxun-2-default-pool-ae0c2179-kk61
1.13-12-mv797   InProgress   4s                     8589934592    default            2m1s    gke-jxun-2-default-pool-ae0c2179-kk61
1.13-12-mv797   InProgress   4s        8589934592   8589934592    default            2m1s    gke-jxun-2-default-pool-ae0c2179-kk61
1.13-12-mv797   Completed    10s       8589934592   8589934592    default            2m7s    gke-jxun-2-default-pool-ae0c2179-kk61
1.13-12-mv797   Completed    10s       8589934592   8589934592    default            2m7s    gke-jxun-2-default-pool-ae0c2179-kk61
1.13-12-6csmq   Prepared                                          default            2m9s    gke-jxun-2-default-pool-0ec9c04d-mcjg
1.13-12-6csmq   InProgress   0s                                   default            2m9s    gke-jxun-2-default-pool-0ec9c04d-mcjg
1.13-12-6csmq   InProgress   4s                     8589934592    default            2m13s   gke-jxun-2-default-pool-0ec9c04d-mcjg
1.13-12-6csmq   InProgress   5s        8589934592   8589934592    default            2m14s   gke-jxun-2-default-pool-0ec9c04d-mcjg
1.13-12-sr2lf   Prepared                                          default            2m7s    gke-jxun-2-default-pool-ae0c2179-kk61
1.13-12-sr2lf   InProgress   0s                                   default            2m7s    gke-jxun-2-default-pool-ae0c2179-kk61
1.13-12-6csmq   Completed    11s       8589934592   8589934592    default            2m20s   gke-jxun-2-default-pool-0ec9c04d-mcjg
1.13-12-6csmq   Completed    11s       8589934592   8589934592    default            2m20s   gke-jxun-2-default-pool-0ec9c04d-mcjg
1.13-12-sr2lf   InProgress   5s                     8589934592    default            2m12s   gke-jxun-2-default-pool-ae0c2179-kk61
1.13-12-sr2lf   InProgress   5s        8589934592   8589934592    default            2m12s   gke-jxun-2-default-pool-ae0c2179-kk61
1.13-12-sr2lf   Completed    11s       8589934592   8589934592    default            2m18s   gke-jxun-2-default-pool-ae0c2179-kk61
1.13-12-sr2lf   Completed    11s       8589934592   8589934592    default            2m18s   gke-jxun-2-default-pool-ae0c2179-kk61
^C
jxun@DH7PKQMYXW: /Users/jxun/go/src/github.com/vmware-tanzu/velero git:(v1.14.1)
➜   k -n velero get dataupload
NAME            STATUS      STARTED   BYTES DONE   TOTAL BYTES   STORAGE LOCATION   AGE     NODE
1.13-10-c7flp   Completed   8m53s     8589934592   8589934592    default            11m     gke-jxun-2-default-pool-0ec9c04d-mcjg
1.13-10-hw54t   Completed   6m58s     8589934592   8589934592    default            10m     gke-jxun-2-default-pool-0ec9c04d-mcjg
1.13-10-j7h2q   Completed   8m32s     8589934592   8589934592    default            10m     gke-jxun-2-default-pool-ae0c2179-kk61
1.13-10-xvn9s   Completed   8m43s     8589934592   8589934592    default            10m     gke-jxun-2-default-pool-ae0c2179-kk61
1.13-10-z4rb7   Completed   7m8s      8589934592   8589934592    default            10m     gke-jxun-2-default-pool-ae0c2179-kk61
1.13-10-zhzgb   Completed   7m55s     8589934592   8589934592    default            10m     gke-jxun-2-default-pool-0ec9c04d-mcjg
1.13-12-52q7d   Canceled    2m26s                                default            3m32s
1.13-12-6csmq   Completed   62s       8589934592   8589934592    default            3m11s   gke-jxun-2-default-pool-0ec9c04d-mcjg
1.13-12-94kjp   Canceled    2m27s                                default            3m42s
1.13-12-ccfps   Canceled    2m21s                                default            3m47s
1.13-12-mv797   Completed   84s       8589934592   8589934592    default            3m21s   gke-jxun-2-default-pool-ae0c2179-kk61
1.13-12-sr2lf   Completed   54s       8589934592   8589934592    default            3m1s    gke-jxun-2-default-pool-ae0c2179-kk61

> _output/bin/darwin/arm64/velero backup describe 1.13-12 --details
Name:         1.13-12
Namespace:    velero
Labels:       velero.io/storage-location=default
Annotations:  velero.io/resource-timeout=10m0s
              velero.io/source-cluster-k8s-gitversion=v1.31.0-gke.1506000
              velero.io/source-cluster-k8s-major-version=1
              velero.io/source-cluster-k8s-minor-version=31

Phase:  PartiallyFailed (run `velero backup logs 1.13-12` for more information)

Errors:
  Velero:     <none>
  Cluster:    <none>
  Namespaces: <none>

Namespaces:
  Included:  upgrade
  Excluded:  <none>

Resources:
  Included:        *
  Excluded:        <none>
  Cluster-scoped:  auto

Label selector:  <none>

Or label selector:  <none>

Storage Location:  default

Velero-Native Snapshot PVs:  auto
Snapshot Move Data:          true
Data Mover:                  velero

TTL:  720h0m0s

CSISnapshotTimeout:    10m0s
ItemOperationTimeout:  4h0m0s

Hooks:  <none>

Backup Format Version:  1.1.0

Started:    2024-09-19 21:58:27 +0800 CST
Completed:  2024-09-19 22:01:48 +0800 CST

Expiration:  2024-10-19 21:58:27 +0800 CST

Total items to be backed up:  75
Items backed up:              75

Backup Item Operations:
  Operation for persistentvolumeclaims upgrade/my-pvc:
    Backup Item Action Plugin:  velero.io/csi-pvc-backupper
    Operation ID:               du-d72f2b7e-47c7-4d70-a250-821b9c811b76.f2d95385-bb9e-4ac796a03
    Items to Update:
                           datauploads.velero.io velero/1.13-12-ccfps
    Phase:                 Failed
    Operation Error:       DataUpload is canceled
    Progress description:  Canceled
    Created:               2024-09-19 21:58:33 +0800 CST
    Started:               2024-09-19 21:59:59 +0800 CST
    Updated:               2024-09-19 21:59:59 +0800 CST
  Operation for persistentvolumeclaims upgrade/my-pvc1:
    Backup Item Action Plugin:  velero.io/csi-pvc-backupper
    Operation ID:               du-d72f2b7e-47c7-4d70-a250-821b9c811b76.f61a4262-159c-44bc04e8d
    Items to Update:
                           datauploads.velero.io velero/1.13-12-94kjp
    Phase:                 Failed
    Operation Error:       DataUpload is canceled
    Progress description:  Canceled
    Created:               2024-09-19 21:58:38 +0800 CST
    Started:               2024-09-19 21:59:53 +0800 CST
    Updated:               2024-09-19 21:59:53 +0800 CST
  Operation for persistentvolumeclaims upgrade/my-pvc2:
    Backup Item Action Plugin:  velero.io/csi-pvc-backupper
    Operation ID:               du-d72f2b7e-47c7-4d70-a250-821b9c811b76.50874ede-48d6-4644c5774
    Items to Update:
                           datauploads.velero.io velero/1.13-12-52q7d
    Phase:                 Failed
    Operation Error:       DataUpload is canceled
    Progress description:  Canceled
    Created:               2024-09-19 21:58:48 +0800 CST
    Started:               2024-09-19 21:59:54 +0800 CST
    Updated:               2024-09-19 21:59:54 +0800 CST
  Operation for persistentvolumeclaims upgrade/my-pvc3:
    Backup Item Action Plugin:  velero.io/csi-pvc-backupper
    Operation ID:               du-d72f2b7e-47c7-4d70-a250-821b9c811b76.a47b1f33-e30a-4ddb1ceef
    Items to Update:
                           datauploads.velero.io velero/1.13-12-mv797
    Phase:                 Completed
    Progress:              8589934592 of 8589934592 complete (Bytes)
    Progress description:  Completed
    Created:               2024-09-19 21:58:59 +0800 CST
    Started:               2024-09-19 22:00:56 +0800 CST
    Updated:               2024-09-19 22:01:06 +0800 CST
  Operation for persistentvolumeclaims upgrade/my-pvc4:
    Backup Item Action Plugin:  velero.io/csi-pvc-backupper
    Operation ID:               du-d72f2b7e-47c7-4d70-a250-821b9c811b76.e2f26717-7346-44e5002c3
    Items to Update:
                           datauploads.velero.io velero/1.13-12-6csmq
    Phase:                 Completed
    Progress:              8589934592 of 8589934592 complete (Bytes)
    Progress description:  Completed
    Created:               2024-09-19 21:59:09 +0800 CST
    Started:               2024-09-19 22:01:18 +0800 CST
    Updated:               2024-09-19 22:01:29 +0800 CST
  Operation for persistentvolumeclaims upgrade/my-pvc5:
    Backup Item Action Plugin:  velero.io/csi-pvc-backupper
    Operation ID:               du-d72f2b7e-47c7-4d70-a250-821b9c811b76.104b928d-c684-49c410e5f
    Items to Update:
                           datauploads.velero.io velero/1.13-12-sr2lf
    Phase:                 Completed
    Progress:              8589934592 of 8589934592 complete (Bytes)
    Progress description:  Completed
    Created:               2024-09-19 21:59:19 +0800 CST
    Started:               2024-09-19 22:01:26 +0800 CST
    Updated:               2024-09-19 22:01:37 +0800 CST
Resource List:
  apps/v1/Deployment:
    - upgrade/hello-app
    - upgrade/hello-app1
    - upgrade/hello-app2
    - upgrade/hello-app3
    - upgrade/hello-app4
    - upgrade/hello-app5
  apps/v1/ReplicaSet:
    - upgrade/hello-app-fc8f88bf8
    - upgrade/hello-app1-6864477c56
    - upgrade/hello-app2-75c7bbfc98
    - upgrade/hello-app3-9ff446796
    - upgrade/hello-app4-565c8c9ffd
    - upgrade/hello-app5-57c7f4b44
  v1/ConfigMap:
    - upgrade/kube-root-ca.crt
  v1/Event:
    - upgrade/hello-app-fc8f88bf8-ws4fl.17f6971949b472de
    - upgrade/hello-app-fc8f88bf8-ws4fl.17f6971952e73f69
    - upgrade/hello-app-fc8f88bf8-ws4fl.17f6971957991295
    - upgrade/hello-app-fc8f88bf8-ws4fl.17f6a7795192fa65
    - upgrade/hello-app1-6864477c56-m59bn.17f697e345617e4a
    - upgrade/hello-app1-6864477c56-m59bn.17f697e34f7cb1be
    - upgrade/hello-app1-6864477c56-m59bn.17f697e355fb9c36
    - upgrade/hello-app1-6864477c56-m59bn.17f6a843387ed4ee
    - upgrade/hello-app2-75c7bbfc98-bvh2j.17f6971a34325726
    - upgrade/hello-app2-75c7bbfc98-bvh2j.17f6971a3e917a4a
    - upgrade/hello-app2-75c7bbfc98-bvh2j.17f6971a46704c25
    - upgrade/hello-app2-75c7bbfc98-bvh2j.17f6a77a4108d926
    - upgrade/hello-app3-9ff446796-9v6lw.17f6971ad89da54a
    - upgrade/hello-app3-9ff446796-9v6lw.17f6971ae1190839
    - upgrade/hello-app3-9ff446796-9v6lw.17f6971ae6402028
    - upgrade/hello-app3-9ff446796-9v6lw.17f6a77af2c089bf
    - upgrade/hello-app4-565c8c9ffd-7hbb2.17f6971b4c9476d7
    - upgrade/hello-app4-565c8c9ffd-7hbb2.17f6971b56dcfc04
    - upgrade/hello-app4-565c8c9ffd-7hbb2.17f6971b5d00e00b
    - upgrade/hello-app4-565c8c9ffd-7hbb2.17f6a77b26774e14
    - upgrade/hello-app5-57c7f4b44-msqgk.17f6971bee26e66f
    - upgrade/hello-app5-57c7f4b44-msqgk.17f6971bf9230570
    - upgrade/hello-app5-57c7f4b44-msqgk.17f6971bfdd86df5
    - upgrade/hello-app5-57c7f4b44-msqgk.17f6a77be41c395a
    - upgrade/velero-my-pvc-ndftk.17f6a97f8f98231b
    - upgrade/velero-my-pvc-ndftk.17f6a9807d83ac0a
    - upgrade/velero-my-pvc-ndftk.17f6a991343cd9a1
    - upgrade/velero-my-pvc1-skfx2.17f6a980e8badf5c
    - upgrade/velero-my-pvc1-skfx2.17f6a982071e4757
    - upgrade/velero-my-pvc1-skfx2.17f6a991fea20e34
    - upgrade/velero-my-pvc2-mnwbm.17f6a982367bec51
    - upgrade/velero-my-pvc2-mnwbm.17f6a9842a23a9b9
    - upgrade/velero-my-pvc2-mnwbm.17f6a9950594eb2a
    - upgrade/velero-my-pvc3-8bmzz.17f6a984a1ff3591
    - upgrade/velero-my-pvc3-8bmzz.17f6a9865ada2853
    - upgrade/velero-my-pvc3-8bmzz.17f6a99559018c14
    - upgrade/velero-my-pvc4-xlvlb.17f6a987198f2444
    - upgrade/velero-my-pvc4-xlvlb.17f6a989cc8d3342
    - upgrade/velero-my-pvc4-xlvlb.17f6a997c4e3ebad
    - upgrade/velero-my-pvc5-fntc9.17f6a989918d018c
    - upgrade/velero-my-pvc5-fntc9.17f6a98c209d3589
    - upgrade/velero-my-pvc5-fntc9.17f6a999dd4fbe63
  v1/Namespace:
    - upgrade
  v1/PersistentVolume:
    - pvc-104b928d-c684-49ca-8d35-e9e95459afb9
    - pvc-50874ede-48d6-4641-be12-a3bca2797d12
    - pvc-a47b1f33-e30a-4dd5-a142-d3a74c0bada2
    - pvc-e2f26717-7346-44e2-afad-6ef3f196b4ce
    - pvc-f2d95385-bb9e-4ac4-9a64-3c02d9cdfe3b
    - pvc-f61a4262-159c-44ba-a68d-b2dd13d81636
  v1/PersistentVolumeClaim:
    - upgrade/my-pvc
    - upgrade/my-pvc1
    - upgrade/my-pvc2
    - upgrade/my-pvc3
    - upgrade/my-pvc4
    - upgrade/my-pvc5
  v1/Pod:
    - upgrade/hello-app-fc8f88bf8-ws4fl
    - upgrade/hello-app1-6864477c56-m59bn
    - upgrade/hello-app2-75c7bbfc98-bvh2j
    - upgrade/hello-app3-9ff446796-9v6lw
    - upgrade/hello-app4-565c8c9ffd-7hbb2
    - upgrade/hello-app5-57c7f4b44-msqgk
  v1/ServiceAccount:
    - upgrade/default

Backup Volumes:
  Velero-Native Snapshots: <none included>

  CSI Snapshots:
    upgrade/my-pvc2:
      Data Movement:
        Operation ID: du-d72f2b7e-47c7-4d70-a250-821b9c811b76.50874ede-48d6-4644c5774
        Data Mover: velero
        Uploader Type: kopia
        Moved data Size (bytes): 0
        Result: failed
    upgrade/my-pvc3:
      Data Movement:
        Operation ID: du-d72f2b7e-47c7-4d70-a250-821b9c811b76.a47b1f33-e30a-4ddb1ceef
        Data Mover: velero
        Uploader Type: kopia
        Moved data Size (bytes): 8589934592
        Result: succeeded
    upgrade/my-pvc4:
      Data Movement:
        Operation ID: du-d72f2b7e-47c7-4d70-a250-821b9c811b76.e2f26717-7346-44e5002c3
        Data Mover: velero
        Uploader Type: kopia
        Moved data Size (bytes): 8589934592
        Result: succeeded
    upgrade/my-pvc5:
      Data Movement:
        Operation ID: du-d72f2b7e-47c7-4d70-a250-821b9c811b76.104b928d-c684-49c410e5f
        Data Mover: velero
        Uploader Type: kopia
        Moved data Size (bytes): 8589934592
        Result: succeeded
    upgrade/my-pvc:
      Data Movement:
        Operation ID: du-d72f2b7e-47c7-4d70-a250-821b9c811b76.f2d95385-bb9e-4ac796a03
        Data Mover: velero
        Uploader Type: kopia
        Moved data Size (bytes): 0
        Result: failed
    upgrade/my-pvc1:
      Data Movement:
        Operation ID: du-d72f2b7e-47c7-4d70-a250-821b9c811b76.f61a4262-159c-44bc04e8d
        Data Mover: velero
        Uploader Type: kopia
        Moved data Size (bytes): 0
        Result: failed

  Pod Volume Backups: <none included>

HooksAttempted:  0
HooksFailed:     0
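
The gap between the expected and the observed outcome can be made explicit by tallying the final DataUpload statuses for backup 1.13-12. The sketch below is only an illustration: the rows are copied from the last `k -n velero get dataupload` listing above, while the variable names and the parsing approach are assumptions, not part of Velero.

```python
from collections import Counter

# Rows for backup 1.13-12, copied from the final listing in this issue.
ROWS = """
1.13-12-52q7d   Canceled    2m26s                                default            3m32s
1.13-12-6csmq   Completed   62s       8589934592   8589934592    default            3m11s   gke-jxun-2-default-pool-0ec9c04d-mcjg
1.13-12-94kjp   Canceled    2m27s                                default            3m42s
1.13-12-ccfps   Canceled    2m21s                                default            3m47s
1.13-12-mv797   Completed   84s       8589934592   8589934592    default            3m21s   gke-jxun-2-default-pool-ae0c2179-kk61
1.13-12-sr2lf   Completed   54s       8589934592   8589934592    default            3m1s    gke-jxun-2-default-pool-ae0c2179-kk61
"""

# The first two whitespace-separated fields of each row are NAME and STATUS.
counts = Counter(line.split()[1] for line in ROWS.strip().splitlines())

# Expectation stated in the issue: every DataUpload of the interrupted
# backup should end as Canceled.
expectation_met = set(counts) == {"Canceled"}

print(dict(counts), "expectation met:", expectation_met)
```

Half of the DataUploads ran to completion after the restart, which is why the expectation does not hold.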

If you are using velero v1.7.0+:
Please use velero debug --backup <backupname> --restore <restorename> to generate the support bundle and attach it to this issue. For more options, refer to velero debug --help.

bundle-2024-09-19-22-59-51.tar.gz

ywk253100 commented 2 months ago

A possible cause: when the Velero server is restarted with kubectl delete pod or kubectl rollout restart, two Velero servers run at the same time for a short period (the old one is Terminating but not yet completely deleted). The new Velero server marks the Backup as Failed on startup, while the old one updates the Backup's status to WaitingForPluginOperations. Then, because the DataUploads are canceled, the Backup is finally updated to PartiallyFailed.
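
The sequence described in the comment above can be replayed with a toy model. This is a sketch of the suspected ordering only: the phase names come from this issue, but the `replay` helper, the writer labels, and the exact order of writes are hypothetical and do not reflect Velero's controller code.

```python
def replay(writes):
    """Return the Backup phase after each status write; the last write wins."""
    phases = []
    for _writer, phase in writes:
        phases.append(phase)
    return phases

# Assumed order of status writes while both server pods are alive.
writes = [
    ("old-server", "InProgress"),
    ("new-server", "Failed"),                      # startup marks the in-flight backup Failed
    ("old-server", "WaitingForPluginOperations"),  # old pod is still Terminating
    ("new-server", "PartiallyFailed"),             # canceled DataUploads finalize the backup
]

history = replay(writes)
final_phase = history[-1]
print(" -> ".join(history))
```

Under this assumed ordering the Backup is observed as Failed first and ends as PartiallyFailed, matching the transition reported in the issue.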

ywk253100 commented 2 months ago

One possible improvement for the upgrade use case is to change the deployment strategy of the Velero server from RollingUpdate to Recreate. This ensures the old Velero server is stopped before the new one is created.
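
As a rough sketch of what that change could look like, the snippet below builds a merge-patch body for the Deployment and prints a matching kubectl command. The namespace and deployment name (both `velero`) are taken from the session earlier in this issue; the helper name is hypothetical, and this is not an official Velero procedure.

```python
import json

def recreate_strategy_patch():
    """Build a merge-patch body that switches a Deployment to the Recreate strategy."""
    return {
        "spec": {
            "strategy": {
                "type": "Recreate",
                # rollingUpdate settings are rejected alongside Recreate,
                # so the patch clears them explicitly.
                "rollingUpdate": None,
            }
        }
    }

patch_json = json.dumps(recreate_strategy_patch())

# Assumed names: namespace "velero" and deployment "velero".
print(f"kubectl -n velero patch deployment velero --type merge -p '{patch_json}'")
```

With Recreate, Kubernetes terminates the existing pod before creating its replacement, at the cost of a short window with no Velero server running.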

sseago commented 2 months ago

@ywk253100 Another possibility is that the backup happened to move from InProgress to WaitingForPluginOperations just as the pod was being killed, so the transition happened right before the pod was killed. But yes, we probably do want the old server stopped before starting the new one. If both are running at the same time, even for a brief period, unpredictable things can happen.