vmware-tanzu / velero

Backup and migrate Kubernetes applications and their persistent volumes
https://velero.io
Apache License 2.0

"Failed to find previous kopia snapshot manifests" for velero backup using csi snapshot data mover #8222

Open dharanui opened 2 months ago

dharanui commented 2 months ago

What steps did you take and what happened: Velero version 1.12 (also tried upgrading to 1.14.1), AWS plugin 1.10.1.

A scheduled backup runs every day with the CSI data mover.
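
For reference, the schedule is of roughly this shape (a minimal sketch; the schedule name matches the backup names in the logs below, while the cron expression, namespaces, and TTL are illustrative):

apiVersion: velero.io/v1
kind: Schedule
metadata:
  name: daily-bkp
  namespace: velero
spec:
  schedule: "30 2 * * *"      # illustrative; once a day
  template:
    snapshotMoveData: true    # CSI snapshot data mover
    includedNamespaces:
    - "*"
    storageLocation: default
    ttl: 336h0m0s             # illustrative retention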

Backups intermittently end up in PartiallyFailed status with a few DataUploads in "Failed" state. Describing the failed DataUploads shows the following error.

data path backup failed: Failed to run kopia backup: Failed to find previous kopia snapshot manifests for si default@default:snapshot-data-upload-download/kopia/elastic-system/elasticsearch-data-logging-elasticsearch-es-master-nodeset-2: unable to find manifest entries: failed to get manifests with labels map[hostname:default path:snapshot-data-upload-download/kopia/elastic-system/elasticsearch-data-logging-elasticsearch-es-master-nodeset-2 type:snapshot username:default]: error to find manifests: unable to load manifest contents: error loading manifest content: error getting cached content from blob "q78b42bb7bf6a17c91da3fdd28388d193-s8b43049f3444df3912c": failed to get blob with ID q78b42bb7bf6a17c91da3fdd28388d193-s8b43049f3444df3912c: BLOB not found

What did you expect to happen: All DataUploads complete and backups succeed every day.

The following information will help us better understand what's going on:

If you are using velero v1.7.0+:
Please use velero debug --backup <backupname> --restore <restorename> to generate the support bundle and attach it to this issue; for more options, refer to velero debug --help. Attached: bundle-2024-09-17-11-32-20.tar.gz

Anything else you would like to add:

Environment: Production environment

Lyndon-Li commented 2 months ago

It looks like the repo content with ID q78b42bb7bf6a17c91da3fdd28388d193-s8b43049f3444df3912c was missing, so subsequent backups for the same snapshot represented by this content always failed.

Take the volume datadir-active-tracking-mongodb-secondary-0 as an example:

So it looks like the content was not found at one time but was found at another time.

The "BLOB not found" error was reported by the object store, so please share details about your object store. You can also check whether object qc9e6d67bcf92c91ed614a2d63a27ce00-s670d20f3fffaf54212c exists in your object store.
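
For example, one way to check is to list the repository prefix in S3 and grep for the blob (a sketch; it assumes the bucket from the BackupRepository shared later in the thread and Velero's default kopia/<volume-namespace>/ key layout):

aws s3 ls s3://ot3-qa-patch-velero-backup/kopia/<volume-namespace>/ \
  | grep qc9e6d67bcf92c91ed614a2d63a27ce00-s670d20f3fffaf54212c

If the blob shows up in the listing, the object exists in the bucket even though kopia reported "BLOB not found" at backup time.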

dharanui commented 2 months ago

"Therefore, looks like the content was not found at one time but found at the other time." -> True, I have triggered a backup almost immediately after the backup is failed at "2024-09-17T02:09:58Z".

We use AWS S3 as the object store. I am not sure how to check for the object in the object store, but if it were not present, how would the subsequent backup succeed?

backup repository example:

apiVersion: velero.io/v1
kind: BackupRepository
metadata:
  creationTimestamp: "2024-08-29T11:06:27Z"
  generateName: active-tracking-default-kopia-
  generation: 480
  labels:
    velero.io/repository-type: kopia
    velero.io/storage-location: default
    velero.io/volume-namespace: active-tracking
  name: active-tracking-default-kopia-svqtc
  namespace: velero
  resourceVersion: "218166016"
  uid: 5175b990-5122-4dc6-8813-11e3db56ebb6
spec:
  backupStorageLocation: default
  maintenanceFrequency: 1h0m0s
  repositoryType: kopia
  resticIdentifier: s3:s3-eu-west-1.amazonaws.com/ot3-qa-patch-velero-backup/restic/active-tracking
  volumeNamespace: active-tracking
status:
  lastMaintenanceTime: "2024-09-18T08:16:56Z"
  phase: Ready

Lyndon-Li commented 2 months ago

I also think the object should exist all the time. But the object store just returned 404 (BLOB not found for Kopia) at the time when the problem happened. I think that is the problem.

dharanui commented 2 months ago

corresponding log example from velero pod: "level=error msg="data path backup failed: Failed to run kopia backup: Failed to find previous kopia snapshot manifests for si default@default:snapshot-data-upload-download/kopia/feature-flag-proxy/redis-data-feature-flag-proxy-redis-replicas-0: unable to find manifest entries: failed to get manifests with labels map[hostname:default path:snapshot-data-upload-download/kopia/feature-flag-proxy/redis-data-feature-flag-proxy-redis-replicas-0 type:snapshot username:default]: error to find manifests: unable to load manifest contents: error loading manifest content: error getting cached content from blob \"qf139466a8e99d0a9065f927a84f0aee4-sd5191d7c73ebc3bb12c\": failed to get blob with ID qf139466a8e99d0a9065f927a84f0aee4-sd5191d7c73ebc3bb12c: BLOB not found, plugin: velero.io/csi-pvc-backupper" backup=velero/daily-bkp-20240918033004 logSource="pkg/controller/backup_controller.go:663"

Is there any way to tell the plugin to retry in case of a 404? Or what other places should we check?

dharanui commented 2 months ago

This seems to happen quite frequently across multiple clusters. We tried Velero versions 1.12 and 1.14, along with the corresponding aws-plugin versions.

blackpiglet commented 2 months ago

Another thing worth checking: please verify whether the snapshots can be shared across clusters.

dharanui commented 2 months ago

The VolumeSnapshots will be deleted after the DataUpload is successful, right? In any case, these (volumes and snapshots) can be shared across clusters; we test this on a daily basis.

blackpiglet commented 2 months ago

The VolumeSnapshots are reset to retain the snapshot before deletion, so that is not the problem. I meant to ask whether the cloud provider supports sharing the snapshots between clusters, but you already confirmed in the previous reply that this is not an issue in your environment.

I think we can use the kopia CLI to check whether the snapshot content q78b42bb7bf6a17c91da3fdd28388d193-s8b43049f3444df3912c exists in the object store.
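
A sketch of what that check could look like (assumptions: Velero keeps each kopia repository under kopia/<volume-namespace>/ in the BSL bucket and uses a static default repository password unless a custom one is configured):

kopia repository connect s3 \
  --bucket=ot3-qa-patch-velero-backup \
  --prefix=kopia/elastic-system/ \
  --region=eu-west-1 \
  --access-key=<AWS_ACCESS_KEY_ID> \
  --secret-access-key=<AWS_SECRET_ACCESS_KEY> \
  --password=static-passw0rd   # assumed Velero default; use the custom repo password if one is set
kopia blob show q78b42bb7bf6a17c91da3fdd28388d193-s8b43049f3444df3912c > /dev/null && echo "blob exists"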

dharanui commented 2 months ago

@blackpiglet / @Lyndon-Li do we have any workaround for this at the moment?

Found the following warnings on the node-agent; are they related?

node-agent time="2024-09-24T02:02:58Z" level=warning msg="active indexes [xr0_7_5fc616defe00f04ddd4cbb1444f7e8c7-sc4455fdf2d353fa5-c1 xn9_07592f6e33cbfb35b171e26f3450747c-sc5a932c3e062da0f12c-c1 xn9_10f827950e25c36b2405e63cf75a226b-s65ef87e5c616798712c-c1 xn9_14ddd7437f5720c24b237bc0deb22bf8-se901f10811af4f5f12d-c1 xn9_16e195a514cbba289a0ff3ef06db5d6f-s9614bd11b68aa6b212c-c1 xn9_1b0cb5d825399323a430a121f34476eb-s588095dfbac520c412c-c1 xn9_1e9549c8004c8c7268bad88419beab36-s9e4c96b059eddf3712d-c1 xn9_363917e4ac30525628976b596ef064a6-s5db7801afbadff7212c-c1 xn9_3a694490dfde37d62dae07b9b44ed0c2-sd1bc0aaf5b64e17112d-c1 xn9_3c0d542715d47c6c98902b309a90b251-sa8791d496c562fc112c-c1 xn9_3e6be1beb10d8d4e1c912caab27c3e5d-s755589140036a22012c-c1 xn9_53b9bf91278eb5c12346df5d63674faf-sc9f612ebdbd2a0ce12c-c1 xn9_639b5b6a8082b62c25c4a3935d50b6d7-sde1a45d58364560812d-c1 xn9_65fab6b9e1f26eaec6aa67521b1f78af-s093e4037cd9fc77b12c-c1 xn9_7ce07bd6999c9b6c1faa68af67987f87-s035cfd62c5d3644e12d-c1 xn9_81ac4312bb80f0d1186b231087b23f05-s7b90ce81329cb91812c-c1 xn9_8209b3b790a161761c84389a5f194002-s62f35ccd1cb1a0cb12c-c1 xn9_cd23b9ad0846ddf02f3d760bf5ace8dc-s7cdcedd28e8b24a612c-c1 xn9_cda8b06ba4a871a6075bae991eda8111-s0d0bce2d983eadcc12c-c1 xn9_d17d9345eea8d30e44182a388c48efcd-s9cdf0b0cdbd99eaa12d-c1 xn9_d3257f5446f4da22f224f753f68ee432-sdc539f6c6c87654012c-c1 xn9_dd532a5b43313a79639213de2464e8b8-s0db90ab1c3d9fb2412d-c1 xn9_e41bfe5b368eadec73cca2640ca5a9d2-s4d49b708bddaff8312c-c1 xn9_e838905b2a045727172d1518c04d9037-s43808c5f5d632ecf12c-c1 xn9_e9fe6494f98a8bc416517fe754a40383-sa2068559f11620a612d-c1 xn9_fb57dc5c187a4de3eec07fb7911b5b3d-secb786f3edaf429e12c-c1 xn9_fe21260253354ede021f822921b376c1-s0c06d9592eb3d69b12c-c1 xn10_00609770d0c78ff21286ce008d6b8ed7-sfe0a567df5eff63b12d-c1 xn10_08418a307bad786065e08baa65bb8d74-s494cde92e1b5f13012d-c1 xn10_092752e1ffa9226f61fce1308201541e-sfcd5cedcfd2cb4d212d-c1 xn10_1dce7427ad64f225d65692ccdedd1e3a-s2a85c940c8f73af612d-c1 xn10_3819707c2f4c6e6c34114cd409345f34-sb168d866e512d21312d-c1 xn10_4670b1f09e6f5394fff9c8ff45071509-sa2cad73ac0e9fdb512d-c1 xn10_4d74081a46fdbc4903964b5a23e7f921-s2fa4528c987552e712d-c1 xn10_4ff2621b3827a3ceb351ad7ba37d0a9e-s4db7aa7af554dc7a12d-c1 xn10_5bfb3687c48dd1abbc6add70c42d1b3f-sd41468b08298693c12d-c1 xn10_89b031dbcc0eec199c40d95eeaba16ee-s3650b95ad5c3489e12d-c1 xn10_8a7222cc2486e07f8cdf1e3a783da8e8-se76da0199f2b97cc12d-c1 xn10_95349a5f4bed81e32ffef280da1a14e7-scfaa896c1d9d776a12d-c1 xn10_9c586fe9a6e4a504bd9077b1620b2c21-s74c14ec8fe45d41212d-c1 xn10_a82e80af983db869d21fa7288b868f9f-s4f92e15e30b2f9b512d-c1 xn10_ad008aaf537636d340ebb263e914cbae-sbe46290e8f673d2c12d-c1 xn10_b0e2829139c853137c673e3a86635fcf-sd31964a47c2bd32d12d-c1 xn10_bb905a8b8e266cd6994271d6065700f4-sbae8b502911ff4c312d-c1 xn10_dd76752723fb1d4deda22600d2868270-se11374fe4c1a0d8412d-c1 xn10_df433df44424bd5c167d7eee7948921a-scb663fe2e899520a12d-c1 xn10_ee77f8f18fa5207191d5e1d8b718bad5-sd823c3c30eae664a12d-c1] deletion watermark 2024-09-22 20:59:27 +0000 UTC" controller=dataupload dataupload=velero/daily-bkp-20240924020030-9qz56 logModule=kopia/kopia/format logSource="pkg/kopia/kopia_log.go:101" logger name="[index-blob-manager]" sublevel=error

Lyndon-Li commented 2 months ago

The warning message is not relevant. The weird thing is that the problem happens in your env only (we don't see other users report this problem), and even in your env, the problem does not always happen.

So we need more info to find the root cause. Please try to answer the questions below:

  1. Have you enabled any advanced features of AWS S3?
  2. Can you restore from every backup when you see this problem?
  3. Have you deleted any backups before the problem happened?
  4. Even if a blob is deleted, e.g., due to backup deletion, it is not removed from the object store immediately; instead, it is removed by a maintenance job after a safe enough time (e.g., 24 hours). So can you check when the missing snapshot was generated? (See the sketch below.)
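
Once connected to the repository with the kopia CLI (as sketched in an earlier comment), the maintenance schedule and the creation time of the blob can be checked, for example (a sketch):

kopia maintenance info
kopia blob list | grep q78b42bb7bf6a17c91da3fdd28388d193-s8b43049f3444df3912c
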
dharanui commented 2 months ago

Hi @Lyndon-Li

  1. No advanced features enabled for S3.
  2. Yes.
  3. Yes, we have deleted multiple backups: we created test backups to debug a few issues and deleted them once done. Also, our backups are retained for 14 days, so the ones older than 14 days are deleted via the TTL setting.
  4. I guess the maintenance job by default runs every 1 hour for kopia, not every 24 hours. Could that be a potential problem?

dharanui commented 1 month ago

Hi @Lyndon-Li, we had a cluster where we did not do any manual deletion of backups, but we still saw this error there. We also increased the maintenance frequency to 24h. One observation is that this error is not as frequent in other clusters as in the production cluster, but of course the production cluster is the most important one for us. The only difference is the amount of data.
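
One way to apply such a change per repository is a patch like this (a sketch, using the BackupRepository shown earlier):

kubectl -n velero patch backuprepository active-tracking-default-kopia-svqtc \
  --type merge -p '{"spec":{"maintenanceFrequency":"24h0m0s"}}'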

Lyndon-Li commented 1 month ago

For this question:

Can you restore from every backup when you see this problem?

If the answer is yes, it means the repository snapshot is not actually missing; otherwise, the restore would fail. Therefore, I still suspect it is an intermittent S3 API problem, so a retry may fix it. But we need to find some way to prove this suspicion.
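
One possible way to gather evidence is to probe the blob around the backup window and log any failures (a sketch; the key layout and blob ID are placeholders):

while true; do
  aws s3api head-object \
    --bucket ot3-qa-patch-velero-backup \
    --key "kopia/<volume-namespace>/<blob-id>" > /dev/null \
    || echo "$(date -u) head-object failed"
  sleep 60
done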

dharanui commented 1 month ago

My bad, we only restore a single namespace whose DataUpload has completed. The one with the failed DataUpload cannot be restored. The DataUpload goes to the Accepted state; once the VolumeSnapshot is ready, it goes to Prepared and then fails. The data upload is not happening at all. Corresponding node-agent logs for one such DataUpload:

node-agent-xzxgq node-agent time="2024-10-09T02:15:11Z" level=info msg="Exposed snapshot is ready and creating data path routine" controller=dataupload dataupload=velero/daily-bkp-20241009020033-4m5h5 logSource="pkg/controller/data_upload_controller.go:258"
node-agent-xzxgq node-agent time="2024-10-09T02:15:11Z" level=info msg="Data upload is marked as in progress" controller=dataupload dataupload=velero/daily-bkp-20241009020033-4m5h5 logSource="pkg/controller/data_upload_controller.go:284"
node-agent-xzxgq node-agent time="2024-10-09T02:15:11Z" level=info msg="Run cancelable dataUpload" controller=dataupload dataupload=velero/daily-bkp-20241009020033-4m5h5 logSource="pkg/controller/data_upload_controller.go:330"
node-agent-xzxgq node-agent time="2024-10-09T02:15:11Z" level=info msg="Got volume dir" controller=dataupload dataupload=velero/daily-bkp-20241009020033-4m5h5 logSource="pkg/exposer/host_path.go:48" pod UID=163a5148-bfb3-4b69-9823-1dec50872a9b pod name=daily-bkp-20241009020033-4m5h5 volDir=pvc-98cf3ea8-9328-4bd5-837b-2c961f688e6b/mount volume=145239de-a210-4013-bc7d-284b951ee1af
node-agent-xzxgq node-agent time="2024-10-09T02:15:11Z" level=info msg="Found path matching glob" controller=dataupload dataupload=velero/daily-bkp-20241009020033-4m5h5 logSource="pkg/exposer/host_path.go:68" path="/host_pods/163a5148-bfb3-4b69-9823-1dec50872a9b/volumes/kubernetes.io~csi/pvc-98cf3ea8-9328-4bd5-837b-2c961f688e6b/mount" pod UID=163a5148-bfb3-4b69-9823-1dec50872a9b pod name=daily-bkp-20241009020033-4m5h5 volume=145239de-a210-4013-bc7d-284b951ee1af
node-agent-xzxgq node-agent time="2024-10-09T02:15:12Z" level=warning msg="active indexes [xr0_7_84846fd7cc32b00fa4316eac1e20db72-sb7fcd8d42039be18-c1 xr8_15_7b11bfb73873562d88271496e6c1b679-sb1b1112075daef90-c1 xs16_2bbaddef226d82434a22149961920e78-s62d41a382d4535ca-c1 xs17_1c39a21660639ef45ba2cc3c58a2cf25-s0fd0a71432e054fd-c1 xs18_6132fd91e15cb7f46817d115a3b4ae3b-s436414805fbfc0aa-c1 xn21_173a6f72e027d6888451dd0f5a6cda8e-saa09c65e841ff1e812d-c1 xn21_22ff5eab22781ccf94a89d7494ad5859-s7a9921e94fa8480912d-c1 xn21_24f6fd0fd274b56914db7f868f9ece0b-s73a82e8615c6a9db12d-c1 xn21_29b0b8786c8cc03c2d2d9961e15f059b-s531ed460d509a9da12d-c1 xn21_2f0d9f474daeba513c6d23a688a6714a-s382d625348e91fdb12d-c1 xn21_49863f2f55bc69bb284332fadc764b4c-s6338c9e8fcdc008e12d-c1 xn21_4e8c03f34c1bd60cf69f0937a3aab842-sddb17f391444081112d-c1 xn21_54015f93e74f681e051727a191e8e255-s1701224ea10a9a6e12d-c1 xn21_5ea4a9b43741958c25b23cc761ff8251-s783f40d755bfa63f12d-c1 xn21_75013d00120378b3aec3a22276877061-s24cd8bd367ad33d612d-c1 xn21_a61c404a0213a92d3ee8db2e28a35dd6-s6be0a35f810f81b912d-c1 xn21_afd955cde2be6d83387dfc8a1c0dac0a-s505acb1abe49bc2c12d-c1 xn21_b282a7b45fcaeaa86ef456489b1c833f-sdeecdd07ef64e5f412d-c1 xn21_c309dc18c2642c1fcc5666aae504b0fc-s01763ba0fb0f134612d-c1 xn21_da48c980eb3fa44b2da7cdca4fa4aa1e-sd7de2efd7d1fe3d512d-c1 xn21_de03cd1f532f579e978df2093cc9666f-s830fae6c5f69286312d-c1 xn21_e1b6342274b334993060438f3ea72c4a-sbedc98d4b9c0246b12d-c1 xn21_e3f779d92a06186952729f0c876b89fa-s6804b8214cdb089d12d-c1 xn21_eed63f9ac30738915e688af81b08bed0-sff092b35e86d7f5412d-c1 xn21_f58d7bf065d12d1d5751fc6c702da50a-s1ef543d920a044b412d-c1 xn22_0c955b3d4dc6deaadf2f98bb0cd3ba46-s61a04a4b8ca16bd812d-c1 xn22_3f51316ef40a4056674840e3d3880d90-s2105294b2fcc651a12d-c1 xn22_a8ae76caecea418d588dfe12af6cb564-s3e34f1f30440989d12d-c1 xn22_bb182b1341393233b0330e64c5d9aa58-sbb40caf3611baa2212d-c1 xn22_d8b165d11a05c1135020821dbdaee30a-s96d351948d4a5a8312d-c1 xn22_ff61b04a77d7b756452d81aa62b44727-s85d5a317f30ed81812d-c1] deletion watermark 2024-10-08 01:07:32 +0000 UTC" controller=dataupload dataupload=velero/daily-bkp-20241009020033-4m5h5 logModule=kopia/kopia/format logSource="pkg/kopia/kopia_log.go:101" sublevel=error
node-agent-xzxgq node-agent time="2024-10-09T02:15:12Z" level=info msg="Opening backup repo" controller=dataupload dataupload=velero/daily-bkp-20241009020033-4m5h5 logSource="pkg/uploader/provider/kopia.go:78" repoUID=922a870e-4d96-424f-ad21-75c6f471582c
node-agent-xzxgq node-agent time="2024-10-09T02:15:12Z" level=warning msg="active indexes [xr0_7_84846fd7cc32b00fa4316eac1e20db72-sb7fcd8d42039be18-c1 xr8_15_7b11bfb73873562d88271496e6c1b679-sb1b1112075daef90-c1 xs16_2bbaddef226d82434a22149961920e78-s62d41a382d4535ca-c1 xs17_1c39a21660639ef45ba2cc3c58a2cf25-s0fd0a71432e054fd-c1 xs18_6132fd91e15cb7f46817d115a3b4ae3b-s436414805fbfc0aa-c1 xn21_173a6f72e027d6888451dd0f5a6cda8e-saa09c65e841ff1e812d-c1 xn21_22ff5eab22781ccf94a89d7494ad5859-s7a9921e94fa8480912d-c1 xn21_24f6fd0fd274b56914db7f868f9ece0b-s73a82e8615c6a9db12d-c1 xn21_29b0b8786c8cc03c2d2d9961e15f059b-s531ed460d509a9da12d-c1 xn21_2f0d9f474daeba513c6d23a688a6714a-s382d625348e91fdb12d-c1 xn21_49863f2f55bc69bb284332fadc764b4c-s6338c9e8fcdc008e12d-c1 xn21_4e8c03f34c1bd60cf69f0937a3aab842-sddb17f391444081112d-c1 xn21_54015f93e74f681e051727a191e8e255-s1701224ea10a9a6e12d-c1 xn21_5ea4a9b43741958c25b23cc761ff8251-s783f40d755bfa63f12d-c1 xn21_75013d00120378b3aec3a22276877061-s24cd8bd367ad33d612d-c1 xn21_a61c404a0213a92d3ee8db2e28a35dd6-s6be0a35f810f81b912d-c1 xn21_afd955cde2be6d83387dfc8a1c0dac0a-s505acb1abe49bc2c12d-c1 xn21_b282a7b45fcaeaa86ef456489b1c833f-sdeecdd07ef64e5f412d-c1 xn21_c309dc18c2642c1fcc5666aae504b0fc-s01763ba0fb0f134612d-c1 xn21_da48c980eb3fa44b2da7cdca4fa4aa1e-sd7de2efd7d1fe3d512d-c1 xn21_de03cd1f532f579e978df2093cc9666f-s830fae6c5f69286312d-c1 xn21_e1b6342274b334993060438f3ea72c4a-sbedc98d4b9c0246b12d-c1 xn21_e3f779d92a06186952729f0c876b89fa-s6804b8214cdb089d12d-c1 xn21_eed63f9ac30738915e688af81b08bed0-sff092b35e86d7f5412d-c1 xn21_f58d7bf065d12d1d5751fc6c702da50a-s1ef543d920a044b412d-c1 xn22_0c955b3d4dc6deaadf2f98bb0cd3ba46-s61a04a4b8ca16bd812d-c1 xn22_3f51316ef40a4056674840e3d3880d90-s2105294b2fcc651a12d-c1 xn22_a8ae76caecea418d588dfe12af6cb564-s3e34f1f30440989d12d-c1 xn22_bb182b1341393233b0330e64c5d9aa58-sbb40caf3611baa2212d-c1 xn22_d8b165d11a05c1135020821dbdaee30a-s96d351948d4a5a8312d-c1 xn22_ff61b04a77d7b756452d81aa62b44727-s85d5a317f30ed81812d-c1] deletion watermark 2024-10-08 01:07:32 +0000 UTC" controller=dataupload dataupload=velero/daily-bkp-20241009020033-4m5h5 logModule=kopia/kopia/format logSource="pkg/kopia/kopia_log.go:101" sublevel=error
node-agent-xzxgq node-agent time="2024-10-09T02:15:12Z" level=info msg="FileSystemBR is initialized" bsl=default controller=dataupload dataupload=velero/daily-bkp-20241009020033-4m5h5 jobName=daily-bkp-20241009020033-4m5h5 logSource="pkg/datapath/file_system.go:110" repository=kopia source namespace=active-tracking uploader=kopia
node-agent-xzxgq node-agent time="2024-10-09T02:15:12Z" level=info msg="fs init" controller=dataupload dataupload=velero/daily-bkp-20241009020033-4m5h5 logSource="pkg/controller/data_upload_controller.go:341" path="/host_pods/163a5148-bfb3-4b69-9823-1dec50872a9b/volumes/kubernetes.io~csi/pvc-98cf3ea8-9328-4bd5-837b-2c961f688e6b/mount"
node-agent-xzxgq node-agent time="2024-10-09T02:15:12Z" level=info msg="Async fs backup data path started" controller=dataupload dataupload=velero/daily-bkp-20241009020033-4m5h5 logSource="pkg/controller/data_upload_controller.go:350" path="/host_pods/163a5148-bfb3-4b69-9823-1dec50872a9b/volumes/kubernetes.io~csi/pvc-98cf3ea8-9328-4bd5-837b-2c961f688e6b/mount"
node-agent-xzxgq node-agent time="2024-10-09T02:15:12Z" level=info msg="Reconcile daily-bkp-20241009020033-4m5h5" controller=dataupload dataupload=velero/daily-bkp-20241009020033-4m5h5 logSource="pkg/controller/data_upload_controller.go:113"
node-agent-xzxgq node-agent time="2024-10-09T02:15:12Z" level=info msg="Data upload is in progress" controller=dataupload dataupload=velero/daily-bkp-20241009020033-4m5h5 logSource="pkg/controller/data_upload_controller.go:292"
node-agent-xzxgq node-agent time="2024-10-09T02:15:12Z" level=info msg="Starting backup" controller=dataupload dataupload=velero/daily-bkp-20241009020033-4m5h5 logSource="pkg/uploader/provider/kopia.go:144" parentSnapshot= path="/host_pods/163a5148-bfb3-4b69-9823-1dec50872a9b/volumes/kubernetes.io~csi/pvc-98cf3ea8-9328-4bd5-837b-2c961f688e6b/mount" realSource=active-tracking/datadir-active-tracking-mongodb-primary-0
node-agent-xzxgq node-agent time="2024-10-09T02:15:12Z" level=info msg="Start to snapshot..." controller=dataupload dataupload=velero/daily-bkp-20241009020033-4m5h5 logSource="pkg/uploader/kopia/snapshot.go:230" parentSnapshot= path="/host_pods/163a5148-bfb3-4b69-9823-1dec50872a9b/volumes/kubernetes.io~csi/pvc-98cf3ea8-9328-4bd5-837b-2c961f688e6b/mount" realSource=active-tracking/datadir-active-tracking-mongodb-primary-0
node-agent-xzxgq node-agent time="2024-10-09T02:15:12Z" level=info msg="Searching for parent snapshot" controller=dataupload dataupload=velero/daily-bkp-20241009020033-4m5h5 logSource="pkg/uploader/kopia/snapshot.go:245" parentSnapshot= path="/host_pods/163a5148-bfb3-4b69-9823-1dec50872a9b/volumes/kubernetes.io~csi/pvc-98cf3ea8-9328-4bd5-837b-2c961f688e6b/mount" realSource=active-tracking/datadir-active-tracking-mongodb-primary-0
node-agent-xzxgq node-agent time="2024-10-09T02:15:12Z" level=error msg="Async fs backup data path failed" dataupload=daily-bkp-20241009020033-4m5h5 error="Failed to run kopia backup: Failed to find previous kopia snapshot manifests for si default@default:snapshot-data-upload-download/kopia/active-tracking/datadir-active-tracking-mongodb-primary-0: unable to find manifest entries: failed to get manifests with labels map[hostname:default path:snapshot-data-upload-download/kopia/active-tracking/datadir-active-tracking-mongodb-primary-0 type:snapshot username:default]: error to find manifests: unable to load manifest contents: error loading manifest content: error getting cached content: failed to get blob with ID q504b6cb0fdaf00c81b2b2ecc15e9be36-s032bc87b4027220412d: BLOB not found" error.file="/go/pkg/mod/github.com/project-velero/kopia@v0.13.0-velero.1/repo/blob/storage.go:186" error.function=github.com/kopia/kopia/repo/blob.init logSource="pkg/controller/data_upload_controller.go:403"
node-agent-xzxgq node-agent time="2024-10-09T02:15:12Z" level=info msg="Action finished" controller=dataupload dataupload=velero/daily-bkp-20241009020033-4m5h5 logSource="pkg/uploader/provider/kopia.go:91"
node-agent-xzxgq node-agent time="2024-10-09T02:15:13Z" level=warning msg="Failed to patch dataupload with err Failed to run kopia backup: Failed to find previous kopia snapshot manifests for si default@default:snapshot-data-upload-download/kopia/active-tracking/datadir-active-tracking-mongodb-primary-0: unable to find manifest entries: failed to get manifests with labels map[hostname:default path:snapshot-data-upload-download/kopia/active-tracking/datadir-active-tracking-mongodb-primary-0 type:snapshot username:default]: error to find manifests: unable to load manifest contents: error loading manifest content: error getting cached content: failed to get blob with ID q504b6cb0fdaf00c81b2b2ecc15e9be36-s032bc87b4027220412d: BLOB not found" dataupload=daily-bkp-20241009020033-4m5h5 error="Failed to run kopia backup: Failed to find previous kopia snapshot manifests for si default@default:snapshot-data-upload-download/kopia/active-tracking/datadir-active-tracking-mongodb-primary-0: unable to find manifest entries: failed to get manifests with labels map[hostname:default path:snapshot-data-upload-download/kopia/active-tracking/datadir-active-tracking-mongodb-primary-0 type:snapshot username:default]: error to find manifests: unable to load manifest contents: error loading manifest content: error getting cached content: failed to get blob with ID q504b6cb0fdaf00c81b2b2ecc15e9be36-s032bc87b4027220412d: BLOB not found" error.file="/go/pkg/mod/github.com/project-velero/kopia@v0.13.0-velero.1/repo/blob/storage.go:186" error.function=github.com/kopia/kopia/repo/blob.init logSource="pkg/controller/data_upload_controller.go:410"
node-agent-xzxgq node-agent time="2024-10-09T02:15:13Z" level=info msg="Reconcile daily-bkp-20241009020033-4m5h5" controller=dataupload dataupload=velero/daily-bkp-20241009020033-4m5h5 logSource="pkg/controller/data_upload_controller.go:113"
node-agent-xzxgq node-agent time="2024-10-09T02:15:13Z" level=info msg="FileSystemBR is closed" controller=dataupload dataupload=velero/daily-bkp-20241009020033-4m5h5 logSource="pkg/datapath/file_system.go:129" user=daily-bkp-20241009020033-4m5h5
node-agent-xzxgq node-agent time="2024-10-09T02:15:13Z" level=info msg="Reconcile daily-bkp-20241009020033-4m5h5" controller=dataupload dataupload=velero/daily-bkp-20241009020033-4m5h5 logSource="pkg/controller/data_upload_controller.go:113"

Is this somehow related to this kopia issue: https://kopia.discourse.group/t/can-not-access-repository-failed-to-get-blob/1274/17? If so, how do we resolve it? The fact that this error is intermittent suggests the snapshot is present. And this happens frequently in clusters with fairly large data (retention of 14 days), which means a larger repository compared to smaller clusters (retention of 3 days).

dharanui commented 1 month ago

Hi @Lyndon-Li, one more thing: if I delete all the BackupRepository CRs and trigger a backup, it completes successfully. Before the deletion of the backup repositories, a few DataUploads had been failing for the last week. Then, after a few days, they start failing again.
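
For reference, the workaround described amounts to something like the following (a sketch; the label comes from the BackupRepository shown earlier, and this only removes the CRs, not the repository data):

kubectl -n velero delete backuprepositories -l velero.io/repository-type=kopia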

Lyndon-Li commented 1 month ago

That further indicates the repo snapshot is not missing. Deleting the BackupRepository CR doesn't change the repo's data; it just reconnects the repo. If the snapshot were missing from the repo, deleting the BackupRepository CR would not solve the problem.

dharanui commented 1 month ago

But if this is an intermittent connection issue, how are other DataUploads of the same repo (other volumes in the same namespace) completing at the same time?

dharanui commented 1 month ago

Is it some issue with IRSA? Because I can confirm that for the last few days I have been deleting BackupRepository CRs frequently and I don't see this issue. But this is certainly not the way we would like to go forward.