dharanui opened this issue 2 months ago
It looks like the repo content with ID q78b42bb7bf6a17c91da3fdd28388d193-s8b43049f3444df3912c was missing, so all subsequent backups for the snapshot represented by this content failed. Take the volume datadir-active-tracking-mongodb-secondary-0 as an example:
- 2024-09-16T02:14:24Z: a backup with snapshot ff8935066e0e428d4f2905013353fb14
- 2024-09-17T02:09:58Z: the problem happened, complaining that the parent snapshot manifest blob was not found
- 2024-09-17T04:46:14Z: another backup succeeded for the same volume with the same parent snapshot ff8935066e0e428d4f2905013353fb14

Therefore, it looks like the content was not found at one time but was found at another.
The `BLOB not found` error was reported by the object store, so please share the details of your object store. You can also check whether the object qc9e6d67bcf92c91ed614a2d63a27ce00-s670d20f3fffaf54212c exists in your object store.
"Therefore, looks like the content was not found at one time but found at the other time." -> True, I have triggered a backup almost immediately after the backup is failed at "2024-09-17T02:09:58Z".
We use AWS S3 as object store. I am not sure how to check the object in object store. But if it was not present how will the subsequent backup succeed?
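For reference, one way to check for an object in S3 is with the AWS CLI. This is only a sketch: the bucket name comes from the backup repository example below, and the `kopia/<namespace>/` prefix is an assumption about where the Velero-managed Kopia repository keeps its blobs; adjust both to your actual layout.

```sh
# Sketch only: check whether the blob the error refers to exists as an S3 object.
# Bucket taken from the resticIdentifier below; the kopia/<namespace>/ prefix is an assumption.
aws s3api head-object \
  --bucket ot3-qa-patch-velero-backup \
  --key "kopia/<namespace>/qc9e6d67bcf92c91ed614a2d63a27ce00-s670d20f3fffaf54212c"

# Or list the prefix and grep for the blob ID:
aws s3 ls "s3://ot3-qa-patch-velero-backup/kopia/<namespace>/" | grep qc9e6d67
```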
Backup repository example:

```yaml
apiVersion: velero.io/v1
kind: BackupRepository
metadata:
  creationTimestamp: "2024-08-29T11:06:27Z"
  generateName: active-tracking-default-kopia-
  generation: 480
  labels:
    velero.io/repository-type: kopia
    velero.io/storage-location: default
    velero.io/volume-namespace: active-tracking
  name: active-tracking-default-kopia-svqtc
  namespace: velero
  resourceVersion: "218166016"
  uid: 5175b990-5122-4dc6-8813-11e3db56ebb6
spec:
  backupStorageLocation: default
  maintenanceFrequency: 1h0m0s
  repositoryType: kopia
  resticIdentifier: s3:s3-eu-west-1.amazonaws.com/ot3-qa-patch-velero-backup/restic/active-tracking
  volumeNamespace: active-tracking
status:
  lastMaintenanceTime: "2024-09-18T08:16:56Z"
  phase: Ready
```
I also think the object should exist all the time. But the object store just returned 404 (BLOB not found for Kopia) at the time the problem happened. I think that is the problem.
corresponding log example from velero pod: "level=error msg="data path backup failed: Failed to run kopia backup: Failed to find previous kopia snapshot manifests for si default@default:snapshot-data-upload-download/kopia/feature-flag-proxy/redis-data-feature-flag-proxy-redis-replicas-0: unable to find manifest entries: failed to get manifests with labels map[hostname:default path:snapshot-data-upload-download/kopia/feature-flag-proxy/redis-data-feature-flag-proxy-redis-replicas-0 type:snapshot username:default]: error to find manifests: unable to load manifest contents: error loading manifest content: error getting cached content from blob \"qf139466a8e99d0a9065f927a84f0aee4-sd5191d7c73ebc3bb12c\": failed to get blob with ID qf139466a8e99d0a9065f927a84f0aee4-sd5191d7c73ebc3bb12c: BLOB not found, plugin: velero.io/csi-pvc-backupper" backup=velero/daily-bkp-20240918033004 logSource="pkg/controller/backup_controller.go:663"
Is there any way to tell the plugin to retry in case of a 404? Or what other places should we check?
This seems to happen quite frequently in multiple clusters. We tried with Velero versions 1.12 and 1.14, and with different aws-plugin versions as well.
Another noteworthy thing: please check whether the snapshots can be shared across clusters.
The VolumeSnapshots will be deleted after the DataUpload is successful, right? However, these (volumes and snapshots) can be shared across clusters. We do this testing on a daily basis.
The VolumeSnapshots are reset to retain the snapshot before deletion, so that is not the problem. I meant: please check whether the cloud provider supports sharing the snapshots between clusters, but you already confirmed in the previous reply that this is not an issue in your environment.
I think we can use the kopia CLI to check whether the snapshot exists in the object store: q78b42bb7bf6a17c91da3fdd28388d193-s8b43049f3444df3912c.
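A sketch of how that check could look with the kopia CLI, assuming the repository lives under a `kopia/<namespace>/` prefix in the bucket from the BackupRepository above (the prefix, credentials, and repository password are assumptions; kopia will prompt for the password on connect):

```sh
# Sketch only: connect to the Kopia repository directly and look for the pack blob.
# Bucket and endpoint come from the BackupRepository example above; the prefix and
# credentials are placeholders.
kopia repository connect s3 \
  --bucket=ot3-qa-patch-velero-backup \
  --endpoint=s3-eu-west-1.amazonaws.com \
  --prefix=kopia/active-tracking/ \
  --access-key=<AWS_ACCESS_KEY_ID> \
  --secret-access-key=<AWS_SECRET_ACCESS_KEY>

# List blobs whose ID starts with the one from the error message; an empty result
# means the blob really is missing from the object store.
kopia blob list --prefix=q78b42bb7bf6a17c91da3fdd28388d193
```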
@blackpiglet / @Lyndon-Li do we have any workaround for this at the moment?
Found the following warning on the node-agent. Is it related?
node-agent time="2024-09-24T02:02:58Z" level=warning msg="active indexes [xr0_7_5fc616defe00f04ddd4cbb1444f7e8c7-sc4455fdf2d353fa5-c1 xn9_07592f6e33cbfb35b171e26f3450747c-sc5a932c3e062da0f12c-c1 xn9_10f827950e25c36b2405e63cf75a226b-s65ef87e5c616798712c-c1 xn9_14ddd7437f5720c24b237bc0deb22bf8-se901f10811af4f5f12d-c1 xn9_16e195a514cbba289a0ff3ef06db5d6f-s9614bd11b68aa6b212c-c1 xn9_1b0cb5d825399323a430a121f34476eb-s588095dfbac520c412c-c1 xn9_1e9549c8004c8c7268bad88419beab36-s9e4c96b059eddf3712d-c1 xn9_363917e4ac30525628976b596ef064a6-s5db7801afbadff7212c-c1 xn9_3a694490dfde37d62dae07b9b44ed0c2-sd1bc0aaf5b64e17112d-c1 xn9_3c0d542715d47c6c98902b309a90b251-sa8791d496c562fc112c-c1 xn9_3e6be1beb10d8d4e1c912caab27c3e5d-s755589140036a22012c-c1 xn9_53b9bf91278eb5c12346df5d63674faf-sc9f612ebdbd2a0ce12c-c1 xn9_639b5b6a8082b62c25c4a3935d50b6d7-sde1a45d58364560812d-c1 xn9_65fab6b9e1f26eaec6aa67521b1f78af-s093e4037cd9fc77b12c-c1 xn9_7ce07bd6999c9b6c1faa68af67987f87-s035cfd62c5d3644e12d-c1 xn9_81ac4312bb80f0d1186b231087b23f05-s7b90ce81329cb91812c-c1 xn9_8209b3b790a161761c84389a5f194002-s62f35ccd1cb1a0cb12c-c1 xn9_cd23b9ad0846ddf02f3d760bf5ace8dc-s7cdcedd28e8b24a612c-c1 xn9_cda8b06ba4a871a6075bae991eda8111-s0d0bce2d983eadcc12c-c1 xn9_d17d9345eea8d30e44182a388c48efcd-s9cdf0b0cdbd99eaa12d-c1 xn9_d3257f5446f4da22f224f753f68ee432-sdc539f6c6c87654012c-c1 xn9_dd532a5b43313a79639213de2464e8b8-s0db90ab1c3d9fb2412d-c1 xn9_e41bfe5b368eadec73cca2640ca5a9d2-s4d49b708bddaff8312c-c1 xn9_e838905b2a045727172d1518c04d9037-s43808c5f5d632ecf12c-c1 xn9_e9fe6494f98a8bc416517fe754a40383-sa2068559f11620a612d-c1 xn9_fb57dc5c187a4de3eec07fb7911b5b3d-secb786f3edaf429e12c-c1 xn9_fe21260253354ede021f822921b376c1-s0c06d9592eb3d69b12c-c1 xn10_00609770d0c78ff21286ce008d6b8ed7-sfe0a567df5eff63b12d-c1 xn10_08418a307bad786065e08baa65bb8d74-s494cde92e1b5f13012d-c1 xn10_092752e1ffa9226f61fce1308201541e-sfcd5cedcfd2cb4d212d-c1 xn10_1dce7427ad64f225d65692ccdedd1e3a-s2a85c940c8f73af612d-c1 xn10_3819707c2f4c6e6c34114cd409345f34-sb168d866e512d21312d-c1 xn10_4670b1f09e6f5394fff9c8ff45071509-sa2cad73ac0e9fdb512d-c1 xn10_4d74081a46fdbc4903964b5a23e7f921-s2fa4528c987552e712d-c1 xn10_4ff2621b3827a3ceb351ad7ba37d0a9e-s4db7aa7af554dc7a12d-c1 xn10_5bfb3687c48dd1abbc6add70c42d1b3f-sd41468b08298693c12d-c1 xn10_89b031dbcc0eec199c40d95eeaba16ee-s3650b95ad5c3489e12d-c1 xn10_8a7222cc2486e07f8cdf1e3a783da8e8-se76da0199f2b97cc12d-c1 xn10_95349a5f4bed81e32ffef280da1a14e7-scfaa896c1d9d776a12d-c1 xn10_9c586fe9a6e4a504bd9077b1620b2c21-s74c14ec8fe45d41212d-c1 xn10_a82e80af983db869d21fa7288b868f9f-s4f92e15e30b2f9b512d-c1 xn10_ad008aaf537636d340ebb263e914cbae-sbe46290e8f673d2c12d-c1 xn10_b0e2829139c853137c673e3a86635fcf-sd31964a47c2bd32d12d-c1 xn10_bb905a8b8e266cd6994271d6065700f4-sbae8b502911ff4c312d-c1 xn10_dd76752723fb1d4deda22600d2868270-se11374fe4c1a0d8412d-c1 xn10_df433df44424bd5c167d7eee7948921a-scb663fe2e899520a12d-c1 xn10_ee77f8f18fa5207191d5e1d8b718bad5-sd823c3c30eae664a12d-c1] deletion watermark 2024-09-22 20:59:27 +0000 UTC" controller=dataupload dataupload=velero/daily-bkp-20240924020030-9qz56 logModule=kopia/kopia/format logSource="pkg/kopia/kopia_log.go:101" logger name="[index-blob-manager]" sublevel=error
The warning message is not relevant. The weird thing is that the problem happens in your env only (we don't see other users reporting this problem), and even in your env the problem does not always happen.
So we need more info to find the root cause. Please try to answer the questions below:
Hi @Lyndon-Li, we had a cluster where we did not do any manual deletion of backups, but we still saw this error there. We also increased the maintenance frequency to 24h. One observation is that this error is not as frequent in other clusters as in the production cluster, but of course that one is the most important for us. The only difference is the amount of data.
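(For reference, increasing the frequency was roughly the following; a sketch, with the CR name taken from the earlier BackupRepository example:)

```sh
# Sketch: raise the maintenance frequency on an existing BackupRepository CR.
kubectl -n velero patch backuprepository active-tracking-default-kopia-svqtc \
  --type merge -p '{"spec":{"maintenanceFrequency":"24h0m0s"}}'
```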
For this question: "Can you restore from every backup when you see this problem?"
If the answer is Yes, it means the repository snapshot is not actually missing; otherwise, the restore would fail.
Therefore, I still suspect it is an intermittent S3 API problem, so a retry may fix it. But we need to find some way to prove this suspicion.
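One possible way to prove or disprove it (a sketch, not something set up in this thread): enable S3 server access logging or CloudTrail data events on the bucket, then look up the GET request for the blob key around the failure time and see which status code S3 actually returned.

```sh
# Sketch: check whether S3 server access logging is already enabled on the bucket,
# so failed GETs and their HTTP status codes can be correlated with the backup failures.
aws s3api get-bucket-logging --bucket ot3-qa-patch-velero-backup
```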
My bad, we only restore a single namespace whose DataUpload is completed. The one with the failed DataUpload cannot be restored. The DataUpload goes to the Accepted state; once the VolumeSnapshot is ready, it goes to Prepared and then Failed. The data upload is not happening at all. Corresponding node-agent logs for one such DataUpload:
node-agent-xzxgq node-agent time="2024-10-09T02:15:11Z" level=info msg="Exposed snapshot is ready and creating data path routine" controller=dataupload dataupload=velero/daily-bkp-20241009020033-4m5h5 logSource="pkg/controller/data_upload_controller.go:258"
node-agent-xzxgq node-agent time="2024-10-09T02:15:11Z" level=info msg="Data upload is marked as in progress" controller=dataupload dataupload=velero/daily-bkp-20241009020033-4m5h5 logSource="pkg/controller/data_upload_controller.go:284"
node-agent-xzxgq node-agent time="2024-10-09T02:15:11Z" level=info msg="Run cancelable dataUpload" controller=dataupload dataupload=velero/daily-bkp-20241009020033-4m5h5 logSource="pkg/controller/data_upload_controller.go:330"
node-agent-xzxgq node-agent time="2024-10-09T02:15:11Z" level=info msg="Got volume dir" controller=dataupload dataupload=velero/daily-bkp-20241009020033-4m5h5 logSource="pkg/exposer/host_path.go:48" pod UID=163a5148-bfb3-4b69-9823-1dec50872a9b pod name=daily-bkp-20241009020033-4m5h5 volDir=pvc-98cf3ea8-9328-4bd5-837b-2c961f688e6b/mount volume=145239de-a210-4013-bc7d-284b951ee1af
node-agent-xzxgq node-agent time="2024-10-09T02:15:11Z" level=info msg="Found path matching glob" controller=dataupload dataupload=velero/daily-bkp-20241009020033-4m5h5 logSource="pkg/exposer/host_path.go:68" path="/host_pods/163a5148-bfb3-4b69-9823-1dec50872a9b/volumes/kubernetes.io~csi/pvc-98cf3ea8-9328-4bd5-837b-2c961f688e6b/mount" pod UID=163a5148-bfb3-4b69-9823-1dec50872a9b pod name=daily-bkp-20241009020033-4m5h5 volume=145239de-a210-4013-bc7d-284b951ee1af
node-agent-xzxgq node-agent time="2024-10-09T02:15:12Z" level=warning msg="active indexes [xr0_7_84846fd7cc32b00fa4316eac1e20db72-sb7fcd8d42039be18-c1 xr8_15_7b11bfb73873562d88271496e6c1b679-sb1b1112075daef90-c1 xs16_2bbaddef226d82434a22149961920e78-s62d41a382d4535ca-c1 xs17_1c39a21660639ef45ba2cc3c58a2cf25-s0fd0a71432e054fd-c1 xs18_6132fd91e15cb7f46817d115a3b4ae3b-s436414805fbfc0aa-c1 xn21_173a6f72e027d6888451dd0f5a6cda8e-saa09c65e841ff1e812d-c1 xn21_22ff5eab22781ccf94a89d7494ad5859-s7a9921e94fa8480912d-c1 xn21_24f6fd0fd274b56914db7f868f9ece0b-s73a82e8615c6a9db12d-c1 xn21_29b0b8786c8cc03c2d2d9961e15f059b-s531ed460d509a9da12d-c1 xn21_2f0d9f474daeba513c6d23a688a6714a-s382d625348e91fdb12d-c1 xn21_49863f2f55bc69bb284332fadc764b4c-s6338c9e8fcdc008e12d-c1 xn21_4e8c03f34c1bd60cf69f0937a3aab842-sddb17f391444081112d-c1 xn21_54015f93e74f681e051727a191e8e255-s1701224ea10a9a6e12d-c1 xn21_5ea4a9b43741958c25b23cc761ff8251-s783f40d755bfa63f12d-c1 xn21_75013d00120378b3aec3a22276877061-s24cd8bd367ad33d612d-c1 xn21_a61c404a0213a92d3ee8db2e28a35dd6-s6be0a35f810f81b912d-c1 xn21_afd955cde2be6d83387dfc8a1c0dac0a-s505acb1abe49bc2c12d-c1 xn21_b282a7b45fcaeaa86ef456489b1c833f-sdeecdd07ef64e5f412d-c1 xn21_c309dc18c2642c1fcc5666aae504b0fc-s01763ba0fb0f134612d-c1 xn21_da48c980eb3fa44b2da7cdca4fa4aa1e-sd7de2efd7d1fe3d512d-c1 xn21_de03cd1f532f579e978df2093cc9666f-s830fae6c5f69286312d-c1 xn21_e1b6342274b334993060438f3ea72c4a-sbedc98d4b9c0246b12d-c1 xn21_e3f779d92a06186952729f0c876b89fa-s6804b8214cdb089d12d-c1 xn21_eed63f9ac30738915e688af81b08bed0-sff092b35e86d7f5412d-c1 xn21_f58d7bf065d12d1d5751fc6c702da50a-s1ef543d920a044b412d-c1 xn22_0c955b3d4dc6deaadf2f98bb0cd3ba46-s61a04a4b8ca16bd812d-c1 xn22_3f51316ef40a4056674840e3d3880d90-s2105294b2fcc651a12d-c1 xn22_a8ae76caecea418d588dfe12af6cb564-s3e34f1f30440989d12d-c1 xn22_bb182b1341393233b0330e64c5d9aa58-sbb40caf3611baa2212d-c1 xn22_d8b165d11a05c1135020821dbdaee30a-s96d351948d4a5a8312d-c1 xn22_ff61b04a77d7b756452d81aa62b44727-s85d5a317f30ed81812d-c1] deletion watermark 2024-10-08 01:07:32 +0000 UTC" controller=dataupload dataupload=velero/daily-bkp-20241009020033-4m5h5 logModule=kopia/kopia/format logSource="pkg/kopia/kopia_log.go:101" sublevel=error
node-agent-xzxgq node-agent time="2024-10-09T02:15:12Z" level=info msg="Opening backup repo" controller=dataupload dataupload=velero/daily-bkp-20241009020033-4m5h5 logSource="pkg/uploader/provider/kopia.go:78" repoUID=922a870e-4d96-424f-ad21-75c6f471582c
node-agent-xzxgq node-agent time="2024-10-09T02:15:12Z" level=warning msg="active indexes [xr0_7_84846fd7cc32b00fa4316eac1e20db72-sb7fcd8d42039be18-c1 xr8_15_7b11bfb73873562d88271496e6c1b679-sb1b1112075daef90-c1 xs16_2bbaddef226d82434a22149961920e78-s62d41a382d4535ca-c1 xs17_1c39a21660639ef45ba2cc3c58a2cf25-s0fd0a71432e054fd-c1 xs18_6132fd91e15cb7f46817d115a3b4ae3b-s436414805fbfc0aa-c1 xn21_173a6f72e027d6888451dd0f5a6cda8e-saa09c65e841ff1e812d-c1 xn21_22ff5eab22781ccf94a89d7494ad5859-s7a9921e94fa8480912d-c1 xn21_24f6fd0fd274b56914db7f868f9ece0b-s73a82e8615c6a9db12d-c1 xn21_29b0b8786c8cc03c2d2d9961e15f059b-s531ed460d509a9da12d-c1 xn21_2f0d9f474daeba513c6d23a688a6714a-s382d625348e91fdb12d-c1 xn21_49863f2f55bc69bb284332fadc764b4c-s6338c9e8fcdc008e12d-c1 xn21_4e8c03f34c1bd60cf69f0937a3aab842-sddb17f391444081112d-c1 xn21_54015f93e74f681e051727a191e8e255-s1701224ea10a9a6e12d-c1 xn21_5ea4a9b43741958c25b23cc761ff8251-s783f40d755bfa63f12d-c1 xn21_75013d00120378b3aec3a22276877061-s24cd8bd367ad33d612d-c1 xn21_a61c404a0213a92d3ee8db2e28a35dd6-s6be0a35f810f81b912d-c1 xn21_afd955cde2be6d83387dfc8a1c0dac0a-s505acb1abe49bc2c12d-c1 xn21_b282a7b45fcaeaa86ef456489b1c833f-sdeecdd07ef64e5f412d-c1 xn21_c309dc18c2642c1fcc5666aae504b0fc-s01763ba0fb0f134612d-c1 xn21_da48c980eb3fa44b2da7cdca4fa4aa1e-sd7de2efd7d1fe3d512d-c1 xn21_de03cd1f532f579e978df2093cc9666f-s830fae6c5f69286312d-c1 xn21_e1b6342274b334993060438f3ea72c4a-sbedc98d4b9c0246b12d-c1 xn21_e3f779d92a06186952729f0c876b89fa-s6804b8214cdb089d12d-c1 xn21_eed63f9ac30738915e688af81b08bed0-sff092b35e86d7f5412d-c1 xn21_f58d7bf065d12d1d5751fc6c702da50a-s1ef543d920a044b412d-c1 xn22_0c955b3d4dc6deaadf2f98bb0cd3ba46-s61a04a4b8ca16bd812d-c1 xn22_3f51316ef40a4056674840e3d3880d90-s2105294b2fcc651a12d-c1 xn22_a8ae76caecea418d588dfe12af6cb564-s3e34f1f30440989d12d-c1 xn22_bb182b1341393233b0330e64c5d9aa58-sbb40caf3611baa2212d-c1 xn22_d8b165d11a05c1135020821dbdaee30a-s96d351948d4a5a8312d-c1 xn22_ff61b04a77d7b756452d81aa62b44727-s85d5a317f30ed81812d-c1] deletion watermark 2024-10-08 01:07:32 +0000 UTC" controller=dataupload dataupload=velero/daily-bkp-20241009020033-4m5h5 logModule=kopia/kopia/format logSource="pkg/kopia/kopia_log.go:101" sublevel=error
node-agent-xzxgq node-agent time="2024-10-09T02:15:12Z" level=info msg="FileSystemBR is initialized" bsl=default controller=dataupload dataupload=velero/daily-bkp-20241009020033-4m5h5 jobName=daily-bkp-20241009020033-4m5h5 logSource="pkg/datapath/file_system.go:110" repository=kopia source namespace=active-tracking uploader=kopia
node-agent-xzxgq node-agent time="2024-10-09T02:15:12Z" level=info msg="fs init" controller=dataupload dataupload=velero/daily-bkp-20241009020033-4m5h5 logSource="pkg/controller/data_upload_controller.go:341" path="/host_pods/163a5148-bfb3-4b69-9823-1dec50872a9b/volumes/kubernetes.io~csi/pvc-98cf3ea8-9328-4bd5-837b-2c961f688e6b/mount"
node-agent-xzxgq node-agent time="2024-10-09T02:15:12Z" level=info msg="Async fs backup data path started" controller=dataupload dataupload=velero/daily-bkp-20241009020033-4m5h5 logSource="pkg/controller/data_upload_controller.go:350" path="/host_pods/163a5148-bfb3-4b69-9823-1dec50872a9b/volumes/kubernetes.io~csi/pvc-98cf3ea8-9328-4bd5-837b-2c961f688e6b/mount"
node-agent-xzxgq node-agent time="2024-10-09T02:15:12Z" level=info msg="Reconcile daily-bkp-20241009020033-4m5h5" controller=dataupload dataupload=velero/daily-bkp-20241009020033-4m5h5 logSource="pkg/controller/data_upload_controller.go:113"
node-agent-xzxgq node-agent time="2024-10-09T02:15:12Z" level=info msg="Data upload is in progress" controller=dataupload dataupload=velero/daily-bkp-20241009020033-4m5h5 logSource="pkg/controller/data_upload_controller.go:292"
node-agent-xzxgq node-agent time="2024-10-09T02:15:12Z" level=info msg="Starting backup" controller=dataupload dataupload=velero/daily-bkp-20241009020033-4m5h5 logSource="pkg/uploader/provider/kopia.go:144" parentSnapshot= path="/host_pods/163a5148-bfb3-4b69-9823-1dec50872a9b/volumes/kubernetes.io~csi/pvc-98cf3ea8-9328-4bd5-837b-2c961f688e6b/mount" realSource=active-tracking/datadir-active-tracking-mongodb-primary-0
node-agent-xzxgq node-agent time="2024-10-09T02:15:12Z" level=info msg="Start to snapshot..." controller=dataupload dataupload=velero/daily-bkp-20241009020033-4m5h5 logSource="pkg/uploader/kopia/snapshot.go:230" parentSnapshot= path="/host_pods/163a5148-bfb3-4b69-9823-1dec50872a9b/volumes/kubernetes.io~csi/pvc-98cf3ea8-9328-4bd5-837b-2c961f688e6b/mount" realSource=active-tracking/datadir-active-tracking-mongodb-primary-0
node-agent-xzxgq node-agent time="2024-10-09T02:15:12Z" level=info msg="Searching for parent snapshot" controller=dataupload dataupload=velero/daily-bkp-20241009020033-4m5h5 logSource="pkg/uploader/kopia/snapshot.go:245" parentSnapshot= path="/host_pods/163a5148-bfb3-4b69-9823-1dec50872a9b/volumes/kubernetes.io~csi/pvc-98cf3ea8-9328-4bd5-837b-2c961f688e6b/mount" realSource=active-tracking/datadir-active-tracking-mongodb-primary-0
node-agent-xzxgq node-agent time="2024-10-09T02:15:12Z" level=error msg="Async fs backup data path failed" dataupload=daily-bkp-20241009020033-4m5h5 error="Failed to run kopia backup: Failed to find previous kopia snapshot manifests for si default@default:snapshot-data-upload-download/kopia/active-tracking/datadir-active-tracking-mongodb-primary-0: unable to find manifest entries: failed to get manifests with labels map[hostname:default path:snapshot-data-upload-download/kopia/active-tracking/datadir-active-tracking-mongodb-primary-0 type:snapshot username:default]: error to find manifests: unable to load manifest contents: error loading manifest content: error getting cached content: failed to get blob with ID q504b6cb0fdaf00c81b2b2ecc15e9be36-s032bc87b4027220412d: BLOB not found" error.file="/go/pkg/mod/github.com/project-velero/kopia@v0.13.0-velero.1/repo/blob/storage.go:186" error.function=github.com/kopia/kopia/repo/blob.init logSource="pkg/controller/data_upload_controller.go:403"
node-agent-xzxgq node-agent time="2024-10-09T02:15:12Z" level=info msg="Action finished" controller=dataupload dataupload=velero/daily-bkp-20241009020033-4m5h5 logSource="pkg/uploader/provider/kopia.go:91"
node-agent-xzxgq node-agent time="2024-10-09T02:15:13Z" level=warning msg="Failed to patch dataupload with err Failed to run kopia backup: Failed to find previous kopia snapshot manifests for si default@default:snapshot-data-upload-download/kopia/active-tracking/datadir-active-tracking-mongodb-primary-0: unable to find manifest entries: failed to get manifests with labels map[hostname:default path:snapshot-data-upload-download/kopia/active-tracking/datadir-active-tracking-mongodb-primary-0 type:snapshot username:default]: error to find manifests: unable to load manifest contents: error loading manifest content: error getting cached content: failed to get blob with ID q504b6cb0fdaf00c81b2b2ecc15e9be36-s032bc87b4027220412d: BLOB not found" dataupload=daily-bkp-20241009020033-4m5h5 error="Failed to run kopia backup: Failed to find previous kopia snapshot manifests for si default@default:snapshot-data-upload-download/kopia/active-tracking/datadir-active-tracking-mongodb-primary-0: unable to find manifest entries: failed to get manifests with labels map[hostname:default path:snapshot-data-upload-download/kopia/active-tracking/datadir-active-tracking-mongodb-primary-0 type:snapshot username:default]: error to find manifests: unable to load manifest contents: error loading manifest content: error getting cached content: failed to get blob with ID q504b6cb0fdaf00c81b2b2ecc15e9be36-s032bc87b4027220412d: BLOB not found" error.file="/go/pkg/mod/github.com/project-velero/kopia@v0.13.0-velero.1/repo/blob/storage.go:186" error.function=github.com/kopia/kopia/repo/blob.init logSource="pkg/controller/data_upload_controller.go:410"
node-agent-xzxgq node-agent time="2024-10-09T02:15:13Z" level=info msg="Reconcile daily-bkp-20241009020033-4m5h5" controller=dataupload dataupload=velero/daily-bkp-20241009020033-4m5h5 logSource="pkg/controller/data_upload_controller.go:113"
node-agent-xzxgq node-agent time="2024-10-09T02:15:13Z" level=info msg="FileSystemBR is closed" controller=dataupload dataupload=velero/daily-bkp-20241009020033-4m5h5 logSource="pkg/datapath/file_system.go:129" user=daily-bkp-20241009020033-4m5h5
node-agent-xzxgq node-agent time="2024-10-09T02:15:13Z" level=info msg="Reconcile daily-bkp-20241009020033-4m5h5" controller=dataupload dataupload=velero/daily-bkp-20241009020033-4m5h5 logSource="pkg/controller/data_upload_controller.go:113"
Is this somehow related to this Kopia issue: https://kopia.discourse.group/t/can-not-access-repository-failed-to-get-blob/1274/17? If so, how do we resolve it? The error being intermittent suggests the snapshot is present. And this happens frequently in clusters with fairly large data (retention of 14 days), which means a larger repository compared to the small clusters (retention of 3 days).
Hi @Lyndon-Li, one more thing: if I delete all the BackupRepository CRs and trigger a backup, it completes successfully. Before the deletion of the backup repositories, a few DataUploads had been failing for about a week. And then after a few days it starts failing again.
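(For clarity, the workaround is nothing more than the following; a sketch:)

```sh
# Sketch of the workaround described above: delete the BackupRepository CRs so Velero
# recreates/reconnects them on the next backup. This does not touch any data in S3.
kubectl -n velero get backuprepositories.velero.io
kubectl -n velero delete backuprepositories.velero.io --all
```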
It further indicates that the repo snapshot is not missing. Deleting the BackupRepository CR doesn't change the data of the repo; it just reconnects the repo. If the snapshot were missing from the repo, deleting the BackupRepository CR would not solve the problem.
But if this is an intermittent connection issue, how are other DataUploads of the same repo (other volumes in the same namespace) completing at the same time?
Is it some issue with IRSA? I can confirm that for the last few days I have been deleting the BackupRepository CRs frequently and I don't see this issue. But this is for sure not the way we would like to go forward.
What steps did you take and what happened:
Velero version 1.12, also tried upgrading to 1.14.1 with AWS plugin 1.10.1.
A scheduled backup runs every day with the CSI data mover.
Backups are intermittently in PartiallyFailed status, with a few DataUploads in the Failed state. Describing the failed DataUploads shows the following error:
data path backup failed: Failed to run kopia backup: Failed to find previous kopia snapshot manifests for si default@default:snapshot-data-upload-download/kopia/elastic-system/elasticsearch-data-logging-elasticsearch-es-master-nodeset-2: unable to find manifest entries: failed to get manifests with labels map[hostname:default path:snapshot-data-upload-download/kopia/elastic-system/elasticsearch-data-logging-elasticsearch-es-master-nodeset-2 type:snapshot username:default]: error to find manifests: unable to load manifest contents: error loading manifest content: error getting cached content from blob "q78b42bb7bf6a17c91da3fdd28388d193-s8b43049f3444df3912c": failed to get blob with ID q78b42bb7bf6a17c91da3fdd28388d193-s8b43049f3444df3912c: BLOB not found
What did you expect to happen:
All DataUploads are in the Completed state and backups succeed every day.
The following information will help us better understand what's going on:
If you are using velero v1.7.0+:
Please use `velero debug --backup <backupname> --restore <restorename>` to generate the support bundle and attach it to this issue. For more options, please refer to `velero debug --help`.
bundle-2024-09-17-11-32-20.tar.gz
If you are using earlier versions:
Please provide the output of the following commands (pasting long output into a GitHub gist or other pastebin is fine):
- `kubectl logs deployment/velero -n velero`
- `velero backup describe <backupname>` or `kubectl get backup/<backupname> -n velero -o yaml`
- `velero backup logs <backupname>`
- `velero restore describe <restorename>` or `kubectl get restore/<restorename> -n velero -o yaml`
- `velero restore logs <restorename>`
Anything else you would like to add:
Environment: Production environment
- Velero version (use `velero version`):
- Velero features (use `velero client config get features`):
- Kubernetes version (use `kubectl version`):
- OS (e.g. from `/etc/os-release`):