vmware-tanzu / velero

Backup and migrate Kubernetes applications and their persistent volumes
https://velero.io
Apache License 2.0

Couple issues with linstor backups to s3 #8017

Open jonathon2nd opened 1 month ago

jonathon2nd commented 1 month ago

What steps did you take and what happened:

Some snapshots are not finishing, for reasons that are not clear. Some snapshots report completed but are missing from S3. For example, pvc-99b78545-d39b-4502-be80-50eb3c4e428b is not present in S3. The issue does not seem to be tied to a particular k8s host or the external Linstor host; there is no correlation between the affected PVs and k8s-worker errors.

What did you expect to happen: All snapshots should be present in S3.

The following information will help us better understand what's going on: REPO URL: https://vmware-tanzu.github.io/helm-charts CHART: velero:7.1.0

If you are using velero v1.7.0+:
Please use velero debug --backup <backupname> --restore <restorename> to generate the support bundle and attach it to this issue; for more options, refer to velero debug --help. This is not working for me, so logs are attached manually below.
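
A minimal sketch of collecting the same information by hand when velero debug fails, assuming the default velero namespace and deployment name from the Helm chart; <backupname> is a placeholder:

# velero debug is the preferred route; when it fails, the pieces can be gathered with standard Velero commands
velero backup describe <backupname> --details   # per-volume / CSI snapshot status
velero backup logs <backupname>                 # backup log (as attached below)
kubectl -n velero logs deploy/velero            # server log; namespace and deployment name assumed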

postgres-backup-logs.txt mongodb-backup-logs.txt postgres-backup.txt mongodb-backup.txt preprod-velero-c57d98d86-65lhx_velero.log

Anything else you would like to add:

Velero is only uploading for mongodb and postgres; nothing is included for tasks or api-cache, which you can see in the attached screenshots.

Environment:

Vote on this issue!

This is an invitation to the Velero community to vote on issues; you can see the project's top-voted issues listed here.
Use the "reaction smiley face" up to the right of this comment to vote.

blackpiglet commented 1 month ago

There is some warning and error information in the Velero pod log. IMO, this is still related to the block-device backend; there should be corresponding information in Linstor's snapshot controller (see the sketch after the log excerpts below).

time="2024-07-15T20:39:55Z" level=warning msg="VolumeSnapshot has a temporary error Failed to check and update snapshot content: failed to take snapshot of the volume pvc-e384a2d9-5a9c-40fb-be46-feed10269ce3: \"rpc error: code = Internal desc = tried deleting a leftover unsuccessful snapshot\". Snapshot controller will retry later." cmd=/velero logSource="pkg/backup/actions/csi/volumesnapshot_action.go:318" pluginName=velero

time="2024-07-15T20:40:35Z" level=warning msg="VolumeSnapshot has a temporary error Failed to check and update snapshot content: failed to take snapshot of the volume pvc-e384a2d9-5a9c-40fb-be46-feed10269ce3: \"rpc error: code = Internal desc = failed to create snapshot: error creating S3 backup: Message: '(Node: 'ovbh-vtest-k8s02-worker01') Shutdown of the DRBD resource 'pvc-e384a2d9-5a9c-40fb-be46-feed10269ce3 failed'; Cause: 'The external command for stopping the DRBD resource failed'; Correction: '- Check whether the required software is installed\\n- Check whether the application's search path includes the location\\n  of the external software\\n- Check whether the application has execute permission for the external command\\n'; Reports: '[66958733-EF22C-000006]'\". Snapshot controller will retry later." cmd=/velero logSource="pkg/backup/actions/csi/volumesnapshot_action.go:318" pluginName=velero
time="2024-07-15T20:42:47Z" level=error msg="VolumeSnapshot mongodb3/velero-mongod-data-preprod-mongodb-rs0-2-c8jkz is not ready. This is not expected." backup=velero/mongodb3-test1 cmd=/velero controller=backup-finalizer logSource="pkg/util/csi/volume_snapshot.go:534" pluginName=velero

time="2024-07-15T20:37:37Z" level=error msg="VolumeSnapshot api-cache3/velero-redis-data-preprod-api-cache-redis-node-0-tkfd4 is not ready. This is not expected." backup=velero/api-cache-test1 cmd=/velero controller=backup-finalizer logSource="pkg/util/csi/volume_snapshot.go:534" pluginName=velero
phoenix-bjoern commented 1 month ago

@jonathon2nd I'm a Piraeus/Linstor user too. I've been working with the Linbit team to sort out all kinds of backup & restore issues, and we have a quite stable solution now. I recommend updating Linstor and the CSI driver to the latest versions (Linstor v1.28.0, CSI v1.6.3), as they contain all the fixes. Linstor CSI < 1.6.2 in particular had some caching issues, so that could be related to the problem you've described. Velero has never caused any issue for us.
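
A minimal sketch, assuming the Piraeus operator components run in a piraeus-datastore namespace, for confirming which image versions are actually deployed on the Kubernetes side:

# Print pod name and container images for the Piraeus/Linstor CSI components (namespace assumed)
kubectl -n piraeus-datastore get pods \
  -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.containers[*].image}{"\n"}{end}'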

jonathon2nd commented 1 month ago

@blackpiglet Testing with just postgres now. Deleted and recreated the PV in question to see if that was the cause; looks like a no. The b3f6e526 PV is missing in S3 (see attached screenshots). Attachments: preprod-velero-694fb9894b-7slbz_velero.log, postgres1-backup.txt, postgres1-backup-logs.txt

@phoenix-bjoern We are using XOstor on our Xen hosts, and are using the following values for piraeus-operator v2.5.1:

---
installCRDs: true
imageConfigOverride:
- base: quay.io/piraeusdatastore
  components:
    linstor-satellite:
      image: piraeus-server
      tag: v1.26.1
    drbd-module-loader:
      image: drbd9-almalinux9
      tag: v9.2.10
    linstor-csi:
      tag: v1.6.3
      image: piraeus-csi
jonathon2nd commented 1 month ago

Restarted the Linstor satellite on the host corresponding to the missing postgres PV; no change.

Ran a fresh test. One of the PVs is still missing in S3. No useful errors. Attachments: preprod-velero-c57d98d86-5w6xf_velero.log, postgres1-backup.txt, postgres1-backup-logs.txt (plus screenshots).

phoenix-bjoern commented 1 month ago

@jonathon2nd Have you configured the S3 remote for Linstor? Check "linstor remote list". I would recommend configuring the S3 credentials via the StorageClass and a Secret. Furthermore, you have to check the output of the Linstor CSI controller pod. IMHO this is a problem in the storage driver: Velero only sees the K8s and CSI part, which doesn't provide enough insight. Focus on the storage component (Linstor) to debug.
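
A minimal sketch of those two checks, reusing the controller address from this thread and assuming a linstor-csi-controller deployment in a piraeus-datastore namespace:

# Confirm the S3 remote is registered with the external Linstor controller
linstor --controllers=10.2.0.19 remote list

# Check the Linstor CSI controller output for snapshot/backup errors (deployment and namespace assumed)
kubectl -n piraeus-datastore logs deploy/linstor-csi-controller -c linstor-csi --since=1h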

jonathon2nd commented 1 month ago

Yes, the S3 remote has been configured for Linstor, and it is actively using it. However, only 2 out of 3 PVs are showing up in S3. There are no errors in csi-snapshotter (registry.k8s.io/sig-storage/csi-snapshotter:v7.0.2).
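
For reference, a hedged sketch of pulling the csi-snapshotter sidecar logs; the label selector and namespace are assumptions for this cluster:

# csi-snapshotter runs as a sidecar of the CSI controller pod (label and namespace assumed)
kubectl -n piraeus-datastore logs -l app.kubernetes.io/component=linstor-csi-controller \
  -c csi-snapshotter --since=1h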

Restarting the suspected xen host resulted in no change to test results, still missing the same pv in backup to s3.

jonathon2nd commented 1 month ago

I am seeing this in the Linstor controller for the external cluster:

jonathon@jonathon-framework:~$ linstor --controllers=10.2.0.19 backup l linbit-velero-preprod-backup 
╭─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
┊ Resource                                 ┊ Snapshot                                      ┊ Finished at         ┊ Based On ┊ Status  ┊
╞═════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════╡
┊ pvc-78b0a64b-8804-4d64-95de-03b131a7dfd5 ┊ snapshot-36447f2e-9dab-486c-8155-b9da97eae7b3 ┊ 2024-07-17 10:10:46 ┊          ┊ Success ┊
┊ pvc-a1888eda-45aa-48ad-854a-c9d315491fdc ┊ snapshot-cfefa630-4499-4156-bd0a-5320a17df107 ┊ 2024-07-17 10:10:10 ┊          ┊ Success ┊
╰─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯

No logs for the satellite though :/ I went and got all the error logs from the controller for a new test I just ran (with the exact same result, so not uploading the Velero logs again): linstor-controller-errors.txt. It says the backup is being queued, but it never seems to get done.
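
A minimal sketch of pulling those error reports straight from the external controller (the report id is a placeholder):

# List recent error reports on the controller, then show a specific one
linstor --controllers=10.2.0.19 error-reports list
linstor --controllers=10.2.0.19 error-reports show <REPORT-ID>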

jonathon2nd commented 1 month ago

So far I am not seeing any issues on the Linstor side.

jonathon@jonathon-framework:~$ linstor --controllers=10.2.0.19 snapshot l | grep -e 'pvc-b3f6e526-4024-487a-b862-3f50dbb7d7f1'
| pvc-b3f6e526-4024-487a-b862-3f50dbb7d7f1 | snapshot-1613e2aa-2726-4a22-acdb-96afd4b9c134 | ovbh-pprod-xen12                                     | 0: 20 GiB | 2024-07-05 15:57:14 | Successful |
| pvc-b3f6e526-4024-487a-b862-3f50dbb7d7f1 | snapshot-1e7b7479-ff81-4f75-8bb5-d73ede3f3eb1 | ovbh-pprod-xen12                                     | 0: 20 GiB | 2024-07-16 12:09:23 | Successful |
| pvc-b3f6e526-4024-487a-b862-3f50dbb7d7f1 | snapshot-4c2e519f-19a6-4815-8bb9-cf75acac0c23 | ovbh-pprod-xen12                                     | 0: 20 GiB | 2024-07-15 13:39:04 | Successful |
| pvc-b3f6e526-4024-487a-b862-3f50dbb7d7f1 | snapshot-82f29431-500a-438a-82fc-ae45f2e1ac11 | ovbh-pprod-xen12                                     | 0: 20 GiB | 2024-07-16 12:25:40 | Successful |
| pvc-b3f6e526-4024-487a-b862-3f50dbb7d7f1 | snapshot-94fed088-9de1-4705-aace-8928dab5a44c | ovbh-pprod-xen12                                     | 0: 20 GiB | 2024-07-16 12:31:05 | Successful |
| pvc-b3f6e526-4024-487a-b862-3f50dbb7d7f1 | snapshot-99f01961-b9bc-49f0-8383-edc0ef74b79c | ovbh-pprod-xen12                                     | 0: 20 GiB | 2024-07-16 14:06:15 | Successful |
| pvc-b3f6e526-4024-487a-b862-3f50dbb7d7f1 | snapshot-9d7282d2-72b3-469a-adf7-46619ab8a286 | ovbh-pprod-xen12                                     | 0: 20 GiB | 2024-07-17 09:16:34 | Successful |
| pvc-b3f6e526-4024-487a-b862-3f50dbb7d7f1 | snapshot-e392d2d1-6e18-49dc-a07f-64bc79fdd57e | ovbh-pprod-xen12                                     | 0: 20 GiB | 2024-07-17 10:10:00 | Successful |

If there is a command you want me to run or somewhere specific I can look, let me know.

phoenix-bjoern commented 1 month ago

@jonathon2nd I was thinking about your comment:

However, only 2 out of 3 PVs are showing up in S3.

Maybe Linstor doesn't receive the backup request at all. Can you verify that with linstor snapshot list? That should answer the question of whether the backup request has been received by Linstor and what the actual backup status is. Maybe the snapshot is stuck in "shipping" or "error" status. If that's the case, it sometimes helps to simply delete them (linstor snapshot delete pvc-xyz). N.B. If the snapshots can't be deleted, consider updating to Linstor v1.28.0; it contains a fix for such scenarios.
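
A minimal sketch of those checks against the PV that keeps going missing in this thread (the snapshot name is a placeholder to be taken from the list output):

# Has Linstor received a snapshot request for the affected PVC, and what state is it in?
linstor --controllers=10.2.0.19 snapshot list | grep pvc-b3f6e526-4024-487a-b862-3f50dbb7d7f1

# If a snapshot is stuck in shipping/error, deleting it sometimes clears the state
linstor --controllers=10.2.0.19 snapshot delete pvc-b3f6e526-4024-487a-b862-3f50dbb7d7f1 <snapshot-name>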

jonathon2nd commented 1 month ago

I did a snapshot list and grepped for the PV name (see above); they are all Successful.

Here is the full list as it currently is, with no deletions.

jonathon@jonathon-framework:~$ linstor --controllers=10.2.0.19 snapshot l
╭────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
┊ ResourceName                             ┊ SnapshotName                                  ┊ NodeNames                                            ┊ Volumes   ┊ CreatedOn           ┊ State      ┊
╞════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════╡
┊ pvc-065ebfae-c8aa-4c7b-9268-11ad03699f7b ┊ snap1                                         ┊ ovbh-pprod-xen10, ovbh-pprod-xen11, ovbh-pprod-xen12 ┊ 0: 50 GiB ┊ 2023-01-17 08:45:25 ┊ Successful ┊
┊ pvc-1fa382ec-dc7a-4946-8696-42df8c85b7bf ┊ back_20230307_130715                          ┊ ovbh-pprod-xen13                                     ┊ 0: 25 GiB ┊ 2023-03-07 10:07:17 ┊ Successful ┊
┊ pvc-1fa382ec-dc7a-4946-8696-42df8c85b7bf ┊ back_20230307_132132                          ┊ ovbh-pprod-xen13                                     ┊ 0: 25 GiB ┊ 2023-03-07 10:21:33 ┊ Successful ┊
┊ pvc-1fa382ec-dc7a-4946-8696-42df8c85b7bf ┊ back_20230307_140755                          ┊ ovbh-pprod-xen13                                     ┊ 0: 25 GiB ┊ 2023-03-07 11:07:56 ┊ Successful ┊
┊ pvc-1fa382ec-dc7a-4946-8696-42df8c85b7bf ┊ back_20230307_161825                          ┊ ovbh-pprod-xen13                                     ┊ 0: 25 GiB ┊ 2023-03-07 13:18:51 ┊ Successful ┊
┊ pvc-1fa382ec-dc7a-4946-8696-42df8c85b7bf ┊ snapshot-22c691bb-93e6-44c4-9449-2f044348d5b4 ┊ ovbh-pprod-xen13                                     ┊ 0: 25 GiB ┊ 2023-03-07 10:08:03 ┊ Successful ┊
┊ pvc-4d81696e-0ada-46a9-8ba8-95a2efab24f5 ┊ snapshot-b656dd4b-a30f-4f52-a01c-a45d0c4a108e ┊ ovbh-pprod-xen11                                     ┊ 0: 8 GiB  ┊ 2024-07-15 13:37:04 ┊ Successful ┊
┊ pvc-6282e007-afb1-4049-ac8a-03e812e8e6dd ┊ snapshot-bc02f57f-f1f1-4313-8264-6fe01febd588 ┊ ovbh-pprod-xen12                                     ┊ 0: 8 GiB  ┊ 2024-07-15 10:53:30 ┊ Successful ┊
┊ pvc-78b0a64b-8804-4d64-95de-03b131a7dfd5 ┊ snapshot-36447f2e-9dab-486c-8155-b9da97eae7b3 ┊ ovbh-pprod-xen11                                     ┊ 0: 20 GiB ┊ 2024-07-17 10:10:10 ┊ Successful ┊
┊ pvc-78b0a64b-8804-4d64-95de-03b131a7dfd5 ┊ snapshot-789e63fa-bfae-450f-a111-a66d659eea6a ┊ ovbh-pprod-xen11                                     ┊ 0: 20 GiB ┊ 2024-07-15 13:39:25 ┊ Successful ┊
┊ pvc-99b78545-d39b-4502-be80-50eb3c4e428b ┊ snapshot-1dac25c5-bf23-42a0-a487-43080ae7bd54 ┊ ovbh-pprod-xen12                                     ┊ 0: 10 GiB ┊ 2024-07-15 10:43:04 ┊ Successful ┊
┊ pvc-99b78545-d39b-4502-be80-50eb3c4e428b ┊ snapshot-7defd36b-0f88-4fa8-9e98-16c5d38a46d5 ┊ ovbh-pprod-xen12                                     ┊ 0: 10 GiB ┊ 2024-07-16 12:08:24 ┊ Successful ┊
┊ pvc-99b78545-d39b-4502-be80-50eb3c4e428b ┊ snapshot-8b8e2023-6eee-4bf3-8fac-bf777b2776ff ┊ ovbh-pprod-xen12                                     ┊ 0: 10 GiB ┊ 2024-07-15 13:38:40 ┊ Successful ┊
┊ pvc-99b78545-d39b-4502-be80-50eb3c4e428b ┊ snapshot-c7a1728c-c1f1-4afa-ad0b-6425ff0c87aa ┊ ovbh-pprod-xen12                                     ┊ 0: 10 GiB ┊ 2024-07-05 15:22:15 ┊ Successful ┊
┊ pvc-99b78545-d39b-4502-be80-50eb3c4e428b ┊ snapshot-e4d8f369-91a1-4b82-b277-abf8f49be2d1 ┊ ovbh-pprod-xen12                                     ┊ 0: 10 GiB ┊ 2024-07-05 15:44:51 ┊ Successful ┊
┊ pvc-a1888eda-45aa-48ad-854a-c9d315491fdc ┊ snapshot-cfefa630-4499-4156-bd0a-5320a17df107 ┊ ovbh-pprod-xen13                                     ┊ 0: 20 GiB ┊ 2024-07-17 10:10:04 ┊ Successful ┊
┊ pvc-a836d7d1-7c3d-4c8d-b2bf-69eab3a62e03 ┊ snapshot-b62a5041-bdba-4aff-ab3b-db0b343a132c ┊ ovbh-pprod-xen10                                     ┊ 0: 10 GiB ┊ 2024-07-15 11:08:57 ┊ Successful ┊
┊ pvc-a836d7d1-7c3d-4c8d-b2bf-69eab3a62e03 ┊ snapshot-dfde5bba-cc28-47f9-a9ab-4c3a86294688 ┊ ovbh-pprod-xen10                                     ┊ 0: 10 GiB ┊ 2024-07-15 13:38:46 ┊ Successful ┊
┊ pvc-b039e181-3d9d-46ce-95ee-b7b57164a490 ┊ snapshot-0c46a4f4-af2a-44cd-8cda-ac9ee0e185e7 ┊ ovbh-pprod-xen13                                     ┊ 0: 8 GiB  ┊ 2024-07-05 14:10:34 ┊ Successful ┊
┊ pvc-b039e181-3d9d-46ce-95ee-b7b57164a490 ┊ snapshot-3e07e0d3-a359-43a5-9175-be85f8568dcd ┊ ovbh-pprod-xen13                                     ┊ 0: 8 GiB  ┊ 2024-07-05 15:11:12 ┊ Successful ┊
┊ pvc-b039e181-3d9d-46ce-95ee-b7b57164a490 ┊ snapshot-b4dafea8-3519-408e-adc5-cffd3212817c ┊ ovbh-pprod-xen13                                     ┊ 0: 8 GiB  ┊ 2024-07-05 14:50:08 ┊ Successful ┊
┊ pvc-b039e181-3d9d-46ce-95ee-b7b57164a490 ┊ snapshot-cc0e4bd4-854e-4906-bd06-29456e8e4988 ┊ ovbh-pprod-xen13                                     ┊ 0: 8 GiB  ┊ 2024-07-15 13:36:53 ┊ Successful ┊
┊ pvc-b3f6e526-4024-487a-b862-3f50dbb7d7f1 ┊ snapshot-1613e2aa-2726-4a22-acdb-96afd4b9c134 ┊ ovbh-pprod-xen12                                     ┊ 0: 20 GiB ┊ 2024-07-05 15:57:14 ┊ Successful ┊
┊ pvc-b3f6e526-4024-487a-b862-3f50dbb7d7f1 ┊ snapshot-1e7b7479-ff81-4f75-8bb5-d73ede3f3eb1 ┊ ovbh-pprod-xen12                                     ┊ 0: 20 GiB ┊ 2024-07-16 12:09:23 ┊ Successful ┊
┊ pvc-b3f6e526-4024-487a-b862-3f50dbb7d7f1 ┊ snapshot-4c2e519f-19a6-4815-8bb9-cf75acac0c23 ┊ ovbh-pprod-xen12                                     ┊ 0: 20 GiB ┊ 2024-07-15 13:39:04 ┊ Successful ┊
┊ pvc-b3f6e526-4024-487a-b862-3f50dbb7d7f1 ┊ snapshot-82f29431-500a-438a-82fc-ae45f2e1ac11 ┊ ovbh-pprod-xen12                                     ┊ 0: 20 GiB ┊ 2024-07-16 12:25:40 ┊ Successful ┊
┊ pvc-b3f6e526-4024-487a-b862-3f50dbb7d7f1 ┊ snapshot-94fed088-9de1-4705-aace-8928dab5a44c ┊ ovbh-pprod-xen12                                     ┊ 0: 20 GiB ┊ 2024-07-16 12:31:05 ┊ Successful ┊
┊ pvc-b3f6e526-4024-487a-b862-3f50dbb7d7f1 ┊ snapshot-99f01961-b9bc-49f0-8383-edc0ef74b79c ┊ ovbh-pprod-xen12                                     ┊ 0: 20 GiB ┊ 2024-07-16 14:06:15 ┊ Successful ┊
┊ pvc-b3f6e526-4024-487a-b862-3f50dbb7d7f1 ┊ snapshot-9d7282d2-72b3-469a-adf7-46619ab8a286 ┊ ovbh-pprod-xen12                                     ┊ 0: 20 GiB ┊ 2024-07-17 09:16:34 ┊ Successful ┊
┊ pvc-b3f6e526-4024-487a-b862-3f50dbb7d7f1 ┊ snapshot-e392d2d1-6e18-49dc-a07f-64bc79fdd57e ┊ ovbh-pprod-xen12                                     ┊ 0: 20 GiB ┊ 2024-07-17 10:10:00 ┊ Successful ┊
┊ pvc-beb22792-02e6-4cb5-80ae-927d4d2760e4 ┊ snapshot-bc471bfb-664e-4a79-9914-fe6fdd50fe16 ┊ ovbh-pprod-xen13                                     ┊ 0: 8 GiB  ┊ 2024-07-15 14:09:37 ┊ Successful ┊
┊ pvc-c74d602e-dca4-402e-8d0e-e45e1661bf47 ┊ snapshot-0ff84b94-2b58-4c64-809f-07635be25674 ┊ ovbh-pprod-xen12                                     ┊ 0: 8 GiB  ┊ 2024-07-05 15:11:17 ┊ Successful ┊
┊ pvc-c74d602e-dca4-402e-8d0e-e45e1661bf47 ┊ snapshot-2c1fea74-704f-46a0-9068-85ba9b4d48fe ┊ ovbh-pprod-xen12                                     ┊ 0: 8 GiB  ┊ 2024-07-15 13:37:00 ┊ Successful ┊
┊ pvc-c74d602e-dca4-402e-8d0e-e45e1661bf47 ┊ snapshot-31cf7640-a547-4bc6-81b7-36cfb8dfd443 ┊ ovbh-pprod-xen12                                     ┊ 0: 8 GiB  ┊ 2024-07-05 14:10:43 ┊ Successful ┊
┊ pvc-c74d602e-dca4-402e-8d0e-e45e1661bf47 ┊ snapshot-7a95a055-16b7-48aa-9adc-444d2286831c ┊ ovbh-pprod-xen12                                     ┊ 0: 8 GiB  ┊ 2024-07-05 14:50:15 ┊ Successful ┊
┊ pvc-c74d602e-dca4-402e-8d0e-e45e1661bf47 ┊ snapshot-c3339056-8a9f-4b8d-889a-a811d698f9fd ┊ ovbh-pprod-xen12                                     ┊ 0: 8 GiB  ┊ 2024-07-16 12:07:54 ┊ Successful ┊
┊ pvc-c74d602e-dca4-402e-8d0e-e45e1661bf47 ┊ snapshot-f23aefbe-be69-4729-b53b-7e166a03c9a0 ┊ ovbh-pprod-xen12                                     ┊ 0: 8 GiB  ┊ 2024-07-05 14:24:54 ┊ Successful ┊
┊ pvc-ca67b2f5-32c0-4caa-a224-670fca73fc5c ┊ snapshot-12c73577-9ac2-4616-b934-7c98ef94a565 ┊ ovbh-pprod-xen10                                     ┊ 0: 8 GiB  ┊ 2024-07-15 14:09:53 ┊ Successful ┊
┊ pvc-cf8a0226-3469-47fd-9a03-a6f173033234 ┊ snapshot-aea23710-96f7-4ab8-82a9-35bedcb6a1c6 ┊ ovbh-pprod-xen13                                     ┊ 0: 10 GiB ┊ 2024-07-05 15:22:09 ┊ Successful ┊
┊ pvc-cf8a0226-3469-47fd-9a03-a6f173033234 ┊ snapshot-d01704f1-c8a9-4f33-924f-8e2cdd777e12 ┊ ovbh-pprod-xen13                                     ┊ 0: 10 GiB ┊ 2024-07-15 10:42:59 ┊ Successful ┊
┊ pvc-cf8a0226-3469-47fd-9a03-a6f173033234 ┊ snapshot-f3a45d1a-a998-4c58-b4da-da8ad6b89542 ┊ ovbh-pprod-xen13                                     ┊ 0: 10 GiB ┊ 2024-07-15 13:38:30 ┊ Successful ┊
┊ pvc-d81bbaa2-63b1-48b4-bc0e-3e3b62384719 ┊ snapshot-3b09528c-bdc5-4bd7-b13d-bd6cc6d04a8b ┊ ovbh-pprod-xen13                                     ┊ 0: 8 GiB  ┊ 2024-07-15 10:53:24 ┊ Successful ┊
╰────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
phoenix-bjoern commented 1 month ago

@jonathon2nd While taking the backup, maybe use watch linstor s l. When the snapshot request appears in the list it should switch from "Incomplete" to "Successful", then "Shipping" and finally "Successful" again. This is because Linstor can also take local LVM/ZFS snapshots without shipping them to S3 (depending on the VolumeSnapshotClass settings). Usually the snapshots stay in the "Shipping" state for a reasonable time while the LVM snapshot is shipped from the node to the S3 bucket. I had a situation where it switched within seconds and no data was visible on S3; in that case the node had an LVM issue and the DRBD resources didn't show up when executing lsblk. After the node was rebooted the issue was gone and the resources were shipped correctly again.
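
A minimal sketch of that check, watching the snapshot states during a backup run and verifying the block devices on the suspect node:

# Watch snapshot states while the Velero backup runs
watch -n 2 "linstor --controllers=10.2.0.19 snapshot list"

# On the node owning the resource, confirm the backing LVM/DRBD devices are visible
lsblk | grep -e 'pvc--b3f6e526--4024--487a--b862'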

jonathon2nd commented 1 month ago

That host was restarted 5 days ago now.

At this point it must be something with this host, since it never even seems to attempt the S3 shipping.

I saw two of the snapshots go into Shipping; the third one never did.

jonathon@jonathon-framework:~$ linstor s l
╭────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
┊ ResourceName                             ┊ SnapshotName                                  ┊ NodeNames        ┊ Volumes   ┊ CreatedOn           ┊ State      ┊
╞════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════╡
┊ pvc-78b0a64b-8804-4d64-95de-03b131a7dfd5 ┊ snapshot-03191289-b245-41ff-9c7e-9ba581a387ce ┊ ovbh-pprod-xen11 ┊ 0: 20 GiB ┊ 2024-07-22 11:33:31 ┊ Successful ┊
┊ pvc-a1888eda-45aa-48ad-854a-c9d315491fdc ┊ snapshot-c63b8169-49fd-4c95-b0dd-7269912a7034 ┊ ovbh-pprod-xen13 ┊ 0: 20 GiB ┊ 2024-07-22 11:33:25 ┊ Successful ┊
┊ pvc-b3f6e526-4024-487a-b862-3f50dbb7d7f1 ┊ snapshot-31d1c8bb-7683-4e53-88fb-5ca8a4a00bf5 ┊ ovbh-pprod-xen12 ┊ 0: 20 GiB ┊ 2024-07-22 11:33:21 ┊ Successful ┊
┊ pvc-b3f6e526-4024-487a-b862-3f50dbb7d7f1 ┊ snapshot-9d7282d2-72b3-469a-adf7-46619ab8a286 ┊ ovbh-pprod-xen12 ┊ 0: 20 GiB ┊ 2024-07-17 09:16:34 ┊ Successful ┊
┊ pvc-b3f6e526-4024-487a-b862-3f50dbb7d7f1 ┊ snapshot-e392d2d1-6e18-49dc-a07f-64bc79fdd57e ┊ ovbh-pprod-xen12 ┊ 0: 20 GiB ┊ 2024-07-17 10:10:00 ┊ Successful ┊
╰────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
jonathon@jonathon-framework:~$ linstor s d pvc-b3f6e526-4024-487a-b862-3f50dbb7d7f1 snapshot-31d1c8bb-7683-4e53-88fb-5ca8a4a00bf5
ERROR:
Description:
    Snapshot definition snapshot-31d1c8bb-7683-4e53-88fb-5ca8a4a00bf5 of resource pvc-b3f6e526-4024-487a-b862-3f50dbb7d7f1 is currently being queued for backup shipping. Please wait until the shipping is finished or use backup abort --create
Details:
    Resource: pvc-b3f6e526-4024-487a-b862-3f50dbb7d7f1, Snapshot: snapshot-31d1c8bb-7683-4e53-88fb-5ca8a4a00bf5
Show reports:
    linstor error-reports show 6697E7DB-00000-000021

Error report:

jonathon@jonathon-framework:~$ linstor error-reports show 6697E7DB-00000-000021
ERROR REPORT 6697E7DB-00000-000021

============================================================

Application:                        LINBIT® LINSTOR
Module:                             Controller
Version:                            1.26.1
Build ID:                           12746ac9c6e7882807972c3df56e9a89eccad4e5
Build time:                         2024-02-22T05:27:50+00:00
Error time:                         2024-07-22 14:50:29
Node:                               ovbh-pprod-xen10
Thread:                             grizzly-http-server-11
Access context information

Identity:                           PUBLIC
Role:                               PUBLIC
Domain:                             PUBLIC

Peer:                               RestClient(10.1.8.153; 'PythonLinstor/1.23.0 (API1.0.4): Client 1.23.0')

============================================================

Reported error:
===============

Category:                           RuntimeException
Class name:                         ApiRcException
Class canonical name:               com.linbit.linstor.core.apicallhandler.response.ApiRcException
Generated at:                       Method 'ensureSnapshotNotQueued', Source file 'CtrlSnapshotDeleteApiCallHandler.java', Line #260

Error message:                      Snapshot definition snapshot-31d1c8bb-7683-4e53-88fb-5ca8a4a00bf5 of resource pvc-b3f6e526-4024-487a-b862-3f50dbb7d7f1 is currently being queued for backup shipping. Please wait until the shipping is finished or use backup abort --create

Error context:
        Snapshot definition snapshot-31d1c8bb-7683-4e53-88fb-5ca8a4a00bf5 of resource pvc-b3f6e526-4024-487a-b862-3f50dbb7d7f1 is currently being queued for backup shipping. Please wait until the shipping is finished or use backup abort --create
Asynchronous stage backtrace:

    Error has been observed at the following site(s):
        *__checkpoint ⇢ Delete snapshot
    Original Stack Trace:

Call backtrace:

    Method                                   Native Class:Line number
    ensureSnapshotNotQueued                  N      com.linbit.linstor.core.apicallhandler.controller.CtrlSnapshotDeleteApiCallHandler:260

Suppressed exception 1 of 1:
===============
Category:                           RuntimeException
Class name:                         OnAssemblyException
Class canonical name:               reactor.core.publisher.FluxOnAssembly.OnAssemblyException
Generated at:                       Method 'ensureSnapshotNotQueued', Source file 'CtrlSnapshotDeleteApiCallHandler.java', Line #260

Error message:                      
Error has been observed at the following site(s):
    *__checkpoint ⇢ Delete snapshot
Original Stack Trace:

Error context:
        Snapshot definition snapshot-31d1c8bb-7683-4e53-88fb-5ca8a4a00bf5 of resource pvc-b3f6e526-4024-487a-b862-3f50dbb7d7f1 is currently being queued for backup shipping. Please wait until the shipping is finished or use backup abort --create
Call backtrace:

    Method                                   Native Class:Line number
    ensureSnapshotNotQueued                  N      com.linbit.linstor.core.apicallhandler.controller.CtrlSnapshotDeleteApiCallHandler:260
    deleteSnapshotInTransaction              N      com.linbit.linstor.core.apicallhandler.controller.CtrlSnapshotDeleteApiCallHandler:183
    lambda$deleteSnapshot$0                  N      com.linbit.linstor.core.apicallhandler.controller.CtrlSnapshotDeleteApiCallHandler:136
    doInScope                                N      com.linbit.linstor.core.apicallhandler.ScopeRunner:149
    lambda$fluxInScope$0                     N      com.linbit.linstor.core.apicallhandler.ScopeRunner:76
    call                                     N      reactor.core.publisher.MonoCallable:72
    trySubscribeScalarMap                    N      reactor.core.publisher.FluxFlatMap:127
    subscribeOrReturn                        N      reactor.core.publisher.MonoFlatMapMany:49
    subscribe                                N      reactor.core.publisher.Flux:8759
    onNext                                   N      reactor.core.publisher.MonoFlatMapMany$FlatMapManyMain:195
    request                                  N      reactor.core.publisher.Operators$ScalarSubscription:2545
    onSubscribe                              N      reactor.core.publisher.MonoFlatMapMany$FlatMapManyMain:141
    subscribe                                N      reactor.core.publisher.MonoJust:55
    subscribe                                N      reactor.core.publisher.MonoDeferContextual:55
    subscribe                                N      reactor.core.publisher.Flux:8773
    onNext                                   N      reactor.core.publisher.MonoFlatMapMany$FlatMapManyMain:195
    request                                  N      reactor.core.publisher.Operators$ScalarSubscription:2545
    onSubscribe                              N      reactor.core.publisher.MonoFlatMapMany$FlatMapManyMain:141
    subscribe                                N      reactor.core.publisher.MonoJust:55
    subscribe                                N      reactor.core.publisher.MonoDeferContextual:55
    subscribe                                N      reactor.core.publisher.Mono:4495
    subscribeWith                            N      reactor.core.publisher.Mono:4561
    subscribe                                N      reactor.core.publisher.Mono:4462
    subscribe                                N      reactor.core.publisher.Mono:4398
    subscribe                                N      reactor.core.publisher.Mono:4370
    doFlux                                   N      com.linbit.linstor.api.rest.v1.RequestHelper:324
    deleteSnapshot                           N      com.linbit.linstor.api.rest.v1.Snapshots:169
    invoke                                   N      jdk.internal.reflect.GeneratedMethodAccessor52:unknown
    invoke                                   N      jdk.internal.reflect.DelegatingMethodAccessorImpl:43
    invoke                                   N      java.lang.reflect.Method:566
    lambda$static$0                          N      org.glassfish.jersey.server.model.internal.ResourceMethodInvocationHandlerFactory:52
    run                                      N      org.glassfish.jersey.server.model.internal.AbstractJavaResourceMethodDispatcher$1:146
    invoke                                   N      org.glassfish.jersey.server.model.internal.AbstractJavaResourceMethodDispatcher:189
    doDispatch                               N      org.glassfish.jersey.server.model.internal.JavaResourceMethodDispatcherProvider$VoidOutInvoker:159
    dispatch                                 N      org.glassfish.jersey.server.model.internal.AbstractJavaResourceMethodDispatcher:93
    invoke                                   N      org.glassfish.jersey.server.model.ResourceMethodInvoker:478
    apply                                    N      org.glassfish.jersey.server.model.ResourceMethodInvoker:400
    apply                                    N      org.glassfish.jersey.server.model.ResourceMethodInvoker:81
    run                                      N      org.glassfish.jersey.server.ServerRuntime$1:256
    call                                     N      org.glassfish.jersey.internal.Errors$1:248
    call                                     N      org.glassfish.jersey.internal.Errors$1:244
    process                                  N      org.glassfish.jersey.internal.Errors:292
    process                                  N      org.glassfish.jersey.internal.Errors:274
    process                                  N      org.glassfish.jersey.internal.Errors:244
    runInScope                               N      org.glassfish.jersey.process.internal.RequestScope:265
    process                                  N      org.glassfish.jersey.server.ServerRuntime:235
    handle                                   N      org.glassfish.jersey.server.ApplicationHandler:684
    service                                  N      org.glassfish.jersey.grizzly2.httpserver.GrizzlyHttpContainer:356
    run                                      N      org.glassfish.grizzly.http.server.HttpHandler$1:190
    doWork                                   N      org.glassfish.grizzly.threadpool.AbstractThreadPool$Worker:535
    run                                      N      org.glassfish.grizzly.threadpool.AbstractThreadPool$Worker:515
    run                                      N      java.lang.Thread:829
pvc-b3f6e526-4024-487a-b862-3f50dbb7d7f1 role:Secondary
  disk:UpToDate
  ovbh-vtest-k8s02-worker03.floatplane.com role:Primary
    peer-disk:Diskless

Lots of entries here, presumably from previous attempts:

[14:48 ovbh-pprod-xen12 ~]# lsblk | grep -C 3 -e 'pvc--b3f6e526--4024--487a--b862'
└─nvme0n1p2                                                                                                                  259:5     0   1.8T  0 part  
  ├─linstor_group-thin_device_tdata                                                                                          252:1     0   3.5T  0 lvm   
  │ └─linstor_group-thin_device-tpool                                                                                        252:2     0   3.5T  0 lvm   
  │   ├─linstor_group-pvc--b3f6e526--4024--487a--b862--3f50dbb7d7f1_00000_snapshot--31d1c8bb--7683--4e53--88fb--5ca8a4a00bf5 252:20    0    20G  0 lvm   
  │   ├─linstor_group-pvc--b3f6e526--4024--487a--b862--3f50dbb7d7f1_00000_snapshot--9d7282d2--72b3--469a--adf7--46619ab8a286 252:49    0    20G  0 lvm   
  │   ├─linstor_group-pvc--6408a214--6def--44c4--8d9a--bebb67be5510_00000                                                    252:10    0    10G  0 lvm   
  │   │ └─drbd1057                                                                                                           147:1057  0    10G  0 disk  
  │   ├─linstor_group-pvc--04d7092d--5556--4ec5--8b36--4b62b9af0415_00000                                                    252:39    0   100G  0 lvm   
--
  │   │ └─drbd1113                                                                                                           147:1113  0    16G  0 disk  
  │   ├─linstor_group-pvc--48211679--90a7--44d2--bd57--2a27ed412bc9_00000                                                    252:51    0    50G  0 lvm   
  │   │ └─drbd1004                                                                                                           147:1004  0    50G  0 disk  
  │   ├─linstor_group-pvc--b3f6e526--4024--487a--b862--3f50dbb7d7f1_00000                                                    252:13    0    20G  0 lvm   
  │   │ └─drbd1009                                                                                                           147:1009  0    20G  0 disk  
  │   ├─linstor_group-pvc--5ae83a90--9684--4985--85c9--2f10c3a08a05_00000                                                    252:41    0    50G  0 lvm   
  │   │ └─drbd1024                                                                                                           147:1024  0    50G  0 disk  
--
  │   │ └─drbd1003                                                                                                           147:1003  0    10G  0 disk  
  │   ├─linstor_group-pvc--43e8e066--c424--4144--b117--34bf9bff5640_00000                                                    252:3     0    50G  0 lvm   
  │   │ └─drbd1078                                                                                                           147:1078  0    50G  0 disk  
  │   ├─linstor_group-pvc--b3f6e526--4024--487a--b862--3f50dbb7d7f1_00000_snapshot--e392d2d1--6e18--49dc--a07f--64bc79fdd57e 252:50    0    20G  0 lvm   
  │   ├─linstor_group-pvc--adc7c355--5496--48f8--8c9a--067105c83b99_00000                                                    252:12    0    10G  0 lvm   
  │   │ └─drbd1027                                                                                                           147:1027  0    10G  0 disk  
  │   ├─linstor_group-pvc--e546b94f--829f--4d6f--82b2--2bd107951492_00000                                                    252:40    0     8G  0 lvm   
--
  │     └─drbd1077                                                                                                           147:1077  0    10G  0 disk  
  └─linstor_group-thin_device_tmeta                                                                                          252:0     0   112M  0 lvm   
    └─linstor_group-thin_device-tpool                                                                                        252:2     0   3.5T  0 lvm   
      ├─linstor_group-pvc--b3f6e526--4024--487a--b862--3f50dbb7d7f1_00000_snapshot--31d1c8bb--7683--4e53--88fb--5ca8a4a00bf5 252:20    0    20G  0 lvm   
      ├─linstor_group-pvc--b3f6e526--4024--487a--b862--3f50dbb7d7f1_00000_snapshot--9d7282d2--72b3--469a--adf7--46619ab8a286 252:49    0    20G  0 lvm   
      ├─linstor_group-pvc--6408a214--6def--44c4--8d9a--bebb67be5510_00000                                                    252:10    0    10G  0 lvm   
      │ └─drbd1057                                                                                                           147:1057  0    10G  0 disk  
      ├─linstor_group-pvc--04d7092d--5556--4ec5--8b36--4b62b9af0415_00000                                                    252:39    0   100G  0 lvm   
--
      │ └─drbd1113                                                                                                           147:1113  0    16G  0 disk  
      ├─linstor_group-pvc--48211679--90a7--44d2--bd57--2a27ed412bc9_00000                                                    252:51    0    50G  0 lvm   
      │ └─drbd1004                                                                                                           147:1004  0    50G  0 disk  
      ├─linstor_group-pvc--b3f6e526--4024--487a--b862--3f50dbb7d7f1_00000                                                    252:13    0    20G  0 lvm   
      │ └─drbd1009                                                                                                           147:1009  0    20G  0 disk  
      ├─linstor_group-pvc--5ae83a90--9684--4985--85c9--2f10c3a08a05_00000                                                    252:41    0    50G  0 lvm   
      │ └─drbd1024                                                                                                           147:1024  0    50G  0 disk  
--
      │ └─drbd1003                                                                                                           147:1003  0    10G  0 disk  
      ├─linstor_group-pvc--43e8e066--c424--4144--b117--34bf9bff5640_00000                                                    252:3     0    50G  0 lvm   
      │ └─drbd1078                                                                                                           147:1078  0    50G  0 disk  
      ├─linstor_group-pvc--b3f6e526--4024--487a--b862--3f50dbb7d7f1_00000_snapshot--e392d2d1--6e18--49dc--a07f--64bc79fdd57e 252:50    0    20G  0 lvm   
      ├─linstor_group-pvc--adc7c355--5496--48f8--8c9a--067105c83b99_00000                                                    252:12    0    10G  0 lvm   
      │ └─drbd1027                                                                                                           147:1027  0    10G  0 disk  
      ├─linstor_group-pvc--e546b94f--829f--4d6f--82b2--2bd107951492_00000                                                    252:40    0     8G  0 lvm   
--
├─nvme1n1p2                                                                                                                  259:4     0   1.8T  0 part  
│ └─linstor_group-thin_device_tdata                                                                                          252:1     0   3.5T  0 lvm   
│   └─linstor_group-thin_device-tpool                                                                                        252:2     0   3.5T  0 lvm   
│     ├─linstor_group-pvc--b3f6e526--4024--487a--b862--3f50dbb7d7f1_00000_snapshot--31d1c8bb--7683--4e53--88fb--5ca8a4a00bf5 252:20    0    20G  0 lvm   
│     ├─linstor_group-pvc--b3f6e526--4024--487a--b862--3f50dbb7d7f1_00000_snapshot--9d7282d2--72b3--469a--adf7--46619ab8a286 252:49    0    20G  0 lvm   
│     ├─linstor_group-pvc--6408a214--6def--44c4--8d9a--bebb67be5510_00000                                                    252:10    0    10G  0 lvm   
│     │ └─drbd1057                                                                                                           147:1057  0    10G  0 disk  
│     ├─linstor_group-pvc--04d7092d--5556--4ec5--8b36--4b62b9af0415_00000                                                    252:39    0   100G  0 lvm   
--
│     │ └─drbd1113                                                                                                           147:1113  0    16G  0 disk  
│     ├─linstor_group-pvc--48211679--90a7--44d2--bd57--2a27ed412bc9_00000                                                    252:51    0    50G  0 lvm   
│     │ └─drbd1004                                                                                                           147:1004  0    50G  0 disk  
│     ├─linstor_group-pvc--b3f6e526--4024--487a--b862--3f50dbb7d7f1_00000                                                    252:13    0    20G  0 lvm   
│     │ └─drbd1009                                                                                                           147:1009  0    20G  0 disk  
│     ├─linstor_group-pvc--5ae83a90--9684--4985--85c9--2f10c3a08a05_00000                                                    252:41    0    50G  0 lvm   
│     │ └─drbd1024                                                                                                           147:1024  0    50G  0 disk  
--
│     │ └─drbd1003                                                                                                           147:1003  0    10G  0 disk  
│     ├─linstor_group-pvc--43e8e066--c424--4144--b117--34bf9bff5640_00000                                                    252:3     0    50G  0 lvm   
│     │ └─drbd1078                                                                                                           147:1078  0    50G  0 disk  
│     ├─linstor_group-pvc--b3f6e526--4024--487a--b862--3f50dbb7d7f1_00000_snapshot--e392d2d1--6e18--49dc--a07f--64bc79fdd57e 252:50    0    20G  0 lvm   
│     ├─linstor_group-pvc--adc7c355--5496--48f8--8c9a--067105c83b99_00000                                                    252:12    0    10G  0 lvm   
│     │ └─drbd1027                                                                                                           147:1027  0    10G  0 disk  
│     ├─linstor_group-pvc--e546b94f--829f--4d6f--82b2--2bd107951492_00000                                                    252:40    0     8G  0 lvm   
phoenix-bjoern commented 1 month ago

@jonathon2nd Sometimes it is helpful to manually clean up broken snapshots. You can remove the LVM snapshots with lvremove /dev/linstor_group-thin_device/pvc.... Also check the other nodes. I recommend updating to Linstor v1.28.0 and/or Piraeus Operator v2.5.2, which contain fixes for cleaning up snapshots that are stuck indefinitely in the processing/shipping state (linstor s d pvc-…).
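
A minimal sketch of that cleanup for the snapshot seen stuck above; the VG/LV names are read back from the lsblk output (device-mapper doubles each dash), so confirm with lvs before removing anything:

# On the affected node: list snapshot LVs in the linstor_group volume group
lvs linstor_group | grep snapshot

# Remove the leftover LVM snapshot (name derived from the lsblk output above)
lvremove linstor_group/pvc-b3f6e526-4024-487a-b862-3f50dbb7d7f1_00000_snapshot-31d1c8bb-7683-4e53-88fb-5ca8a4a00bf5

# Then retry deleting the snapshot definition in Linstor
linstor --controllers=10.2.0.19 snapshot delete pvc-b3f6e526-4024-487a-b862-3f50dbb7d7f1 snapshot-31d1c8bb-7683-4e53-88fb-5ca8a4a00bf5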

jonathon2nd commented 1 month ago

I will try that out. We are already running Piraeus Operator v2.5.2. We are using XOstor, which uses Linstor, and are on the latest version: https://xcp-ng.org/forum/topic/5361/xostor-hyperconvergence-preview