Open blampe opened 2 years ago
Weird. Since the error seems to be in listBackups
(which gets called by the CSI driver on creation) I assume it's not actually about any snapshot it tries to create, but about what files already exist in the bucket. This looks like an issue that needs to be fixed upstream: https://github.com/LINBIT/linstor-server/issues
Running with --log-level=debug
shed a little more light on this...
time="2022-09-02T02:10:03Z" level=debug msg="no snapshot matching id found, trying backups" func="github.com/piraeusdatastore/linstor-csi/pkg/client.(*Linstor).snapOrBackupById" file="/src/pkg/client/linstor.go:1367" id=snapshot-6b4f6adb-7a82-4d2f-a72c-2bc987d70dba linstorCSIComponent=client
time="2022-09-02T02:10:03Z" level=debug msg="curl -X 'GET' -H 'Accept: application/json' 'https://piraeus-op-cs.piraeus.svc:3371/v1/remotes/s3'"
time="2022-09-02T02:10:03Z" level=debug msg="listing backups in remote" func="github.com/piraeusdatastore/linstor-csi/pkg/client.(*Linstor).snapOrBackupById" file="/src/pkg/client/linstor.go:1377" id=snapshot-6b4f6adb-7a82-4d2f-a72c-2bc987d70dba linstorCSIComponent=client remote=minio
time="2022-09-02T02:10:03Z" level=debug msg="curl -X 'GET' -H 'Accept: application/json' 'https://piraeus-op-cs.piraeus.svc:3371/v1/remotes/minio/backups?snap_name=snapshot-6b4f6adb-7a82-4d2f-a72c-2bc987d70dba'"
time="2022-09-02T02:10:03Z" level=debug msg="found existing snapshot" func=github.com/piraeusdatastore/linstor-csi/pkg/driver.Driver.CreateSnapshot file="/src/pkg/driver/driver.go:965" existingSnapshot="snapshot_id:\"snapshot-dc295533-ba78-4e3f-b110-54175f1386f1\" source_volume_id:\"pvc-621a06d8-0503-4cf3-84af-d75e1bf40508\" creation_time:<seconds:1662034561 > " linstorCSIComponent=driver nodeID=pi4c provisioner=linstor.csi.linbit.com requestedSnapshotName=snapshot-1a9289d2-b984-4527-baf4-522b42908e0e requestedSnapshotSourceVolume=pvc-d2ec0c0d-f114-4fe2-9c30-a56279e356b6 version=v0.20.0-d514e41db7cdcb580769cc69f1c1ef2b8a5def5d
That last line is interesting... It queries for a backup with snap_name=snapshot-6b4f6adb-7a82-4d2f-a72c-2bc987d70dba
and it gets a result... but that result is a totally different one (snapshot-dc295533-ba78-4e3f-b110-54175f1386f1
) than the one it asked for?
Furthermore the snapshot returned is unrecoverable. That should at least unblock me (I can delete it), although I'm surprised it wasn't deleted/cleaned up automatically (it was totally hidden in the CLI, which is as good as deleted, anyway).
root@piraeus-op-cs-controller-5b8ff7f4c8-nqns5:/# linstor backup l minio -s snapshot-6b4f6adb-7a82-4d2f-a72c-2bc987d70dba -p
+------------------------------------------------------------------------------------------------------------------------------------+
| Resource | Snapshot | Finished at | Based On | Status |
|====================================================================================================================================|
| pvc-621a06d8-0503-4cf3-84af-d75e1bf40508 | snapshot-dc295533-ba78-4e3f-b110-54175f1386f1 | | | Not restorable |
+------------------------------------------------------------------------------------------------------------------------------------+
What's still confusing me (and where there's maybe room for the CSI to improve) is why this UUID keeps getting recycled for so many different volumes. I know very little about CSI, but naively it seems like each snapshot request should have its own UUID, making collisions like this impossible. The only way I can see that happening is if CompatibleSnapshotId
gets the same invalid name for multiple snapshots... some of my PVCs do have identical names, if that's relevant? Is there any downside to adding entropy to the fallback ID that's generated?
@WanzenBug take a look at https://github.com/LINBIT/linstor-server/pull/314 -- the presence of any backups in the cluster completely breaks snapshotting.
I'm trying to setup snapshot shipping via CSI (https://github.com/piraeusdatastore/linstor-csi/pull/142) without much success. Most recently I had a flood of errors around parse failures, but there's not enough detail to figure out what it's failing to parse. I thought maybe it had something to do with my VolumeClassName ("linstor.minio") having a period in the name maybe throwing off parsing, but I was able to create another backup (from a significantly smaller volume) using the same class. This all might be a red herring...
It seems after these NumberFormatExceptions happen the system gets into a mixed up state as you can see below -- namely it tries to re-assign an existing snapshot name to a different PVC.
pvc-157c3fce-5720-4ca0-ba11-e2cab638ea93 is indeed another volume that was snapshotted around the same time as this one. Is there perhaps a race condition where creating multiple snapshots around the same time can cause names to not be unique?
# linstor err l
``` | 630C4D06-00000-000045 | 2022-08-29 13:29:19 | C|piraeus-op-cs-controller-5b8ff7f4c8-lfbk2 | NumberFormatException: multiple points | | 630C4D06-00000-000046 | 2022-08-29 13:29:19 | C|piraeus-op-cs-controller-5b8ff7f4c8-lfbk2 | NumberFormatException: multiple points | | 630C4D06-00000-000047 | 2022-08-29 13:30:30 | C|piraeus-op-cs-controller-5b8ff7f4c8-lfbk2 | NumberFormatException: multiple points | | 630C4D06-00000-000048 | 2022-08-29 13:30:30 | C|piraeus-op-cs-controller-5b8ff7f4c8-lfbk2 | NumberFormatException: multiple points | | 630C4D06-00000-000049 | 2022-08-29 13:31:19 | C|piraeus-op-cs-controller-5b8ff7f4c8-lfbk2 | NumberFormatException: multiple points | | 630C4D06-00000-000050 | 2022-08-29 13:31:19 | C|piraeus-op-cs-controller-5b8ff7f4c8-lfbk2 | NumberFormatException: multiple points | | 630C4D06-00000-000051 | 2022-08-29 13:38:20 | C|piraeus-op-cs-controller-5b8ff7f4c8-lfbk2 | NumberFormatException: multiple points | | 630C4D06-00000-000052 | 2022-08-29 13:38:20 | C|piraeus-op-cs-controller-5b8ff7f4c8-lfbk2 | NumberFormatException: multiple points | | 630C4D06-00000-000053 | 2022-08-29 13:53:45 | C|piraeus-op-cs-controller-5b8ff7f4c8-lfbk2 | NumberFormatException: multiple points | | 630C4D06-00000-000054 | 2022-08-29 13:53:45 | C|piraeus-op-cs-controller-5b8ff7f4c8-lfbk2 | NumberFormatException: multiple points | | 630C4D06-00000-000055 | 2022-08-29 14:00:40 | C|piraeus-op-cs-controller-5b8ff7f4c8-lfbk2 | NumberFormatException: multiple points | | 630C4D06-00000-000056 | 2022-08-29 14:00:40 | C|piraeus-op-cs-controller-5b8ff7f4c8-lfbk2 | NumberFormatException: multiple points | | 630C4D06-00000-000057 | 2022-08-29 14:04:54 | C|piraeus-op-cs-controller-5b8ff7f4c8-lfbk2 | NumberFormatException: multiple points | | 630C4D06-00000-000058 | 2022-08-29 14:04:54 | C|piraeus-op-cs-controller-5b8ff7f4c8-lfbk2 | NumberFormatException: multiple points | | 630C4D06-00000-000059 | 2022-08-29 14:08:40 | C|piraeus-op-cs-controller-5b8ff7f4c8-lfbk2 | NumberFormatException: multiple points | | 630C4D06-00000-000060 | 2022-08-29 14:08:40 | C|piraeus-op-cs-controller-5b8ff7f4c8-lfbk2 | NumberFormatException: multiple points | | 630C4D06-00000-000061 | 2022-08-29 14:08:41 | C|piraeus-op-cs-controller-5b8ff7f4c8-lfbk2 | NumberFormatException: multiple points | | 630C4D06-00000-000062 | 2022-08-29 14:08:41 | C|piraeus-op-cs-controller-5b8ff7f4c8-lfbk2 | NumberFormatException: multiple points | | 630C4D06-00000-000063 | 2022-08-29 14:16:47 | C|piraeus-op-cs-controller-5b8ff7f4c8-lfbk2 | NumberFormatException: multiple points | | 630C4D06-00000-000064 | 2022-08-29 14:16:47 | C|piraeus-op-cs-controller-5b8ff7f4c8-lfbk2 | NumberFormatException: multiple points | | 630C4D06-00000-000065 | 2022-08-29 14:21:52 | C|piraeus-op-cs-controller-5b8ff7f4c8-lfbk2 | NumberFormatException: multiple points | | 630C4D06-00000-000066 | 2022-08-29 14:21:52 | C|piraeus-op-cs-controller-5b8ff7f4c8-lfbk2 | NumberFormatException: multiple points | | 630C4D06-00000-000067 | 2022-08-29 14:26:04 | C|piraeus-op-cs-controller-5b8ff7f4c8-lfbk2 | NumberFormatException: For input string: ".2022022E4.2022022E4" | | 630C4D06-00000-000068 | 2022-08-29 14:28:44 | C|piraeus-op-cs-controller-5b8ff7f4c8-lfbk2 | NumberFormatException: multiple points | | 630C4D06-00000-000069 | 2022-08-29 14:28:44 | C|piraeus-op-cs-controller-5b8ff7f4c8-lfbk2 | NumberFormatException: multiple points | | 630C4D06-00000-000070 | 2022-08-29 14:30:09 | C|piraeus-op-cs-controller-5b8ff7f4c8-lfbk2 | NumberFormatException: multiple points | | 630C4D06-00000-000071 | 2022-08-29 14:30:09 | C|piraeus-op-cs-controller-5b8ff7f4c8-lfbk2 | NumberFormatException: multiple points | | 630C4D06-00000-000072 | 2022-08-29 14:31:58 | C|piraeus-op-cs-controller-5b8ff7f4c8-lfbk2 | NumberFormatException: multiple points | | 630C4D06-00000-000073 | 2022-08-29 14:31:58 | C|piraeus-op-cs-controller-5b8ff7f4c8-lfbk2 | NumberFormatException: multiple points | | 630C4D06-00000-000074 | 2022-08-29 14:44:20 | C|piraeus-op-cs-controller-5b8ff7f4c8-lfbk2 | NumberFormatException: multiple points | | 630C4D06-00000-000075 | 2022-08-29 14:44:20 | C|piraeus-op-cs-controller-5b8ff7f4c8-lfbk2 | NumberFormatException: multiple points | ```# linstor err show (multiple points)
``` ERROR REPORT 630C4D06-00000-000066 ============================================================ Application: LINBIT�� LINSTOR Module: Controller Version: 1.19.1 Build ID: a758bf07796c374fd2004465b0d8690209b74356 Build time: 2022-07-27T06:36:54+00:00 Error time: 2022-08-29 14:21:52 Node: piraeus-op-cs-controller-5b8ff7f4c8-lfbk2 ============================================================ Reported error: =============== Category: RuntimeException Class name: NumberFormatException Class canonical name: java.lang.NumberFormatException Generated at: Method 'readJavaFormatString', Source file 'FloatingDecimal.java', Line #1890 Error message: multiple points Call backtrace: Method Native Class:Line number readJavaFormatString N jdk.internal.math.FloatingDecimal:1890 parseDouble N jdk.internal.math.FloatingDecimal:110 parseDouble N java.lang.Double:543 getDouble N java.text.DigitList:169 parse N java.text.DecimalFormat:2126 subParse N java.text.SimpleDateFormat:1931 parse N java.text.SimpleDateFormat:1541 parse N java.text.DateFormat:393# linstor err show (For input string)
``` ERROR REPORT 630C4D06-00000-000067 ============================================================ Application: LINBIT�� LINSTOR Module: Controller Version: 1.19.1 Build ID: a758bf07796c374fd2004465b0d8690209b74356 Build time: 2022-07-27T06:36:54+00:00 Error time: 2022-08-29 14:26:04 Node: piraeus-op-cs-controller-5b8ff7f4c8-lfbk2 ============================================================ Reported error: =============== Category: RuntimeException Class name: NumberFormatException Class canonical name: java.lang.NumberFormatException Generated at: Method 'readJavaFormatString', Source file 'FloatingDecimal.java', Line #2054 Error message: For input string: ".2022022E4.2022022E4" Call backtrace: Method Native Class:Line number readJavaFormatString N jdk.internal.math.FloatingDecimal:2054 parseDouble N jdk.internal.math.FloatingDecimal:110 parseDouble N java.lang.Double:543 getDouble N java.text.DigitList:169 parse N java.text.DecimalFormat:2126 subParse N java.text.SimpleDateFormat:1931 parse N java.text.SimpleDateFormat:1541 parse N java.text.DateFormat:393