piraeusdatastore / linstor-csi

CSI plugin for LINSTOR
Apache License 2.0
109 stars 27 forks source link

NumberFormatExceptions followed by "snapshot name is in use for volume ..." #174

Open blampe opened 2 years ago

blampe commented 2 years ago

I'm trying to setup snapshot shipping via CSI (https://github.com/piraeusdatastore/linstor-csi/pull/142) without much success. Most recently I had a flood of errors around parse failures, but there's not enough detail to figure out what it's failing to parse. I thought maybe it had something to do with my VolumeClassName ("linstor.minio") having a period in the name maybe throwing off parsing, but I was able to create another backup (from a significantly smaller volume) using the same class. This all might be a red herring...

It seems after these NumberFormatExceptions happen the system gets into a mixed up state as you can see below -- namely it tries to re-assign an existing snapshot name to a different PVC.

apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshot
metadata:
  creationTimestamp: "2022-08-29T08:31:01Z"
  finalizers:
  - snapshot.storage.kubernetes.io/volumesnapshot-as-source-protection
  - snapshot.storage.kubernetes.io/volumesnapshot-bound-protection
  generateName: velero-vmsingle-vm-victoria-metrics-k8s-stack-
  generation: 1
  labels:
    velero.io/backup-name: linstor-20220829080004
  name: velero-vmsingle-vm-victoria-metrics-k8s-stack-844m6
  namespace: prometheus
  resourceVersion: "56560018"
  uid: 17660871-6cf4-4d07-8c11-8c9f5922625a
spec:
  source:
    persistentVolumeClaimName: vmsingle-vm-victoria-metrics-k8s-stack
  volumeSnapshotClassName: linstor.minio
status:
  boundVolumeSnapshotContentName: snapcontent-17660871-6cf4-4d07-8c11-8c9f5922625a
  error:
    message: 'Failed to check and update snapshot content: failed to take snapshot
      of the volume pvc-7ca5e665-9bd1-4f9d-8209-92dcb8474edf: "rpc error: code = AlreadyExists
      desc = can''t use \"snapshot-17660871-6cf4-4d07-8c11-8c9f5922625a\" for snapshot
      name for volume \"pvc-7ca5e665-9bd1-4f9d-8209-92dcb8474edf\", snapshot name
      is in use for volume \"pvc-157c3fce-5720-4ca0-ba11-e2cab638ea93\""'
    time: "2022-08-29T14:58:59Z"
  readyToUse: false

pvc-157c3fce-5720-4ca0-ba11-e2cab638ea93 is indeed another volume that was snapshotted around the same time as this one. Is there perhaps a race condition where creating multiple snapshots around the same time can cause names to not be unique?

# linstor err l ``` | 630C4D06-00000-000045 | 2022-08-29 13:29:19 | C|piraeus-op-cs-controller-5b8ff7f4c8-lfbk2 | NumberFormatException: multiple points | | 630C4D06-00000-000046 | 2022-08-29 13:29:19 | C|piraeus-op-cs-controller-5b8ff7f4c8-lfbk2 | NumberFormatException: multiple points | | 630C4D06-00000-000047 | 2022-08-29 13:30:30 | C|piraeus-op-cs-controller-5b8ff7f4c8-lfbk2 | NumberFormatException: multiple points | | 630C4D06-00000-000048 | 2022-08-29 13:30:30 | C|piraeus-op-cs-controller-5b8ff7f4c8-lfbk2 | NumberFormatException: multiple points | | 630C4D06-00000-000049 | 2022-08-29 13:31:19 | C|piraeus-op-cs-controller-5b8ff7f4c8-lfbk2 | NumberFormatException: multiple points | | 630C4D06-00000-000050 | 2022-08-29 13:31:19 | C|piraeus-op-cs-controller-5b8ff7f4c8-lfbk2 | NumberFormatException: multiple points | | 630C4D06-00000-000051 | 2022-08-29 13:38:20 | C|piraeus-op-cs-controller-5b8ff7f4c8-lfbk2 | NumberFormatException: multiple points | | 630C4D06-00000-000052 | 2022-08-29 13:38:20 | C|piraeus-op-cs-controller-5b8ff7f4c8-lfbk2 | NumberFormatException: multiple points | | 630C4D06-00000-000053 | 2022-08-29 13:53:45 | C|piraeus-op-cs-controller-5b8ff7f4c8-lfbk2 | NumberFormatException: multiple points | | 630C4D06-00000-000054 | 2022-08-29 13:53:45 | C|piraeus-op-cs-controller-5b8ff7f4c8-lfbk2 | NumberFormatException: multiple points | | 630C4D06-00000-000055 | 2022-08-29 14:00:40 | C|piraeus-op-cs-controller-5b8ff7f4c8-lfbk2 | NumberFormatException: multiple points | | 630C4D06-00000-000056 | 2022-08-29 14:00:40 | C|piraeus-op-cs-controller-5b8ff7f4c8-lfbk2 | NumberFormatException: multiple points | | 630C4D06-00000-000057 | 2022-08-29 14:04:54 | C|piraeus-op-cs-controller-5b8ff7f4c8-lfbk2 | NumberFormatException: multiple points | | 630C4D06-00000-000058 | 2022-08-29 14:04:54 | C|piraeus-op-cs-controller-5b8ff7f4c8-lfbk2 | NumberFormatException: multiple points | | 630C4D06-00000-000059 | 2022-08-29 14:08:40 | C|piraeus-op-cs-controller-5b8ff7f4c8-lfbk2 | NumberFormatException: multiple points | | 630C4D06-00000-000060 | 2022-08-29 14:08:40 | C|piraeus-op-cs-controller-5b8ff7f4c8-lfbk2 | NumberFormatException: multiple points | | 630C4D06-00000-000061 | 2022-08-29 14:08:41 | C|piraeus-op-cs-controller-5b8ff7f4c8-lfbk2 | NumberFormatException: multiple points | | 630C4D06-00000-000062 | 2022-08-29 14:08:41 | C|piraeus-op-cs-controller-5b8ff7f4c8-lfbk2 | NumberFormatException: multiple points | | 630C4D06-00000-000063 | 2022-08-29 14:16:47 | C|piraeus-op-cs-controller-5b8ff7f4c8-lfbk2 | NumberFormatException: multiple points | | 630C4D06-00000-000064 | 2022-08-29 14:16:47 | C|piraeus-op-cs-controller-5b8ff7f4c8-lfbk2 | NumberFormatException: multiple points | | 630C4D06-00000-000065 | 2022-08-29 14:21:52 | C|piraeus-op-cs-controller-5b8ff7f4c8-lfbk2 | NumberFormatException: multiple points | | 630C4D06-00000-000066 | 2022-08-29 14:21:52 | C|piraeus-op-cs-controller-5b8ff7f4c8-lfbk2 | NumberFormatException: multiple points | | 630C4D06-00000-000067 | 2022-08-29 14:26:04 | C|piraeus-op-cs-controller-5b8ff7f4c8-lfbk2 | NumberFormatException: For input string: ".2022022E4.2022022E4" | | 630C4D06-00000-000068 | 2022-08-29 14:28:44 | C|piraeus-op-cs-controller-5b8ff7f4c8-lfbk2 | NumberFormatException: multiple points | | 630C4D06-00000-000069 | 2022-08-29 14:28:44 | C|piraeus-op-cs-controller-5b8ff7f4c8-lfbk2 | NumberFormatException: multiple points | | 630C4D06-00000-000070 | 2022-08-29 14:30:09 | C|piraeus-op-cs-controller-5b8ff7f4c8-lfbk2 | NumberFormatException: multiple points | | 630C4D06-00000-000071 | 2022-08-29 14:30:09 | C|piraeus-op-cs-controller-5b8ff7f4c8-lfbk2 | NumberFormatException: multiple points | | 630C4D06-00000-000072 | 2022-08-29 14:31:58 | C|piraeus-op-cs-controller-5b8ff7f4c8-lfbk2 | NumberFormatException: multiple points | | 630C4D06-00000-000073 | 2022-08-29 14:31:58 | C|piraeus-op-cs-controller-5b8ff7f4c8-lfbk2 | NumberFormatException: multiple points | | 630C4D06-00000-000074 | 2022-08-29 14:44:20 | C|piraeus-op-cs-controller-5b8ff7f4c8-lfbk2 | NumberFormatException: multiple points | | 630C4D06-00000-000075 | 2022-08-29 14:44:20 | C|piraeus-op-cs-controller-5b8ff7f4c8-lfbk2 | NumberFormatException: multiple points | ```
# linstor err show (multiple points) ``` ERROR REPORT 630C4D06-00000-000066 ============================================================ Application: LINBIT�� LINSTOR Module: Controller Version: 1.19.1 Build ID: a758bf07796c374fd2004465b0d8690209b74356 Build time: 2022-07-27T06:36:54+00:00 Error time: 2022-08-29 14:21:52 Node: piraeus-op-cs-controller-5b8ff7f4c8-lfbk2 ============================================================ Reported error: =============== Category: RuntimeException Class name: NumberFormatException Class canonical name: java.lang.NumberFormatException Generated at: Method 'readJavaFormatString', Source file 'FloatingDecimal.java', Line #1890 Error message: multiple points Call backtrace: Method Native Class:Line number readJavaFormatString N jdk.internal.math.FloatingDecimal:1890 parseDouble N jdk.internal.math.FloatingDecimal:110 parseDouble N java.lang.Double:543 getDouble N java.text.DigitList:169 parse N java.text.DecimalFormat:2126 subParse N java.text.SimpleDateFormat:1931 parse N java.text.SimpleDateFormat:1541 parse N java.text.DateFormat:393 N com.linbit.linstor.backupshipping.S3MetafileNameInfo:46 listBackups N com.linbit.linstor.core.apicallhandler.controller.backup.CtrlBackupApiCallHandler:679 lambda$listBackups$0 N com.linbit.linstor.api.rest.v1.Backups:304 doInScope N com.linbit.linstor.api.rest.v1.RequestHelper:231 doInScope N com.linbit.linstor.api.rest.v1.RequestHelper:203 listBackups N com.linbit.linstor.api.rest.v1.Backups:298 invoke N jdk.internal.reflect.GeneratedMethodAccessor194:unknown invoke N jdk.internal.reflect.DelegatingMethodAccessorImpl:43 invoke N java.lang.reflect.Method:566 lambda$static$0 N org.glassfish.jersey.server.model.internal.ResourceMethodInvocationHandlerFactory:52 run N org.glassfish.jersey.server.model.internal.AbstractJavaResourceMethodDispatcher$1:124 invoke N org.glassfish.jersey.server.model.internal.AbstractJavaResourceMethodDispatcher:167 doDispatch N org.glassfish.jersey.server.model.internal.JavaResourceMethodDispatcherProvider$ResponseOutInvoker:176 dispatch N org.glassfish.jersey.server.model.internal.AbstractJavaResourceMethodDispatcher:79 invoke N org.glassfish.jersey.server.model.ResourceMethodInvoker:469 apply N org.glassfish.jersey.server.model.ResourceMethodInvoker:391 apply N org.glassfish.jersey.server.model.ResourceMethodInvoker:80 run N org.glassfish.jersey.server.ServerRuntime$1:253 call N org.glassfish.jersey.internal.Errors$1:248 call N org.glassfish.jersey.internal.Errors$1:244 process N org.glassfish.jersey.internal.Errors:292 process N org.glassfish.jersey.internal.Errors:274 process N org.glassfish.jersey.internal.Errors:244 runInScope N org.glassfish.jersey.process.internal.RequestScope:265 process N org.glassfish.jersey.server.ServerRuntime:232 handle N org.glassfish.jersey.server.ApplicationHandler:680 service N org.glassfish.jersey.grizzly2.httpserver.GrizzlyHttpContainer:356 run N org.glassfish.grizzly.http.server.HttpHandler$1:200 doWork N org.glassfish.grizzly.threadpool.AbstractThreadPool$Worker:569 run N org.glassfish.grizzly.threadpool.AbstractThreadPool$Worker:549 run N java.lang.Thread:829 END OF ERROR REPORT. ```
# linstor err show (For input string) ``` ERROR REPORT 630C4D06-00000-000067 ============================================================ Application: LINBIT�� LINSTOR Module: Controller Version: 1.19.1 Build ID: a758bf07796c374fd2004465b0d8690209b74356 Build time: 2022-07-27T06:36:54+00:00 Error time: 2022-08-29 14:26:04 Node: piraeus-op-cs-controller-5b8ff7f4c8-lfbk2 ============================================================ Reported error: =============== Category: RuntimeException Class name: NumberFormatException Class canonical name: java.lang.NumberFormatException Generated at: Method 'readJavaFormatString', Source file 'FloatingDecimal.java', Line #2054 Error message: For input string: ".2022022E4.2022022E4" Call backtrace: Method Native Class:Line number readJavaFormatString N jdk.internal.math.FloatingDecimal:2054 parseDouble N jdk.internal.math.FloatingDecimal:110 parseDouble N java.lang.Double:543 getDouble N java.text.DigitList:169 parse N java.text.DecimalFormat:2126 subParse N java.text.SimpleDateFormat:1931 parse N java.text.SimpleDateFormat:1541 parse N java.text.DateFormat:393 N com.linbit.linstor.backupshipping.S3MetafileNameInfo:46 listBackups N com.linbit.linstor.core.apicallhandler.controller.backup.CtrlBackupApiCallHandler:679 lambda$listBackups$0 N com.linbit.linstor.api.rest.v1.Backups:304 doInScope N com.linbit.linstor.api.rest.v1.RequestHelper:231 doInScope N com.linbit.linstor.api.rest.v1.RequestHelper:203 listBackups N com.linbit.linstor.api.rest.v1.Backups:298 invoke N jdk.internal.reflect.GeneratedMethodAccessor194:unknown invoke N jdk.internal.reflect.DelegatingMethodAccessorImpl:43 invoke N java.lang.reflect.Method:566 lambda$static$0 N org.glassfish.jersey.server.model.internal.ResourceMethodInvocationHandlerFactory:52 run N org.glassfish.jersey.server.model.internal.AbstractJavaResourceMethodDispatcher$1:124 invoke N org.glassfish.jersey.server.model.internal.AbstractJavaResourceMethodDispatcher:167 doDispatch N org.glassfish.jersey.server.model.internal.JavaResourceMethodDispatcherProvider$ResponseOutInvoker:176 dispatch N org.glassfish.jersey.server.model.internal.AbstractJavaResourceMethodDispatcher:79 invoke N org.glassfish.jersey.server.model.ResourceMethodInvoker:469 apply N org.glassfish.jersey.server.model.ResourceMethodInvoker:391 apply N org.glassfish.jersey.server.model.ResourceMethodInvoker:80 run N org.glassfish.jersey.server.ServerRuntime$1:253 call N org.glassfish.jersey.internal.Errors$1:248 call N org.glassfish.jersey.internal.Errors$1:244 process N org.glassfish.jersey.internal.Errors:292 process N org.glassfish.jersey.internal.Errors:274 process N org.glassfish.jersey.internal.Errors:244 runInScope N org.glassfish.jersey.process.internal.RequestScope:265 process N org.glassfish.jersey.server.ServerRuntime:232 handle N org.glassfish.jersey.server.ApplicationHandler:680 service N org.glassfish.jersey.grizzly2.httpserver.GrizzlyHttpContainer:356 run N org.glassfish.grizzly.http.server.HttpHandler$1:200 doWork N org.glassfish.grizzly.threadpool.AbstractThreadPool$Worker:569 run N org.glassfish.grizzly.threadpool.AbstractThreadPool$Worker:549 run N java.lang.Thread:829 END OF ERROR REPORT. ```
WanzenBug commented 2 years ago

Weird. Since the error seems to be in listBackups (which gets called by the CSI driver on creation) I assume it's not actually about any snapshot it tries to create, but about what files already exist in the bucket. This looks like an issue that needs to be fixed upstream: https://github.com/LINBIT/linstor-server/issues

blampe commented 2 years ago

Running with --log-level=debug shed a little more light on this...

time="2022-09-02T02:10:03Z" level=debug msg="no snapshot matching id found, trying backups" func="github.com/piraeusdatastore/linstor-csi/pkg/client.(*Linstor).snapOrBackupById" file="/src/pkg/client/linstor.go:1367" id=snapshot-6b4f6adb-7a82-4d2f-a72c-2bc987d70dba linstorCSIComponent=client
time="2022-09-02T02:10:03Z" level=debug msg="curl -X 'GET' -H 'Accept: application/json' 'https://piraeus-op-cs.piraeus.svc:3371/v1/remotes/s3'"
time="2022-09-02T02:10:03Z" level=debug msg="listing backups in remote" func="github.com/piraeusdatastore/linstor-csi/pkg/client.(*Linstor).snapOrBackupById" file="/src/pkg/client/linstor.go:1377" id=snapshot-6b4f6adb-7a82-4d2f-a72c-2bc987d70dba linstorCSIComponent=client remote=minio
time="2022-09-02T02:10:03Z" level=debug msg="curl -X 'GET' -H 'Accept: application/json' 'https://piraeus-op-cs.piraeus.svc:3371/v1/remotes/minio/backups?snap_name=snapshot-6b4f6adb-7a82-4d2f-a72c-2bc987d70dba'"

time="2022-09-02T02:10:03Z" level=debug msg="found existing snapshot" func=github.com/piraeusdatastore/linstor-csi/pkg/driver.Driver.CreateSnapshot file="/src/pkg/driver/driver.go:965" existingSnapshot="snapshot_id:\"snapshot-dc295533-ba78-4e3f-b110-54175f1386f1\" source_volume_id:\"pvc-621a06d8-0503-4cf3-84af-d75e1bf40508\" creation_time:<seconds:1662034561 > " linstorCSIComponent=driver nodeID=pi4c provisioner=linstor.csi.linbit.com requestedSnapshotName=snapshot-1a9289d2-b984-4527-baf4-522b42908e0e requestedSnapshotSourceVolume=pvc-d2ec0c0d-f114-4fe2-9c30-a56279e356b6 version=v0.20.0-d514e41db7cdcb580769cc69f1c1ef2b8a5def5d

That last line is interesting... It queries for a backup with snap_name=snapshot-6b4f6adb-7a82-4d2f-a72c-2bc987d70dba and it gets a result... but that result is a totally different one (snapshot-dc295533-ba78-4e3f-b110-54175f1386f1) than the one it asked for?

Furthermore the snapshot returned is unrecoverable. That should at least unblock me (I can delete it), although I'm surprised it wasn't deleted/cleaned up automatically (it was totally hidden in the CLI, which is as good as deleted, anyway).

root@piraeus-op-cs-controller-5b8ff7f4c8-nqns5:/# linstor backup l minio -s snapshot-6b4f6adb-7a82-4d2f-a72c-2bc987d70dba -p
+------------------------------------------------------------------------------------------------------------------------------------+
| Resource                                 | Snapshot                                      | Finished at | Based On | Status         |
|====================================================================================================================================|
| pvc-621a06d8-0503-4cf3-84af-d75e1bf40508 | snapshot-dc295533-ba78-4e3f-b110-54175f1386f1 |             |          | Not restorable |
+------------------------------------------------------------------------------------------------------------------------------------+

What's still confusing me (and where there's maybe room for the CSI to improve) is why this UUID keeps getting recycled for so many different volumes. I know very little about CSI, but naively it seems like each snapshot request should have its own UUID, making collisions like this impossible. The only way I can see that happening is if CompatibleSnapshotId gets the same invalid name for multiple snapshots... some of my PVCs do have identical names, if that's relevant? Is there any downside to adding entropy to the fallback ID that's generated?

blampe commented 2 years ago

@WanzenBug take a look at https://github.com/LINBIT/linstor-server/pull/314 -- the presence of any backups in the cluster completely breaks snapshotting.