piraeusdatastore / piraeus-operator

The Piraeus Operator manages LINSTOR clusters in Kubernetes.
https://piraeus.io/
Apache License 2.0

Storage pool can not be deleted... #264

Open phoenix-bjoern opened 2 years ago

phoenix-bjoern commented 2 years ago

Since the update to 1.7.0, the linstor-controller throws the following error in its log every few seconds: `The specified storage pool 'lvm-thin' on node 'host9' can not be deleted as volumes / snapshot-volumes are still using it.` Even when all containers with DRBD resources are scaled down and no DRBD resource is mounted on the cluster nodes, the message doesn't disappear.

Is this a bug in the upgrade routine, or is it something we have to resolve manually?

Here is the error report:



```
============================================================

Application:                        LINBIT® LINSTOR
Module:                             Controller
Version:                            1.17.0
Build ID:                           7e646d83dbbadf1ec066e1bc8b29ae018aff1f66
Build time:                         2021-12-09T07:27:52+00:00
Error time:                         2022-02-01 18:08:56
Node:                               piraeus-op-cs-controller-6f7f457db-rs2q5
Peer:                               RestClient(10.42.1.105; 'Go-http-client/1.1')

============================================================

Reported error:
===============

Category:                           RuntimeException
Class name:                         ApiRcException
Class canonical name:               com.linbit.linstor.core.apicallhandler.response.ApiRcException
Generated at:                       Method 'deleteStorPoolInTransaction', Source file 'CtrlStorPoolApiCallHandler.java', Line #301

Error message:                      The specified storage pool 'lvm-thin' on node 'host9' can not be deleted as volumes / snapshot-volumes are still using it.

Error context:
    The specified storage pool 'lvm-thin' on node 'host9' can not be deleted as volumes / snapshot-volumes are still using it.

Call backtrace:

    Method                                   Native Class:Line number
    deleteStorPoolInTransaction              N      com.linbit.linstor.core.apicallhandler.controller.CtrlStorPoolApiCallHandler:301
    lambda$deleteStorPool$2                  N      com.linbit.linstor.core.apicallhandler.controller.CtrlStorPoolApiCallHandler:213
    doInScope                                N      com.linbit.linstor.core.apicallhandler.ScopeRunner:147
    lambda$fluxInScope$0                     N      com.linbit.linstor.core.apicallhandler.ScopeRunner:75
    call                                     N      reactor.core.publisher.MonoCallable:91
    trySubscribeScalarMap                    N      reactor.core.publisher.FluxFlatMap:126
    subscribeOrReturn                        N      reactor.core.publisher.MonoFlatMapMany:49
    subscribe                                N      reactor.core.publisher.Flux:8343
    onNext                                   N      reactor.core.publisher.MonoFlatMapMany$FlatMapManyMain:188
    request                                  N      reactor.core.publisher.Operators$ScalarSubscription:2344
    onSubscribe                              N      reactor.core.publisher.MonoFlatMapMany$FlatMapManyMain:134
    subscribe                                N      reactor.core.publisher.MonoCurrentContext:35
    subscribe                                N      reactor.core.publisher.Flux:8357
    onNext                                   N      reactor.core.publisher.MonoFlatMapMany$FlatMapManyMain:188
    request                                  N      reactor.core.publisher.Operators$ScalarSubscription:2344
    onSubscribe                              N      reactor.core.publisher.MonoFlatMapMany$FlatMapManyMain:134
    subscribe                                N      reactor.core.publisher.MonoCurrentContext:35
    subscribe                                N      reactor.core.publisher.Mono:4252
    subscribeWith                            N      reactor.core.publisher.Mono:4363
    subscribe                                N      reactor.core.publisher.Mono:4223
    subscribe                                N      reactor.core.publisher.Mono:4159
    subscribe                                N      reactor.core.publisher.Mono:4131
    doFlux                                   N      com.linbit.linstor.api.rest.v1.RequestHelper:304
    deleteStorPool                           N      com.linbit.linstor.api.rest.v1.StoragePools:330
    invoke                                   N      jdk.internal.reflect.GeneratedMethodAccessor26:unknown
    invoke                                   N      jdk.internal.reflect.DelegatingMethodAccessorImpl:43
    invoke                                   N      java.lang.reflect.Method:566
    lambda$static$0                          N      org.glassfish.jersey.server.model.internal.ResourceMethodInvocationHandlerFactory:52
    run                                      N      org.glassfish.jersey.server.model.internal.AbstractJavaResourceMethodDispatcher$1:124
    invoke                                   N      org.glassfish.jersey.server.model.internal.AbstractJavaResourceMethodDispatcher:167
    doDispatch                               N      org.glassfish.jersey.server.model.internal.JavaResourceMethodDispatcherProvider$VoidOutInvoker:159
    dispatch                                 N      org.glassfish.jersey.server.model.internal.AbstractJavaResourceMethodDispatcher:79
    invoke                                   N      org.glassfish.jersey.server.model.ResourceMethodInvoker:469
    apply                                    N      org.glassfish.jersey.server.model.ResourceMethodInvoker:391
    apply                                    N      org.glassfish.jersey.server.model.ResourceMethodInvoker:80
    run                                      N      org.glassfish.jersey.server.ServerRuntime$1:253
    call                                     N      org.glassfish.jersey.internal.Errors$1:248
    call                                     N      org.glassfish.jersey.internal.Errors$1:244
    process                                  N      org.glassfish.jersey.internal.Errors:292
    process                                  N      org.glassfish.jersey.internal.Errors:274
    process                                  N      org.glassfish.jersey.internal.Errors:244
    runInScope                               N      org.glassfish.jersey.process.internal.RequestScope:265
    process                                  N      org.glassfish.jersey.server.ServerRuntime:232
    handle                                   N      org.glassfish.jersey.server.ApplicationHandler:680
    service                                  N      org.glassfish.jersey.grizzly2.httpserver.GrizzlyHttpContainer:356
    run                                      N      org.glassfish.grizzly.http.server.HttpHandler$1:200
    doWork                                   N      org.glassfish.grizzly.threadpool.AbstractThreadPool$Worker:569
    run                                      N      org.glassfish.grizzly.threadpool.AbstractThreadPool$Worker:549
    run                                      N      java.lang.Thread:829

END OF ERROR REPORT.
```
WanzenBug commented 2 years ago

My question is: why should the storage pool be deleted at all? The only way this is triggered is if `operator.satelliteSet.storagePools` was modified (entries removed).

There are no additional checks happening in the operator, so it just tries to delete the storage pool, even if there are resources or snapshots still present. Note that this does include simple replicas, even if the drbd device is not actively mounted.
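
For illustration: assuming the Helm values under `operator.satelliteSet.storagePools` mirror the CRD spec shown further down in this thread, this is the kind of state that triggers the repeated deletion attempts (a sketch, not taken from the original comment):

```
# Sketch only: assuming the chart values under operator.satelliteSet.storagePools
# mirror the CRD spec shown further down in this thread. If the section is
# dropped or emptied like this on `helm upgrade`, the operator's desired pool
# list is empty, so it keeps retrying the deletion of the existing 'lvm-thin' pool.
operator:
  satelliteSet:
    storagePools:
      lvmPools: []
      lvmThinPools: []
      zfsPools: []
```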

So my advice would be:

phoenix-bjoern commented 2 years ago

Thanks @WanzenBug for the fast reply. Actually, neither the storage pools nor the list of nodes has changed, and these errors occur on almost all of our clusters that have been updated to Piraeus 1.7.

Here are screenshots of the `linstor` command output (attached: Bildschirmfoto 2022-02-02 um 11 30 54, Bildschirmfoto 2022-02-02 um 11 30 45).

The nodes and storage pools look good, and the storage pool is also referenced correctly by the resources.

I've checked the output of `kubectl get LinstorSatelliteSet.piraeus.linbit.com piraeus-op-ns`. Is it maybe a problem that the storage pools haven't been declared in the values file for the new CRD configuration? In the output the storagePools are empty (because we didn't declare them on the helm upgrade), but the SatelliteStatus of course lists the storage pools that were created on the first LINSTOR deployment on the cluster:


```
  sslSecret: null
  storagePools:
    lvmPools: []
    lvmThinPools: []
    zfsPools: []
  tolerations: []
status:
  SatelliteStatuses:
  - connectionStatus: ONLINE
    nodeName: de-fra-node10
    registeredOnController: true
    storagePoolStatus:
    - freeCapacity: 9223372036854775807
      name: DfltDisklessStorPool
      nodeName: de-fra-node10
      provider: DISKLESS
      totalCapacity: 9223372036854775807
    - freeCapacity: 1132126535
      name: lvm-thin
      nodeName: de-fra-node10
      provider: LVM_THIN
      totalCapacity: 1677721600
  - connectionStatus: ONLINE
    nodeName: de-fra-node8
    registeredOnController: true
    storagePoolStatus:
    - freeCapacity: 9223372036854775807
      name: DfltDisklessStorPool
      nodeName: de-fra-node8
      provider: DISKLESS
      totalCapacity: 9223372036854775807
    - freeCapacity: 1132126535
      name: lvm-thin
      nodeName: de-fra-node8
      provider: LVM_THIN
      totalCapacity: 1677721600
  - connectionStatus: ONLINE
    nodeName: de-fra-node9
    registeredOnController: true
    storagePoolStatus:
    - freeCapacity: 9223372036854775807
      name: DfltDisklessStorPool
      nodeName: de-fra-node9
      provider: DISKLESS
      totalCapacity: 9223372036854775807
    - freeCapacity: 1132126535
      name: lvm-thin
      nodeName: de-fra-node9
      provider: LVM_THIN
      totalCapacity: 1677721600
  errors:
  - "Message: 'The specified storage pool 'lvm-thin' on node 'de-fra-node9' can not
    be deleted as volumes / snapshot-volumes are still using it.'; Details: 'Volumes
    / snapshot-volumes that are still using the storage pool: \n   Node name: 'de-fra-node9',
    resource name: 'pvc-1e59589f-e04e-4aee-a1c6-0561a764a7e8', volume number: 0\n
    \  Node name: 'de-fra-node9', resource name: 'pvc-2c8ea040-9651-4501-a0d9-7b3920c82ec8',
    volume number: 0\n   Node name: 'de-fra-node9', resource name: 'pvc-4506c792-6fbb-43b6-a6ca-745084259e0d',
    volume number: 0\n   Node name: 'de-fra-node9', resource name: 'pvc-4e6cf7f8-2a42-4129-9a9b-cba310b3ed9e',
    volume number: 0\n   Node name: 'de-fra-node9', resource name: 'pvc-7afac331-4319-4ca7-b587-1ec267dc63b8',
    volume number: 0\n   Node name: 'de-fra-node9', resource name: 'pvc-7c07be83-43aa-44af-b9c6-c600339ad6a8',
    volume number: 0\n   Node name: 'de-fra-node9', resource name: 'pvc-8366a898-20ca-4f71-abee-cc3c40ff8bf1',
    volume number: 0\n   Node name: 'de-fra-node9', resource name: 'pvc-a2bff230-7ca6-4e93-a273-15d217967def',
    volume number: 0\n   Node name: 'de-fra-node9', resource name: 'pvc-b1cb2468-4c1b-4872-8040-9e9860c45c76',
    volume number: 0\n   Node name: 'de-fra-node9', resource name: 'pvc-cabb808d-d90b-4c59-831f-a3f08420effc',
    volume number: 0\nNode: de-fra-node9, Storage pool name: lvm-thin'; Correction:
    'Delete the listed volumes and snapshot-volumes first.'; Reports: '[61F976B7-00000-071670]'"
  - "Message: 'The specified storage pool 'lvm-thin' on node 'de-fra-node8' can not
    be deleted as volumes / snapshot-volumes are still using it.'; Details: 'Volumes
    / snapshot-volumes that are still using the storage pool: \n   Node name: 'de-fra-node8',
    resource name: 'pvc-1e59589f-e04e-4aee-a1c6-0561a764a7e8', volume number: 0\n
    \  Node name: 'de-fra-node8', resource name: 'pvc-2c8ea040-9651-4501-a0d9-7b3920c82ec8',
    volume number: 0\n   Node name: 'de-fra-node8', resource name: 'pvc-4506c792-6fbb-43b6-a6ca-745084259e0d',
    volume number: 0\n   Node name: 'de-fra-node8', resource name: 'pvc-4e6cf7f8-2a42-4129-9a9b-cba310b3ed9e',
    volume number: 0\n   Node name: 'de-fra-node8', resource name: 'pvc-7afac331-4319-4ca7-b587-1ec267dc63b8',
    volume number: 0\n   Node name: 'de-fra-node8', resource name: 'pvc-7c07be83-43aa-44af-b9c6-c600339ad6a8',
    volume number: 0\n   Node name: 'de-fra-node8', resource name: 'pvc-8366a898-20ca-4f71-abee-cc3c40ff8bf1',
    volume number: 0\n   Node name: 'de-fra-node8', resource name: 'pvc-a2bff230-7ca6-4e93-a273-15d217967def',
    volume number: 0\n   Node name: 'de-fra-node8', resource name: 'pvc-b1cb2468-4c1b-4872-8040-9e9860c45c76',
    volume number: 0\n   Node name: 'de-fra-node8', resource name: 'pvc-cabb808d-d90b-4c59-831f-a3f08420effc',
    volume number: 0\nNode: de-fra-node8, Storage pool name: lvm-thin'; Correction:
    'Delete the listed volumes and snapshot-volumes first.'; Reports: '[61F976B7-00000-071671]'"
  - "Message: 'The specified storage pool 'lvm-thin' on node 'de-fra-node10' can not
    be deleted as volumes / snapshot-volumes are still using it.'; Details: 'Volumes
    / snapshot-volumes that are still using the storage pool: \n   Node name: 'de-fra-node10',
    resource name: 'pvc-1e59589f-e04e-4aee-a1c6-0561a764a7e8', volume number: 0\n
    \  Node name: 'de-fra-node10', resource name: 'pvc-2c8ea040-9651-4501-a0d9-7b3920c82ec8',
    volume number: 0\n   Node name: 'de-fra-node10', resource name: 'pvc-4506c792-6fbb-43b6-a6ca-745084259e0d',
    volume number: 0\n   Node name: 'de-fra-node10', resource name: 'pvc-4e6cf7f8-2a42-4129-9a9b-cba310b3ed9e',
    volume number: 0\n   Node name: 'de-fra-node10', resource name: 'pvc-7afac331-4319-4ca7-b587-1ec267dc63b8',
    volume number: 0\n   Node name: 'de-fra-node10', resource name: 'pvc-7c07be83-43aa-44af-b9c6-c600339ad6a8',
    volume number: 0\n   Node name: 'de-fra-node10', resource name: 'pvc-8366a898-20ca-4f71-abee-cc3c40ff8bf1',
    volume number: 0\n   Node name: 'de-fra-node10', resource name: 'pvc-a2bff230-7ca6-4e93-a273-15d217967def',
    volume number: 0\n   Node name: 'de-fra-node10', resource name: 'pvc-b1cb2468-4c1b-4872-8040-9e9860c45c76',
    volume number: 0\n   Node name: 'de-fra-node10', resource name: 'pvc-cabb808d-d90b-4c59-831f-a3f08420effc',
    volume number: 0\nNode: de-fra-node10, Storage pool name: lvm-thin'; Correction:
    'Delete the listed volumes and snapshot-volumes first.'; Reports: '[61F976B7-00000-071672]'"
```
WanzenBug commented 2 years ago

> Is it maybe a problem that the storage pools haven't been declared in the values file for the new CRD configuration?

Yes. If they are not set on helm upgrade, helm just removes them, and then the operator tries to delete the storage pool in a loop. It's a bit cumbersome, I know, but that's what helm does :shrug:

So you should edit the LinstorSatelliteSet to say:

```
spec:
  storagePools:
    lvmThinPools:
    - name: lvm-thin
      volumeGroup: vg-pool
      thinVolume: disk-redundant
```

And also save that to your helm overrides for the next upgrade.
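
For reference, a minimal sketch of such a Helm override, reusing the chart path mentioned earlier (`operator.satelliteSet.storagePools`) and the pool definition above; the in-cluster object itself can be edited directly with `kubectl edit LinstorSatelliteSet.piraeus.linbit.com piraeus-op-ns`, matching the `kubectl get` command quoted earlier:

```
# Sketch of a values override to keep for future `helm upgrade` runs, assuming
# the chart path operator.satelliteSet.storagePools mentioned earlier;
# volumeGroup and thinVolume must match the actual LVM setup on the satellite nodes.
operator:
  satelliteSet:
    storagePools:
      lvmThinPools:
      - name: lvm-thin
        volumeGroup: vg-pool
        thinVolume: disk-redundant
```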

phoenix-bjoern commented 2 years ago

Awesome, thanks for your help @WanzenBug, that actually resolved the problem. Our expectation was that the new CRD would get merged with the existing information in etcd, so we skipped the setting in the values.yaml. Maybe this should be added to the upgrade documentation.