piraeusdatastore / piraeus-operator

The Piraeus Operator manages LINSTOR clusters in Kubernetes.
https://piraeus.io/
Apache License 2.0

Unable to deploy piraeus-operator - connection refused from piraeus-op-cs.linstor.svc:3370 #289

Closed · hickersonj closed this issue 9 months ago

hickersonj commented 2 years ago

The csi-nodes, ha-controller and csi-controller are unable to connect to the piraeus-op-cs service.

My helm command is:

helm install -f storage-pools.yaml \
     piraeus-op piraeus-operator/charts/piraeus --create-namespace -n linstor

I changed the storage pools to thin provisioning because the thick pool kept throwing an error about not finding the drbdpool volume group:

operator:
  satelliteSet:
    storagePools:
#      lvmPools:
#      - name: lvm-thick
#        volumeGroup: drbdpool
#        devicePaths:
#        - /dev/loop500
      lvmThinPools:
      - name: lvm-thin
        thinVolume: thinpool
        volumeGroup: ""
        devicePaths:
        - /dev/loop500
#      - /dev/vdd

I do not have entire disks to allocate to the operator, so I'm using loopback devices backed by a file on the existing disk:

dd if=/dev/zero of=/home/drdb.img bs=1G count=2
losetup /dev/loop500 /home/drdb.img

The crashing pod is piraeus-op-ns-node, specifically the drbd-prometheus-exporter container inside it:

k logs piraeus-op-ns-node-4hdzc drbd-prometheus-exporter -n linstor
WARN [drbd_reactor::events] events2: backing device information will be missing!
events2: unrecognized option '--full'
WARN [drbd_reactor::events] process_events2: exit
events2: unrecognized option '--full'
WARN [drbd_reactor::events] process_events2: exit
events2: unrecognized option '--full'
WARN [drbd_reactor::events] process_events2: exit
events2: unrecognized option '--full'
WARN [drbd_reactor::events] process_events2: exit
WARN [drbd_reactor::events] process_events2: could not update statistics
events2: unrecognized option '--full'
WARN [drbd_reactor::events] process_events2: exit
ERROR [drbd_reactor] main: events2 processing failed: events: events2_loop: Restarted events tracking too often, giving up

The pod doesn’t crash after I remove the satelliteSet.monitoringImage:

helm install -f storage-pools.yaml \
     piraeus-op piraeus-operator/charts/piraeus --create-namespace -n linstor \
     --set operator.satelliteSet.monitoringImage=""

However, this only gets me as far as the csi and ha-controller pods hanging in their init phase:

k get pod -n linstor -o wide
NAME                                         READY   STATUS     RESTARTS   AGE     IP               NODE             NOMINATED NODE   READINESS GATES
piraeus-op-operator-fb58bb6b8-l8s8h          1/1     Running    0          10m     192.168.17.53    cluster1-node3   <none>           <none>
piraeus-op-etcd-0                            1/1     Running    0          10m     192.168.17.55    cluster1-node3   <none>           <none>
piraeus-op-csi-node-nzrfh                    0/3     Init:0/1   0          9m46s   192.168.16.38    cluster1-node1   <none>           <none>
piraeus-op-ns-node-szcl7                     1/1     Running    0          9m46s   172.17.0.33      cluster1-node2   <none>           <none>
piraeus-op-csi-node-26fsk                    0/3     Init:0/1   0          9m46s   192.168.16.149   cluster1-node2   <none>           <none>
piraeus-op-cs-controller-55697b7688-wglsj    1/1     Running    0          9m45s   192.168.16.150   cluster1-node2   <none>           <none>
piraeus-op-ns-node-dgqwj                     1/1     Running    0          9m46s   172.17.0.1       cluster1-node3   <none>           <none>
piraeus-op-ns-node-58n65                     1/1     Running    0          6m49s   172.17.0.32      cluster1-node1   <none>           <none>
piraeus-op-csi-controller-687bd7fd44-bj8b6   0/6     Init:0/1   0          5m22s   192.168.16.39    cluster1-node1   <none>           <none>
piraeus-op-csi-node-v86v6                    0/3     Init:0/1   0          5m2s    192.168.17.58    cluster1-node3   <none>           <none>
piraeus-op-ha-controller-67696c96-xmb4v      0/1     Init:0/1   0          99s     192.168.17.63    cluster1-node3   <none>           <none>
k logs piraeus-op-ha-controller-67696c96-xmb4v wait-for-api -n linstor
time="2022-03-08T20:23:20Z" level=info msg="not ready" error="Get \"[http://piraeus-op-cs.linstor.svc:3370/v1/controller/version\](http://piraeus-op-cs.linstor.svc:3370/v1/controller/version/)": dial tcp 192.168.19.228:3370: connect: connection refused" version=refs/tags/v0.1.1
time="2022-03-08T20:23:30Z" level=info msg="not ready" error="Get \"[http://piraeus-op-cs.linstor.svc:3370/v1/controller/version\](http://piraeus-op-cs.linstor.svc:3370/v1/controller/version/)": dial tcp 192.168.19.228:3370: connect: connection refused" version=refs/tags/v0.1.1

k logs piraeus-op-csi-node-nzrfh linstor-wait-node-online -n linstor
time="2022-03-08T20:07:10Z" level=info msg="not ready" error="Get \"[http://piraeus-op-cs.linstor.svc:3370/v1/nodes/cluster1-node1\](http://piraeus-op-cs.linstor.svc:3370/v1/nodes/cluster1-node1/)": dial tcp: lookup piraeus-op-cs.linstor.svc on 192.168.19.10:53: server misbehaving" version=refs/tags/v0.1.1
time="2022-03-08T20:07:20Z" level=info msg="not ready" error="Get \"[http://piraeus-op-cs.linstor.svc:3370/v1/nodes/cluster1-node1\](http://piraeus-op-cs.linstor.svc:3370/v1/nodes/cluster1-node1/)": dial tcp 192.168.19.228:3370: connect: connection refused" version=refs/tags/v0.1.1
time="2022-03-08T20:07:30Z" level=info msg="not ready" error="Get \"[http://piraeus-op-cs.linstor.svc:3370/v1/nodes/cluster1-node1\](http://piraeus-op-cs.linstor.svc:3370/v1/nodes/cluster1-node1/)": dial tcp 192.168.19.228:3370: connect: connection refused" version=refs/tags/v0.1.1

k logs piraeus-op-csi-controller-687bd7fd44-bj8b6 linstor-wait-api-online -n linstor
time="2022-03-08T20:11:34Z" level=info msg="not ready" error="Get \"[http://piraeus-op-cs.linstor.svc:3370/v1/controller/version\](http://piraeus-op-cs.linstor.svc:3370/v1/controller/version/)": dial tcp 192.168.19.228:3370: connect: connection refused" version=refs/tags/v0.1.1
time="2022-03-08T20:11:44Z" level=info msg="not ready" error="Get \"[http://piraeus-op-cs.linstor.svc:3370/v1/controller/version\](http://piraeus-op-cs.linstor.svc:3370/v1/controller/version/)": dial tcp 192.168.19.228:3370: connect: connection refused" version=refs/tags/v0.1.1
WanzenBug commented 2 years ago

      lvmThinPools:
      - name: lvm-thin
        thinVolume: thinpool
        volumeGroup: ""
        devicePaths:
        - /dev/loop500

That won't work; the operator is very picky about which devices it can "format". To work around this, instead of specifying devicePaths:, do the setup yourself on each node:

pvcreate /dev/loop500
vgcreate vg1 /dev/loop500
lvcreate -l 100%FREE --thinpool vg1/thinpool

then use thinVolume: thinpool and volumeGroup: vg1 and no devicePaths.
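
For reference, a minimal sketch of the matching Helm values, assuming the vg1/thinpool names from the commands above (adjust the pool name to taste):

operator:
  satelliteSet:
    storagePools:
      lvmThinPools:
      - name: lvm-thin
        thinVolume: thinpool   # LV created by lvcreate above
        volumeGroup: vg1       # VG created by vgcreate above
        # no devicePaths: device preparation is left to you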

Second issue: I think you are using a very old version of DRBD, which seems to cause errors in the monitoring image. If DRBD is installed on the host, make sure it is up to date (currently 9.1.6). If you used one of the init containers, make sure it didn't load DRBD 8.4, which is sometimes packaged with the host OS. Check cat /proc/drbd to find out which DRBD version you are running.

hickersonj commented 2 years ago

@WanzenBug this is very helpful!

I did get a bit further by creating the PV before seeing your comment. However, it looks like I'll need to upgrade some things on the host per your instructions:

lvcreate -l 100%FREE --thinpool vg1/thinpool
modprobe: FATAL: Module dm-thin-pool not found in directory /lib/modules/5.10.99
  /sbin/modprobe failed: 1
  thin-pool: Required device-mapper target(s) not detected in your kernel.
  Run `lvcreate --help' for more information.

I'll enable DM_THIN_PROVISIONING and upgrade the drbd version.
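
To double-check both prerequisites on a node, something like this should work (a rough sketch; it assumes the kernel config is exposed at /boot/config-$(uname -r) or /proc/config.gz):

# is thin-provisioning support built for this kernel?
grep DM_THIN_PROVISIONING /boot/config-$(uname -r) 2>/dev/null || zgrep DM_THIN_PROVISIONING /proc/config.gz
# does the dm-thin-pool module load?
modprobe dm-thin-pool && lsmod | grep dm_thin_pool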

cat /proc/drbd
version: 8.4.11 (api:1/proto:86-101)

I'll fix those items and see where I get.

colegatron commented 2 years ago

@hickersonj did you manage to get it working?

Also, I am curious whether there is any way to specify in the chart which devices are available on EACH node, since the nodes can be completely different from each other. I have machines with 2x1TB disks, others with a single 512GB NVMe, and others with 4TB HDD disks.

Thanks in advance

hickersonj commented 2 years ago

@colegatron no, I did not get it to work and gave up. It was a bit too complex for our use case.

colegatron commented 2 years ago

In the end, the best solution is not to specify devices in the Linstor configuration at all, for several reasons: it is not easy to handle in the chart (or standard config), and device names can vary depending on the host OS.

I like to have everything automated in three steps: 1) hardware and OS provisioning, 2) Kubernetes cluster provisioning, and 3) Kubernetes services and workloads provisioning. So I moved the preparation of the storage devices to step 1 with Ansible, where I know which kind of storage devices I have on each node, and that is where I set up the LVM volume groups. For example, I have servers with one root partition and one storage partition on an NVMe device, plus two extra SSD disks and two extra HDD disks, so I can create a VG for each type of storage.

Then in the Linstor Helm chart you only need to specify the volumeGroups; the operator will take each VG and configure it for use in the storagePools you specify. Remember, though, that the VGs must not contain any LVs or filesystems before you install Linstor.
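
For illustration, a sketch of what the Helm values might look like with pre-created VGs and no devicePaths (the VG and pool names, vg-ssd/vg-hdd and ssd-pool/hdd-pool, are placeholders for whatever your provisioning step created):

operator:
  satelliteSet:
    storagePools:
      lvmPools:
      - name: ssd-pool
        volumeGroup: vg-ssd    # VG created beforehand, e.g. by Ansible
      - name: hdd-pool
        volumeGroup: vg-hdd
        # no devicePaths: the operator just uses the existing VGs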

Another thing I learned: if you try to reuse storage devices that still have PVCs/LVs created by Linstor, they will clash with the newly created volumes and it will not work. You need to tear them down manually, first at the DRBD level and then at the LVM level.
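
Roughly, the manual cleanup on each affected node looks like this (a sketch; the resource and VG/LV names are placeholders for whatever was left behind):

# bring the leftover DRBD resource down first
drbdadm down <resource-name>
# then remove the backing logical volume(s)
lvremove <vg-name>/<lv-name>
# optionally wipe remaining signatures before reusing the device
wipefs -a /dev/<device>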