piraeusdatastore / piraeus

High Available Datastore for Kubernetes
https://piraeus.io/
Apache License 2.0

Linstor Cluster on Arm64 Talos Linux is not working. #194

Closed: OrvilleQ closed 3 weeks ago

OrvilleQ commented 3 weeks ago

Hello there. I'm new to Piraeus and Linstor, and I was trying to install the Piraeus operator on an arm64 cluster I created on Hetzner (only 1 node for now) using Talos Linux, but it's not working.

I was using Talos Linux v1.8.0 with Kubernetes v1.31.1. First, I deployed the operator using Kustomize with no issues at all.

[orville@WindDragon piraeus-operator]$ talosctl  --talosconfig /home/orville/Cell/talos/clusterconfig/talosconfig read /proc/modules
drbd_transport_tcp 28672 - - Live 0xffffc6e446f98000 (O)
drbd 720896 - - Live 0xffffc6e446eca000 (O)
[orville@WindDragon piraeus-operator]$ talosctl  --talosconfig /home/orville/Cell/talos/clusterconfig/talosconfig read /sys/module/drbd/parameters/usermode_helper
disabled

apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - https://github.com/piraeusdatastore/piraeus-operator//config/default?ref=v2.6.0

Then I deployed a LinstorSatelliteConfiguration copied from this document, without changes.

Finally, I deployed a "blank" LinstorCluster YAML following the getting started tutorial.

Things started to get strange from here. The linstor-satellite kept going into an Error state and entered CrashLoopBackOff, while the linstor-controller had already entered CrashLoopBackOff during the Init phase.

piraeus-datastore   ha-controller-ccx4s                                   1/1     Running                 0                41m
piraeus-datastore   linstor-controller-d5bd77df7-6qqgc                    0/1     Init:CrashLoopBackOff   12 (4m53s ago)   41m
piraeus-datastore   linstor-csi-controller-c6f6cb4ff-nvz79                0/7     Init:0/1                0                41m
piraeus-datastore   linstor-csi-node-rc258                                0/3     Init:0/1                0                41m
piraeus-datastore   linstor-satellite.fsn00-4kmrr                         0/2     Error                   2 (2s ago)       3s
piraeus-datastore   piraeus-operator-controller-manager-59cc6f54c-59vfl   1/1     Running                 0                42m
piraeus-datastore   piraeus-operator-gencert-6fbdbf68f8-9849t             1/1     Running                 0                42m

I tried to check the logs of linstor-satellite.

[orville@WindDragon piraeus-operator]$ kubectl logs -n piraeus-datastore linstor-satellite.fsn00-4kmrr
time="2024-09-28T12:52:32Z" level=info msg="running k8s-await-election" version=refs/tags/v0.4.1
time="2024-09-28T12:52:32Z" level=fatal msg="Failed to execve()" error="exec format error"

After [searching](https://github.com/search?q=org%3ALINBIT%20execve()&type=code), ~it seems like this execve() is something related to systemd, which should be entirely disabled by the LinstorSatelliteConfiguration I applied.~

I'm not sure what's happening with the Linstor cluster or where I should start debugging this. Please help me with this issue.

Thank you.

WanzenBug commented 3 weeks ago

It looks like something went wrong when we switched our image builds in https://github.com/piraeusdatastore/piraeus/commit/547bb823f7b0b6e2cb4e3b899e89718abf009639

I've reverted the tags to point at the old manifests, so that should solve the immediate issue once you clean up the old images/reprovision.
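For readers checking whether a tag offers their node's architecture: a multi-arch tag resolves to a manifest list (or OCI image index) whose entries each carry a `platform` object. The sketch below is only an illustration and not part of the Piraeus tooling. It assumes you already have the index JSON as a string, for example from `docker manifest inspect <image>`. The sample index, its digests, and the function names are hypothetical. The thread does not say what exactly was wrong with the new manifests, and this check only confirms that an `arm64` entry is listed, not that the image behind it was built correctly.

```python
import json


def listed_platforms(index_json: str) -> list[tuple[str, str]]:
    """Return (os, architecture) pairs listed in a manifest list / OCI image index."""
    index = json.loads(index_json)
    pairs = []
    for entry in index.get("manifests", []):
        platform = entry.get("platform", {})
        pairs.append(
            (platform.get("os", "unknown"), platform.get("architecture", "unknown"))
        )
    return pairs


def has_platform(index_json: str, os_name: str, arch: str) -> bool:
    """True if the index lists an entry for the given OS and architecture."""
    return (os_name, arch) in listed_platforms(index_json)


# Hypothetical index with placeholder digests, for illustration only.
SAMPLE_INDEX = json.dumps(
    {
        "schemaVersion": 2,
        "mediaType": "application/vnd.docker.distribution.manifest.list.v2+json",
        "manifests": [
            {
                "digest": "sha256:placeholder-amd64",
                "platform": {"os": "linux", "architecture": "amd64"},
            },
            {
                "digest": "sha256:placeholder-arm64",
                "platform": {"os": "linux", "architecture": "arm64"},
            },
        ],
    }
)

if __name__ == "__main__":
    print(listed_platforms(SAMPLE_INDEX))
    print(has_platform(SAMPLE_INDEX, "linux", "arm64"))
```

Because an entry can be labeled `arm64` while pointing at a binary built for another architecture, a listing check like this is a first step rather than proof that the image will run.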

WanzenBug commented 3 weeks ago

Fixed by 4f1af0f293a99ac1efdcce0e2dfd76bc4fbe84c2
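As background on the original log line: on Linux, `exec format error` is the message for `ENOEXEC`, which `execve()` commonly returns when the binary was built for a different CPU architecture than the node. One way to check a binary pulled out of an image is to read the `e_machine` field of its ELF header. The following is a minimal sketch under that assumption. The helper names are hypothetical, the machine table covers only a few common values, and `fake_header` exists only so the example runs without a real binary.

```python
import struct

# A few e_machine values from the ELF specification (not exhaustive).
ELF_MACHINES = {3: "x86", 40: "arm", 62: "x86_64", 183: "aarch64", 243: "riscv"}


def elf_machine(header: bytes) -> str:
    """Return the architecture name encoded in the first 20 bytes of an ELF file."""
    if len(header) < 20 or header[:4] != b"\x7fELF":
        raise ValueError("not an ELF file")
    # Byte 5 (EI_DATA) gives the byte order: 1 = little-endian, 2 = big-endian.
    byte_order = "<" if header[5] == 1 else ">"
    # e_machine is a 16-bit field at offset 18, right after e_type.
    (machine,) = struct.unpack_from(byte_order + "H", header, 18)
    return ELF_MACHINES.get(machine, f"unknown({machine})")


def elf_machine_of_file(path: str) -> str:
    """Read just enough of the file at `path` to identify its architecture."""
    with open(path, "rb") as handle:
        return elf_machine(handle.read(20))


def fake_header(machine: int) -> bytes:
    """Build a 20-byte little-endian, 64-bit ELF header prefix for demonstration."""
    ident = b"\x7fELF" + bytes([2, 1, 1]) + bytes(9)  # 16-byte e_ident
    return ident + struct.pack("<HH", 2, machine)  # e_type = 2 (ET_EXEC), e_machine


if __name__ == "__main__":
    print(elf_machine(fake_header(183)))
    print(elf_machine(fake_header(62)))
```

On an arm64 node, a result other than `aarch64` for the container's entrypoint binary would be consistent with the error shown in the satellite logs.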