zrav opened this issue 2 weeks ago (status: Open)
08:21:26.998 [SpaceTrackingService] ERROR LINSTOR/Controller - SYSTEM - Uncaught exception in j [Report number 66CAE97D-00000-000000]
For some reason this service does not want to start. This could be related to some corrupted DB entries. You can try to clean the space tracking database:
kubectl get -oyaml spacehistory.internal.linstor.linbit.com > spacehistory.yaml
Attach it to this issue so we can debug the issue. Then run:
kubectl delete spacehistory.internal.linstor.linbit.com --all
and see if the controller starts.
Deleting the space history did indeed make the controller become ready again. File attached: spacehistory.txt
However, now the ha-controller pods are crashing:
k8s@k8scp1:~$ kubectl logs -n piraeus-datastore ha-controller-n8hm6
I0826 09:30:33.143518 1 agent.go:201] version: v1.2.1
I0826 09:30:33.143594 1 agent.go:202] node: k8sw3.example.com
I0826 09:30:33.143699 1 agent.go:228] waiting for caches to sync
I0826 09:30:33.244814 1 agent.go:230] caches synced
I0826 09:30:33.244849 1 agent.go:253] starting reconciliation
E0826 09:30:33.248622 1 run.go:74] "command failed" err="failed to execute drbdsetup status --json: exit status 20"
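The failure mode here is that the binary runs but exits non-zero, and the agent wraps that into the "failed to execute ...: exit status 20" error above. A minimal sketch of that wrapping behavior (this is not the ha-controller's actual code, and it uses `sh -c 'exit 20'` as a stand-in for `drbdsetup status --json`, since drbdsetup is only present on DRBD nodes):

```python
import subprocess

def run_status(cmd):
    """Run a status command and return its stdout, surfacing a non-zero
    exit as a 'failed to execute ...: exit status N' error, similar to
    the agent's log line above."""
    try:
        result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    except subprocess.CalledProcessError as e:
        raise RuntimeError(
            f"failed to execute {' '.join(cmd)}: exit status {e.returncode}"
        ) from e
    return result.stdout

# Stand-in for drbdsetup failing inside the container:
try:
    run_status(["sh", "-c", "exit 20"])
except RuntimeError as e:
    print(e)  # failed to execute sh -c exit 20: exit status 20
```

Note that the exit status comes from the command itself, not from a failure to launch it — which is why comparing the in-container invocation with a direct run on the node (below) is the natural next step.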
Running the command directly on the node produces the expected output:
root@k8sw3:~# drbdsetup status --json
[
  {
    "name": "pvc-6064ae82-e444-49ac-aaea-ea8eaccab384",
    "node-id": 2,
    "role": "Primary",
    "suspended": false,
    "suspended-user": false,
    "suspended-no-data": false,
    "suspended-fencing": false,
    "suspended-quorum": false,
    "force-io-failures": false,
    "write-ordering": "none",
    "devices": [
      {
        "volume": 0,
        "minor": 1002,
        "disk-state": "Diskless",
        "client": true,
        "quorum": true,
        "size": 65536,
        "read": 0,
        "written": 0,
        "al-writes": 0,
        "bm-writes": 0,
        "upper-pending": 0,
        "lower-pending": 0
      } ],
    "connections": [
      {
        "peer-node-id": 0,
        "name": "k8sw1.example.com",
        "connection-state": "Connected",
        "congested": false,
        "peer-role": "Secondary",
        "tls": false,
        "ap-in-flight": 0,
        "rs-in-flight": 0,
        "paths": [
          {
            "this_host": {
              "address": "10.82.0.26",
              "port": 7002,
              "family": "ipv4"
            },
            "remote_host": {
              "address": "10.82.0.23",
              "port": 7002,
              "family": "ipv4"
            },
            "established": true
          } ],
        "peer_devices": [
          {
            "volume": 0,
            "replication-state": "Established",
            "peer-disk-state": "UpToDate",
            "peer-client": false,
            "resync-suspended": "no",
            "received": 492,
            "sent": 13,
            "out-of-sync": 0,
            "pending": 0,
            "unacked": 0,
            "has-sync-details": false,
            "has-online-verify-details": false,
            "percent-in-sync": 100.00
          } ]
      },
      {
        "peer-node-id": 1,
        "name": "k8sw2.example.com",
        "connection-state": "Connected",
        "congested": false,
        "peer-role": "Secondary",
        "tls": false,
        "ap-in-flight": 0,
        "rs-in-flight": 0,
        "paths": [
          {
            "this_host": {
              "address": "10.82.0.26",
              "port": 7002,
              "family": "ipv4"
            },
            "remote_host": {
              "address": "10.82.0.24",
              "port": 7002,
              "family": "ipv4"
            },
            "established": true
          } ],
        "peer_devices": [
          {
            "volume": 0,
            "replication-state": "Established",
            "peer-disk-state": "UpToDate",
            "peer-client": false,
            "resync-suspended": "no",
            "received": 0,
            "sent": 0,
            "out-of-sync": 0,
            "pending": 0,
            "unacked": 0,
            "has-sync-details": false,
            "has-online-verify-details": false,
            "percent-in-sync": 100.00
          } ]
      } ]
  }
]
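For what it's worth, the output above looks healthy: the local device is an intentionally diskless client with quorum, and both peers are Connected/UpToDate. A short sketch of that kind of health check over the JSON (field names taken from the output above; the summarization logic is illustrative, not the ha-controller's actual reconciliation):

```python
import json

# Trimmed sample of the `drbdsetup status --json` output pasted above.
STATUS = """
[
  {
    "name": "pvc-6064ae82-e444-49ac-aaea-ea8eaccab384",
    "role": "Primary",
    "devices": [
      {"volume": 0, "disk-state": "Diskless", "client": true, "quorum": true}
    ],
    "connections": [
      {"name": "k8sw1.example.com", "connection-state": "Connected",
       "peer_devices": [{"volume": 0, "peer-disk-state": "UpToDate"}]},
      {"name": "k8sw2.example.com", "connection-state": "Connected",
       "peer_devices": [{"volume": 0, "peer-disk-state": "UpToDate"}]}
    ]
  }
]
"""

def summarize(status_json):
    """Return {resource-name: healthy} where healthy means every local
    device has quorum and every connection is Connected."""
    healthy = {}
    for res in json.loads(status_json):
        ok = all(dev["quorum"] for dev in res["devices"])
        ok = ok and all(conn["connection-state"] == "Connected"
                        for conn in res["connections"])
        healthy[res["name"]] = ok
    return healthy

print(summarize(STATUS))
# {'pvc-6064ae82-e444-49ac-aaea-ea8eaccab384': True}
```

Since the status itself parses cleanly and reports a healthy resource, the exit status 20 seen inside the pod points at an environment difference between the container and the node rather than at the DRBD state.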
Same issue here, deleting spacehistory resolved it. I can send my db if needed.
This is a k8s 1.30 test cluster on Ubuntu Jammy, with the DRBD modules installed via PPA. After upgrading from 2.5.1 to 2.5.2, the controller never becomes ready.
Happy to provide more info if told where to look.