andlf opened this issue 1 year ago
Do you see any pattern? How often does it crash?
You can also use this configuration to increase the verbosity of drbd-reactor:
---
apiVersion: piraeus.io/v1
kind: LinstorSatelliteConfiguration
metadata:
  name: drbd-reactor-trace
spec:
  patches:
    - target:
        kind: ConfigMap
        name: reactor-config
      patch: |
        apiVersion: v1
        kind: ConfigMap
        metadata:
          name: reactor-config
          labels:
            app.kubernetes.io/component: linstor-satellite
        data:
          prometheus.toml: |
            [[prometheus]]
            enums = true
            address = "[::]:9942"
            [[log]]
            level = "trace"
You then need to restart the pods for it to take effect: kubectl delete pod -l app.kubernetes.io/component=linstor-satellite
They crash after start on this cluster:
k01 1/2 Error 4 (65s ago) 3m45s
k02 2/2 Running 4 (65s ago) 3m36s
k03 1/2 CrashLoopBackOff 4 (61s ago) 3m22s
k04 0/2 Pending 0 3m5s
k05 1/2 CrashLoopBackOff 4 (54s ago) 4m13s
k06 1/2 Error 5 (93s ago) 4m
but on my second cluster it's not so often:
k01 2/2 Running 83 (154m ago) 4d22h
k02 2/2 Running 136 (114m ago) 3d22h
k03 2/2 Running 136 (154m ago) 4d22h
k04 2/2 Running 192 (155m ago) 4d22h
k05 2/2 Running 142 (154m ago) 4d22h
Logs now:
DEBUG [drbd_reactor] signal-handler: set up done
DEBUG [drbd_reactor] signal-handler: waiting for signals
DEBUG [drbd_reactor::events] events2_loop: starting process_events2 loop
DEBUG [drbd_reactor] main: configuration: Config {
    log: [
        LogConfig {
            level: Info,
            file: None,
        },
        LogConfig {
            level: Trace,
            file: None,
        },
    ],
    statistics_poll_interval: 60,
    snippets: Some(
        "/etc/drbd-reactor.d",
    ),
    plugins: PluginConfig {
        promoter: [],
        debugger: [],
        umh: [],
        prometheus: [
            PrometheusConfig {
                address: "[::]:9942",
                enums: true,
                id: None,
            },
        ],
    },
}
DEBUG [drbd_reactor] main: started.len()=1
TRACE [drbd_reactor::plugin::prometheus] run: start
Error: main: core did not exit successfully
Caused by:
sending on a disconnected channel
DEBUG [drbd_reactor::events] events2_loop: send error on chanel, bye
ERROR [drbd_reactor] main: events2 processing failed: sending on a disconnected channel
Nodes have the taint "drbd.linbit.com/lost-quorum:NoSchedule"; pods are stuck in Pending until I untaint the nodes.
drbd-reactor 1.1.0 released, maybe try it?
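One way to see which drbd-reactor image a satellite currently runs (the pod name is a placeholder, and this assumes the container in the satellite Pod is named drbd-reactor):
kubectl -n piraeus-datastore get pod <satellite-pod> -o jsonpath='{.spec.containers[?(@.name=="drbd-reactor")].image}'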
Nodes have the taint "drbd.linbit.com/lost-quorum:NoSchedule"; pods are stuck in Pending until I untaint the nodes.
This is because the satellites are crashing. But the satellite Pods should have the right toleration to enable them to start.
I guess something goes wrong in the prometheus exporter. For some reason the logs are not showing us why, which needs to be investigated further.
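To verify that a satellite pod actually carries the needed toleration, something like this should work (the pod name is a placeholder):
kubectl -n piraeus-datastore get pod <satellite-pod> -o jsonpath='{.spec.tolerations}'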
Is drbd-reactor a prometheus exporter only, or does it have a more important role? How can we temporarily disable it with a patch? Or switch to the latest release?
It's a prometheus exporter only. But if it's unused, I'm not sure how it could possibly crash.
My drbd-reactors were never scraped by prometheus; I think it receives some info from drbd (config statistics-poll-interval = 300) and stores it in memory. I removed drbd-reactor from the satellite:
---
apiVersion: piraeus.io/v1
kind: LinstorSatelliteConfiguration
metadata:
  name: no-reactor
spec:
  patches:
    - target:
        kind: Pod
        name: satellite
      patch: |
        apiVersion: v1
        kind: Pod
        metadata:
          name: satellite
        spec:
          containers:
            - name: drbd-reactor
              $patch: delete
Pods are running with 1 container, and kubectl -n piraeus-datastore exec deploy/linstor-controller -- linstor resource list-volumes shows all resources green and OK, but the nodes are always tainted with drbd.linbit.com/lost-quorum:NoSchedule.
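For reference, the taints can be inspected per node and removed manually if needed; the HA components will presumably re-apply the taint as long as they still consider quorum lost (k06 is just an example node here):
kubectl get nodes -o custom-columns=NAME:.metadata.name,TAINTS:.spec.taints
kubectl taint node k06 drbd.linbit.com/lost-quorum:NoSchedule-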
Are the ha-controller pods running OK? If yes, check the output of drbdsetup status from the ha-controller pods.
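For example, assuming the ha-controller pods run in the piraeus-datastore namespace (the pod name is a placeholder):
kubectl -n piraeus-datastore get pods | grep ha-controller
kubectl -n piraeus-datastore exec <ha-controller-pod> -- drbdsetup status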
The ha-controller pods are running without restarts; drbd status on different nodes shows:
pvc-44a381cd-3c4f-4f9b-bce1-15be84bcf4b0 role:Secondary suspended:quorum
  disk:Diskless quorum:no blocked:upper
  k02 connection:StandAlone
  k03 connection:StandAlone
  k04 connection:StandAlone

pvc-7617767e-8740-4d03-af7d-8943047157d6 role:Secondary suspended:quorum
  disk:Diskless quorum:no blocked:upper
  k01 connection:StandAlone
  k05 connection:StandAlone
  k06 connection:StandAlone

pvc-c8cb0a8e-7962-4b3f-996d-a41335523362 role:Secondary suspended:quorum
  disk:Diskless quorum:no blocked:upper
  k04 connection:StandAlone
  k05 connection:StandAlone
Looks like the DRBD connection does not work. Have you applied the network policy changes from #435 ?
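A quick way to double-check that no NetworkPolicy is left over in any namespace:
kubectl get networkpolicy -A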
Yes, the NetworkPolicy was deleted. I found this:
pvc-c8cb0a8e-7962-4b3f-996d-a41335523362 role:Secondary suspended:quorum
  disk:Diskless quorum:no blocked:upper
  k04 connection:StandAlone
  k05 connection:StandAlone
but:
kubectl -n piraeus-datastore exec deploy/linstor-controller -ti -- bash
root@linstor-controller-b6c77bcfb-2jtnl:/# linstor r l -r pvc-c8cb0a8e-7962-4b3f-996d-a41335523362
╭──────────────────────────────────────────────────────────────────────────────────────────────────────────╮
┊ ResourceName ┊ Node ┊ Port ┊ Usage ┊ Conns ┊ State ┊ CreatedOn ┊
╞══════════════════════════════════════════════════════════════════════════════════════════════════════════╡
┊ pvc-c8cb0a8e-7962-4b3f-996d-a41335523362 ┊ k02 ┊ 7011 ┊ InUse ┊ Ok ┊ Diskless ┊ 2023-03-22 15:59:05 ┊
┊ pvc-c8cb0a8e-7962-4b3f-996d-a41335523362 ┊ k04 ┊ 7011 ┊ Unused ┊ Ok ┊ UpToDate ┊ 2023-03-22 15:58:57 ┊
┊ pvc-c8cb0a8e-7962-4b3f-996d-a41335523362 ┊ k05 ┊ 7011 ┊ Unused ┊ Ok ┊ UpToDate ┊ 2023-03-22 15:59:05 ┊
╰──────────────────────────────────────────────────────────────────────────────────────────────────────────╯
Seems OK... I found that only Diskless replicas have quorum:no.
If placement=3 with 3 diskful replicas, how can there be a diskless replica? I think there shouldn't be one. The PVC was originally created with a placement=2 storageclass, then deleted and recreated with placement=3, but the diskless replica was not removed...
kubectl get pvc -A|grep pvc-161c5ff9-2380-4b7b-87ce-f1a9b60ce934
default data-volume Bound pvc-161c5ff9-2380-4b7b-87ce-f1a9b60ce934 1Gi RWO linstor-thindata-r3 7d
0,13:23:02,afilippov@runner:~$lr|grep pvc-161c5ff9-2380-4b7b-87ce-f1a9b60ce934
| k01 | pvc-161c5ff9-2380-4b7b-87ce-f1a9b60ce934 | vg0-thin | 0 | 1000 | /dev/drbd1000 | 979.58 MiB | Unused | UpToDate |
| k02 | pvc-161c5ff9-2380-4b7b-87ce-f1a9b60ce934 | vg0-thin | 0 | 1000 | /dev/drbd1000 | 979.58 MiB | InUse | UpToDate |
| k04 | pvc-161c5ff9-2380-4b7b-87ce-f1a9b60ce934 | vg0-thin | 0 | 1000 | /dev/drbd1000 | 1011.14 MiB | Unused | UpToDate |
and on node k06 the diskless replica has no quorum:
$kubectl -n piraeus-datastore exec k06 -- drbdadm status |grep pvc-161c5ff9-2380-4b7b-87ce-f1a9b60ce934 -A1
pvc-161c5ff9-2380-4b7b-87ce-f1a9b60ce934 role:Secondary suspended:quorum
  disk:Diskless client:yes quorum:no blocked:upper
Yes, there are "lost" diskless replicas:
lr|grep pvc-41c68379-8188-43a5-a618-a10dfc792608
| k02 | pvc-41c68379-8188-43a5-a618-a10dfc792608 | DfltDisklessStorPool | 0 | 1017 | /dev/drbd1017 | | InUse | Diskless |
| k04 | pvc-41c68379-8188-43a5-a618-a10dfc792608 | vg0-thin | 0 | 1017 | /dev/drbd1017 | 6.48 MiB | Unused | UpToDate |
| k05 | pvc-41c68379-8188-43a5-a618-a10dfc792608 | vg0-thin | 0 | 1017 | /dev/drbd1017 | 22.41 MiB | Unused | UpToDate |
and on another node there is a lost-quorum replica:
kubectl -n piraeus-datastore exec k06 -- drbdadm status |grep pvc-41c68379-8188-43a5-a618-a10dfc792608
pvc-41c68379-8188-43a5-a618-a10dfc792608 role:Secondary suspended:quorum
How could this happen? How can I remove it?
kubectl -n piraeus-datastore exec k06 -- drbdadm disconnect pvc-41c68379-8188-43a5-a618-a10dfc792608
'pvc-41c68379-8188-43a5-a618-a10dfc792608' not defined in your config (for this host).
command terminated with exit code 1
linstor r d k06 pvc-16a4bbb9-0cb4-4c06-a9e4-3e69f56a3808
WARNING:
Description:
Node: k06, Resource: pvc-16a4bbb9-0cb4-4c06-a9e4-3e69f56a3808 not found.
Details:
Node: k06, Resource: pvc-16a4bbb9-0cb4-4c06-a9e4-3e69f56a3808
Run drbdsetup down <resource-name>
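In this cluster that would be something like the following, run against the node that still holds the stale device (resource name and node taken from the output above):
kubectl -n piraeus-datastore exec k06 -- drbdsetup down pvc-41c68379-8188-43a5-a618-a10dfc792608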
As to how this could happen, I have no idea :/
Hello! I periodically have satellite pods in CrashLoopBackOff; drbd-reactor fails with:
Error: main: core did not exit successfully
Caused by: sending on a disconnected channel
How can I try to fix this? Thanks