piraeusdatastore / piraeus-operator

The Piraeus Operator manages LINSTOR clusters in Kubernetes.
https://piraeus.io/
Apache License 2.0

drbd-reactor fails with "sending on a disconnected channel" #441

Open andlf opened 1 year ago

andlf commented 1 year ago

Hello! I periodically have satellite pods in CrashLoopBackOff; drbd-reactor fails with: Error: main: core did not exit successfully

Caused by: sending on a disconnected channel

How can I try to fix this? Thanks
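A minimal sketch for collecting the crash logs, assuming the operator's default piraeus-datastore namespace and the drbd-reactor container name used later in this thread:

kubectl -n piraeus-datastore logs -l app.kubernetes.io/component=linstor-satellite -c drbd-reactor --previous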

WanzenBug commented 1 year ago

Do you see any pattern? How often does it crash?

You can also use this configuration to increase the verbosity of drbd-reactor:

---
apiVersion: piraeus.io/v1
kind: LinstorSatelliteConfiguration
metadata:
  name: drbd-reactor-trace
spec:
  patches:
    - target:
        kind: ConfigMap
        name: reactor-config
      patch: |
        apiVersion: v1
        kind: ConfigMap
        metadata:
          name: reactor-config
          labels:
            app.kubernetes.io/component: linstor-satellite
        data:
          prometheus.toml: |
            [[prometheus]]
            enums = true
            address = "[::]:9942"

            [[log]]
            level = "trace"

You then need to restart the pods for it to take effect: kubectl delete pod -l app.kubernetes.io/component=linstor-satellite
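A minimal sketch of rolling this out, assuming the manifest above is saved as drbd-reactor-trace.yaml and the default piraeus-datastore namespace is used:

kubectl apply -f drbd-reactor-trace.yaml
kubectl -n piraeus-datastore delete pod -l app.kubernetes.io/component=linstor-satellite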

andlf commented 1 year ago

They crash right after start on this cluster:

k01                                                    1/2     Error              4 (65s ago)     3m45s
k02                                                    2/2     Running            4 (65s ago)     3m36s
k03                                                    1/2     CrashLoopBackOff   4 (61s ago)     3m22s
k04                                                    0/2     Pending            0               3m5s
k05                                                    1/2     CrashLoopBackOff   4 (54s ago)     4m13s
k06                                                    1/2     Error              5 (93s ago)     4m

but on my second cluster it happens less often:

k01                                                    2/2     Running   83 (154m ago)    4d22h
k02                                                    2/2     Running   136 (114m ago)   3d22h
k03                                                    2/2     Running   136 (154m ago)   4d22h
k04                                                    2/2     Running   192 (155m ago)   4d22h
k05                                                    2/2     Running   142 (154m ago)   4d22h

Logs now:

DEBUG [drbd_reactor] signal-handler: set up done
DEBUG [drbd_reactor] signal-handler: waiting for signals
DEBUG [drbd_reactor::events] events2_loop: starting process_events2 loop
DEBUG [drbd_reactor] main: configuration: Config {
    log: [
        LogConfig {
            level: Info,
            file: None,
        },
        LogConfig {
            level: Trace,
            file: None,
        },
    ],
    statistics_poll_interval: 60,
    snippets: Some(
        "/etc/drbd-reactor.d",
    ),
    plugins: PluginConfig {
        promoter: [],
        debugger: [],
        umh: [],
        prometheus: [
            PrometheusConfig {
                address: "[::]:9942",
                enums: true,
                id: None,
            },
        ],
    },
}
DEBUG [drbd_reactor] main: started.len()=1
TRACE [drbd_reactor::plugin::prometheus] run: start
Error: main: core did not exit successfully

Caused by:
    sending on a disconnected channel
DEBUG [drbd_reactor::events] events2_loop: send error on chanel, bye
ERROR [drbd_reactor] main: events2 processing failed: sending on a disconnected channel

The nodes have the taint "drbd.linbit.com/lost-quorum:NoSchedule"; the pods stay in Pending until I untaint the nodes.

drbd-reactor 1.1.0 has been released, maybe try it?
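As a side note on the lost-quorum taint mentioned above: a taint with this key and effect can be removed manually with standard kubectl syntax (k04 here is just one of the nodes from the listing above, and whatever set the taint may re-apply it while the underlying quorum problem persists):

kubectl taint nodes k04 drbd.linbit.com/lost-quorum:NoSchedule-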

WanzenBug commented 1 year ago

> The nodes have the taint "drbd.linbit.com/lost-quorum:NoSchedule"; the pods stay in Pending until I untaint the nodes.

This is because the satellites are crashing. But the satellite Pods should have the right toleration, which lets them start anyway.

I guess something goes wrong in the Prometheus exporter. For some reason the logs are not showing us why; this needs to be investigated further.

andlf commented 1 year ago

Is drbd-reactor only a Prometheus exporter, or does it have a more important role? How can we temporarily disable it with a patch? Or switch to the latest release?

WanzenBug commented 1 year ago

It's the Prometheus exporter only. But if it's unused, I'm not sure how it could possibly crash.

andlf commented 1 year ago

My drbd-reactors were never scraped by Prometheus; I think it just receives some info from DRBD (config statistics-poll-interval = 300) and stores it in memory. I removed drbd-reactor from the satellite:

---
apiVersion: piraeus.io/v1
kind: LinstorSatelliteConfiguration
metadata:
  name: no-reactor
spec:
  patches:
    - target:
        kind: Pod
        name: satellite
      patch: |
        apiVersion: v1
        kind: Pod
        metadata:
          name: satellite
        spec:
          containers:
          - name: drbd-reactor
            $patch: delete

Pods are now running with 1 container, and kubectl -n piraeus-datastore exec deploy/linstor-controller -- linstor resource list-volumes shows all resources green and OK, but the nodes are still tainted with drbd.linbit.com/lost-quorum:NoSchedule.
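To see which nodes currently carry the taint, a plain kubectl query works (a sketch, nothing operator-specific):

kubectl get nodes -o custom-columns='NAME:.metadata.name,TAINTS:.spec.taints[*].key'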

WanzenBug commented 1 year ago

Are the ha-controller pods running OK? If yes, check the output of drbdsetup status from the ha-controller pods.
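A sketch of that check, assuming the HA controller runs as the ha-controller DaemonSet in the piraeus-datastore namespace (this execs into one of its pods; repeat against the pods on other nodes to compare):

kubectl -n piraeus-datastore exec ds/ha-controller -- drbdsetup status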

andlf commented 1 year ago

The ha-controller pods are running without restarts; drbdsetup status on different nodes shows:

pvc-44a381cd-3c4f-4f9b-bce1-15be84bcf4b0 role:Secondary suspended:quorum
  disk:Diskless quorum:no blocked:upper
  k02 connection:StandAlone
  k03 connection:StandAlone
  k04 connection:StandAlone

pvc-7617767e-8740-4d03-af7d-8943047157d6 role:Secondary suspended:quorum
  disk:Diskless quorum:no blocked:upper
  k01 connection:StandAlone
  k05 connection:StandAlone
  k06 connection:StandAlone

pvc-c8cb0a8e-7962-4b3f-996d-a41335523362 role:Secondary suspended:quorum
  disk:Diskless quorum:no blocked:upper
  k04 connection:StandAlone
  k05 connection:StandAlone

WanzenBug commented 1 year ago

Looks like the DRBD connection does not work. Have you applied the network policy changes from #435 ?
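If in doubt, a quick sketch to check for leftover policies:

kubectl -n piraeus-datastore get networkpolicy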

andlf commented 1 year ago

Yes, the NetworkPolicy has been deleted. I found this:

pvc-c8cb0a8e-7962-4b3f-996d-a41335523362 role:Secondary suspended:quorum
  disk:Diskless quorum:no blocked:upper
  k04 connection:StandAlone
  k05 connection:StandAlone

but:

kubectl -n piraeus-datastore exec deploy/linstor-controller -ti -- bash
root@linstor-controller-b6c77bcfb-2jtnl:/# linstor r l -r pvc-c8cb0a8e-7962-4b3f-996d-a41335523362
╭──────────────────────────────────────────────────────────────────────────────────────────────────────────╮
┊ ResourceName                             ┊ Node ┊ Port ┊ Usage  ┊ Conns ┊    State ┊ CreatedOn           ┊
╞══════════════════════════════════════════════════════════════════════════════════════════════════════════╡
┊ pvc-c8cb0a8e-7962-4b3f-996d-a41335523362 ┊ k02  ┊ 7011 ┊ InUse  ┊ Ok    ┊ Diskless ┊ 2023-03-22 15:59:05 ┊
┊ pvc-c8cb0a8e-7962-4b3f-996d-a41335523362 ┊ k04  ┊ 7011 ┊ Unused ┊ Ok    ┊ UpToDate ┊ 2023-03-22 15:58:57 ┊
┊ pvc-c8cb0a8e-7962-4b3f-996d-a41335523362 ┊ k05  ┊ 7011 ┊ Unused ┊ Ok    ┊ UpToDate ┊ 2023-03-22 15:59:05 ┊
╰──────────────────────────────────────────────────────────────────────────────────────────────────────────╯

Seems OK... I found that only Diskless replicas have quorum:no.
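A sketch for listing every resource that reports quorum:no on a node, using the satellite pod name from this thread (the grep runs locally on the kubectl output):

kubectl -n piraeus-datastore exec k06 -- drbdsetup status | grep -B1 'quorum:no'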

andlf commented 1 year ago

If placement=3 with 3 diskful replicas, can there be a diskless replica? I think not. The PVC was originally created with a storage class with placement=2, then deleted and re-created with placement=3, but the diskless replica was not removed...

kubectl get pvc -A|grep pvc-161c5ff9-2380-4b7b-87ce-f1a9b60ce934
default            data-volume                                               Bound    pvc-161c5ff9-2380-4b7b-87ce-f1a9b60ce934   1Gi        RWO            linstor-thindata-r3   7d
0,13:23:02,afilippov@runner:~$lr|grep pvc-161c5ff9-2380-4b7b-87ce-f1a9b60ce934
| k01  | pvc-161c5ff9-2380-4b7b-87ce-f1a9b60ce934 | vg0-thin             |     0 |    1000 | /dev/drbd1000 |  979.58 MiB | Unused |   UpToDate |
| k02  | pvc-161c5ff9-2380-4b7b-87ce-f1a9b60ce934 | vg0-thin             |     0 |    1000 | /dev/drbd1000 |  979.58 MiB | InUse  |   UpToDate |
| k04  | pvc-161c5ff9-2380-4b7b-87ce-f1a9b60ce934 | vg0-thin             |     0 |    1000 | /dev/drbd1000 | 1011.14 MiB | Unused |   UpToDate |

and on node k06 the diskless replica has no quorum:

$kubectl -n piraeus-datastore exec k06 -- drbdadm status |grep pvc-161c5ff9-2380-4b7b-87ce-f1a9b60ce934 -A1
pvc-161c5ff9-2380-4b7b-87ce-f1a9b60ce934 role:Secondary suspended:quorum
  disk:Diskless client:yes quorum:no blocked:upper

andlf commented 1 year ago

Yes, there are "lost" diskless replicas:

lr|grep pvc-41c68379-8188-43a5-a618-a10dfc792608
| k02  | pvc-41c68379-8188-43a5-a618-a10dfc792608 | DfltDisklessStorPool |     0 |    1017 | /dev/drbd1017 |             | InUse  |   Diskless |
| k04  | pvc-41c68379-8188-43a5-a618-a10dfc792608 | vg0-thin             |     0 |    1017 | /dev/drbd1017 |    6.48 MiB | Unused |   UpToDate |
| k05  | pvc-41c68379-8188-43a5-a618-a10dfc792608 | vg0-thin             |     0 |    1017 | /dev/drbd1017 |   22.41 MiB | Unused |   UpToDate |

and on another node there is a replica that has lost quorum:

kubectl -n piraeus-datastore exec k06 -- drbdadm status |grep pvc-41c68379-8188-43a5-a618-a10dfc792608
pvc-41c68379-8188-43a5-a618-a10dfc792608 role:Secondary suspended:quorum

How could this happen? How can I remove it?

kubectl -n piraeus-datastore exec k06 -- drbdadm disconnect pvc-41c68379-8188-43a5-a618-a10dfc792608
'pvc-41c68379-8188-43a5-a618-a10dfc792608' not defined in your config (for this host).
command terminated with exit code 1

linstor r d k06 pvc-16a4bbb9-0cb4-4c06-a9e4-3e69f56a3808
WARNING:
Description:
    Node: k06, Resource: pvc-16a4bbb9-0cb4-4c06-a9e4-3e69f56a3808 not found.
Details:
    Node: k06, Resource: pvc-16a4bbb9-0cb4-4c06-a9e4-3e69f56a3808

WanzenBug commented 1 year ago

Run

drbdsetup down <resource-name>

As to how this could happen, I have no idea :/
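For the concrete resource from this thread, that would look something like the following (pod name taken from the listings above; double-check with LINSTOR that the replica really is orphaned before taking it down):

kubectl -n piraeus-datastore exec k06 -- drbdsetup down pvc-41c68379-8188-43a5-a618-a10dfc792608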