piraeusdatastore / piraeus-operator

The Piraeus Operator manages LINSTOR clusters in Kubernetes.
https://piraeus.io/
Apache License 2.0

drbd-reactor crashing #415

Open nashant opened 1 year ago

nashant commented 1 year ago

One of my satellite pods is crashlooping. The cause is the drbd-reactor container, which gives only the following logs:

$ k logs -n piraeus-datastore server -c drbd-reactor
Error: main: core did not exit successfully

Caused by:
    sending on a disconnected channel

Any idea?

WanzenBug commented 1 year ago

Which version of the image is running? If you are not already using the v1.0.0 image, please try upgrading to it, as it generally has better error reporting.

nashant commented 1 year ago

Yup, already using v1.0.0

nashant commented 1 year ago

Any thoughts? Can I increase logging somehow?

WanzenBug commented 1 year ago

You might be able to add a second entry to the piraeus-op-node-monitoring configmap:

data:
  log.toml: |
    [[log]]
    level = "debug"

RichardSufliarsky commented 1 year ago

Experiencing the same, and even with the trace level there is no more info.

WanzenBug commented 1 year ago

We noticed that sometimes reactor still discards some log messages, especially the log message when creating the Prometheus socket. I assume in both cases this is related to reactor for some reason not being able to bind to [::]:9942. As for why, I cannot tell. Perhaps some strange network configuration with disabling IPv6 on the kernel level?
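One way to test this hypothesis on an affected node is to try both binds directly. The following is a minimal sketch, not part of drbd-reactor: the helper name `can_bind` is hypothetical, and it binds to an ephemeral port (0) rather than 9942 so it does not collide with a running reactor. It assumes it is run in the same network namespace as the satellite.

```python
import socket


def can_bind(host: str, port: int = 0) -> bool:
    """Return True if a TCP socket can be bound to host:port.

    Hypothetical helper for illustration. Port 0 asks the kernel for an
    ephemeral port, so the check does not conflict with existing listeners.
    """
    # Addresses containing ':' are treated as IPv6, everything else as IPv4.
    family = socket.AF_INET6 if ":" in host else socket.AF_INET
    try:
        # Creating an AF_INET6 socket already fails with OSError when the
        # IPv6 stack is unavailable, so socket creation is inside the try.
        with socket.socket(family, socket.SOCK_STREAM) as sock:
            sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
            sock.bind((host, port))
        return True
    except OSError:
        return False


if __name__ == "__main__":
    for host in ("::", "0.0.0.0"):
        status = "ok" if can_bind(host) else "failed"
        print(f"bind {host!r}: {status}")
```

If the `::` bind fails while the `0.0.0.0` bind succeeds, that would be consistent with the IPv6 stack being unavailable on that node.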

RichardSufliarsky commented 1 year ago

Correct, we have IPv6 disabled in the kernel: GRUB_CMDLINE_LINUX="rd.lvm.lv=rhel/root rhgb quiet ipv6.disable=1". I am also using ipFamilyPolicy: SingleStack when creating the LinstorCluster:

    - target:
        kind: Service
        name: linstor-controller
      patch: |-
        apiVersion: v1
        kind: Service
        metadata:
          name: linstor-controller
        spec:
          ipFamilyPolicy: SingleStack
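For anyone checking whether their own nodes are in the same situation, here is a small sketch that looks for the `ipv6.disable=1` parameter on the kernel command line. The function name `ipv6_disabled` is made up for this example, and it only covers this one boot parameter, not other ways of turning IPv6 off (such as sysctl settings).

```python
from pathlib import Path


def ipv6_disabled(cmdline: str) -> bool:
    """Return True if the kernel command line contains ipv6.disable=1.

    Hypothetical helper: it only checks the boot parameter, not sysctl
    settings such as net.ipv6.conf.all.disable_ipv6.
    """
    # Kernel parameters are whitespace-separated tokens.
    return "ipv6.disable=1" in cmdline.split()


if __name__ == "__main__":
    proc_cmdline = Path("/proc/cmdline")
    # /proc/cmdline exists on Linux only; skip quietly elsewhere.
    if proc_cmdline.exists():
        disabled = ipv6_disabled(proc_cmdline.read_text())
        print("IPv6 disabled at boot:", disabled)
```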

WanzenBug commented 1 year ago

Then you probably need to patch the reactor config to use 0.0.0.0:9942 instead of the wildcard [::] address. Binding to [::] normally works fine even on IPv4-only systems, but disabling the IPv6 subsystem in the kernel tends to break it.

RichardSufliarsky commented 1 year ago

Could you please point me to where I can change this globally via a CRD? When I edit the nodename-reactor-config config map directly and delete the pod for that node to restart drbd-reactor, the address in the config map is reset to [::], although the log.toml part with the trace level (which I also added manually) stays untouched.

RichardSufliarsky commented 1 year ago

Sorry, found it: https://github.com/piraeusdatastore/piraeus-operator/issues/441#issuecomment-1484970615

RichardSufliarsky commented 1 year ago

No drbd-reactor container crash since I have set Prometheus address to 0.0.0.0:9942:

apiVersion: piraeus.io/v1
kind: LinstorSatelliteConfiguration
metadata:
  name: drbd-reactor-trace
spec:
  patches:
    - target:
        kind: ConfigMap
        name: reactor-config
      patch: |
        apiVersion: v1
        kind: ConfigMap
        metadata:
          name: reactor-config
          labels:
            app.kubernetes.io/component: linstor-satellite
        data:
          prometheus.toml: |
            [[prometheus]]
            enums = true
            address = "0.0.0.0:9942"

            [[log]]
            level = "trace"
          log.toml: |
            [[log]]
            level = "trace"