piraeusdatastore / piraeus-operator

The Piraeus Operator manages LINSTOR clusters in Kubernetes.
https://piraeus.io/
Apache License 2.0

Back-off restarting failed container drbd-shutdown-guard (2.0.1) #426

Open vasyakrg opened 1 year ago

vasyakrg commented 1 year ago

After upgrading to version 2.0.1, the container is in back-off restarting with:

2023/03/13 10:20:03 failed: failed to reload systemd
2023/03/13 10:20:52 Running drbd-shutdown-guard version v1.0.0
2023/03/13 10:20:52 Creating service directory '/run/drbd-shutdown-guard'
2023/03/13 10:20:52 Copying drbdsetup to service directory
2023/03/13 10:20:52 Copying drbd-shutdown-guard to service directory
2023/03/13 10:20:52 Optionally: relabel service directory for SELinux
2023/03/13 10:20:52 ignoring error when setting selinux label: exit status 127
2023/03/13 10:20:52 Creating systemd unit drbd-shutdown-guard.service in /run/systemd/system
2023/03/13 10:20:52 Reloading systemd
Error: failed to reload systemd
Usage:
  drbd-shutdown-guard install [flags]

Flags:
  -h, --help   help for install

2023/03/13 10:20:52 failed: failed to reload systemd

This happens in the LinstorSatellite.

OS on host:

andlf commented 1 year ago

I have this problem on Debian 11

WanzenBug commented 1 year ago

I can't reproduce this on a simple Ubuntu 22.04 cluster. I use containerd and kubeadm to create the cluster. Is there anything in the system logs? Perhaps AppArmor is interfering?

At the step where it fails, the init container executes systemctl daemon-reload. Perhaps there is a permission error from inside the container.
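
For reference, a rough sketch of the host paths involved, pieced together from the log above; the volume names and the D-Bus socket mount are assumptions for illustration, not taken from the operator's actual manifests:

# hypothetical fragment of the satellite Pod spec (names are illustrative)
initContainers:
- name: drbd-shutdown-guard
  command: ["drbd-shutdown-guard", "install"]
  volumeMounts:
  - name: run-systemd-system          # the unit file is written here (see log above)
    mountPath: /run/systemd/system
  - name: run-drbd-shutdown-guard     # drbdsetup and the guard binary are copied here
    mountPath: /run/drbd-shutdown-guard
  - name: systemd-bus-socket          # systemctl daemon-reload talks to PID 1 over D-Bus
    mountPath: /run/dbus/system_bus_socket
volumes:
- name: run-systemd-system
  hostPath:
    path: /run/systemd/system
- name: run-drbd-shutdown-guard
  hostPath:
    path: /run/drbd-shutdown-guard
- name: systemd-bus-socket
  hostPath:
    path: /run/dbus/system_bus_socket

If AppArmor, SELinux, or a missing socket mount blocks access to the D-Bus socket, the daemon-reload from inside the container would fail the way it does here.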

In any case, a workaround for now:

apiVersion: piraeus.io/v1
kind: LinstorSatelliteConfiguration
metadata:
  name: no-drbd-shutdown-guard
spec:
  patches:
  - target:
      kind: Pod
      name: satellite
    patch: |
      apiVersion: v1
      kind: Pod
      metadata:
        name: satellite
      spec:
        initContainers:
        - name: drbd-shutdown-guard
          $patch: delete
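
This is applied like any other LinstorSatelliteConfiguration (for example with kubectl apply -f); the operator should then recreate the satellite pods without the drbd-shutdown-guard init container.
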
vasyakrg commented 1 year ago

(quoting WanzenBug's workaround above)

Yes, it works, the cluster is up and ready. I used RKE2 to create the k8s cluster.

vasyakrg commented 1 year ago

The logs are clean; they only show the run and stop of that container:

Mar 13 12:56:57 rke2-node1 systemd[1]: cri-containerd-81ef8139d96334565e7ad0c6a7255f767b6f76438dcb4a42c966d66cb1e886e7.scope: Deactivated successfully.
Mar 13 12:56:57 rke2-node1 systemd[1]: run-k3s-containerd-io.containerd.runtime.v2.task-k8s.io-81ef8139d96334565e7ad0c6a7255f767b6f76438dcb4a42c966d66cb1e886e7-rootfs.mount: Deactivated successfully.
Mar 13 12:57:39 rke2-node1 systemd[1]: Started libcontainer container c09dc34921d87ce927da8fd87d55f424ef778425f6008f3da9a133040b2f20a4.
Mar 13 12:57:39 rke2-node1 systemd[1]: cri-containerd-c09dc34921d87ce927da8fd87d55f424ef778425f6008f3da9a133040b2f20a4.scope: Deactivated successfully.
Mar 13 12:57:40 rke2-node1 systemd[1]: run-k3s-containerd-io.containerd.runtime.v2.task-k8s.io-c09dc34921d87ce927da8fd87d55f424ef778425f6008f3da9a133040b2f20a4-rootfs.mount: Deactivated successfully.
WanzenBug commented 1 year ago

Any special security configuration? I just ran the rke2 setup with default settings and it seemed to start fine :/

vasyakrg commented 1 year ago

Nope, a default install done by hand.

Nosmoht commented 1 year ago

Hi all,

how can v2.0.1 be used on systems without systemd, like Talos? Isn't it a bad idea to add an external OS dependency at all?

WanzenBug commented 1 year ago

See the above config "patch". You may also need to remove the host-mounts.

Isn't it a bad idea to add an external OS dependency at all?

The dependency was deemed worth it, as most users will have systemd installed, and the shutdown guard solves an issue many users will run into once they shut down a node without evicting all pods beforehand.
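
For a system without systemd, a sketch of an extended patch that also drops the systemd-related host mounts; the initContainer deletion is the documented workaround above, while the volume names are assumptions derived from the paths in the log, so check them against the actual satellite Pod spec before applying:

apiVersion: piraeus.io/v1
kind: LinstorSatelliteConfiguration
metadata:
  name: no-drbd-shutdown-guard
spec:
  patches:
  - target:
      kind: Pod
      name: satellite
    patch: |
      apiVersion: v1
      kind: Pod
      metadata:
        name: satellite
      spec:
        initContainers:
        - name: drbd-shutdown-guard
          $patch: delete
        volumes:
        - name: run-systemd-system        # assumed hostPath volume for /run/systemd/system
          $patch: delete
        - name: run-drbd-shutdown-guard   # assumed hostPath volume for /run/drbd-shutdown-guard
          $patch: delete
        - name: systemd-bus-socket        # assumed hostPath volume for the systemd D-Bus socket
          $patch: delete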

vasyakrg commented 1 year ago

And how can I run the operator and create a cluster on a system with docker.io?

If the k8s cluster is brought up via RKE (the first CLI version from Rancher), all cluster components also run in containers there.

I add the mounts to the k8s configuration, the LINSTOR cluster comes up and even creates disks, but they are RO (read-only):

kubelet:
    extra_binds:
      - "/usr/lib/modules:/usr/lib/modules"
      - "/var/lib/piraeus-datastore:/var/lib/piraeus-datastore"
WanzenBug commented 1 year ago

What exactly is RO? The volumes created by using a piraeus storage class? This seems to be a separate issue. Please create a new issue for it.

VadimkP commented 7 months ago

I got the same problem, with exactly the same error in the drbd-shutdown-guard log. K8s version 1.28.5, Cilium CNI. I tried with both CRI-O and containerd.

Log from the operator pod:

2024-02-16T09:54:27Z ERROR Reconciler error {"controller": "linstorcluster", "controllerGroup": "piraeus.io", "controllerKind": "LinstorCluster", "LinstorCluster": {"name":"linstorcluster"}, "namespace": "", "name": "linstorcluster", "reconcileID": "dfe748f6-b19d-4e25-945f-a69660e3753f", "error": "context deadline exceeded"}

(two screenshots attached)

WanzenBug commented 6 months ago

What host OS are you using?

VadimkP commented 6 months ago

What host OS are you using?

Ubuntu 20.04

WanzenBug commented 6 months ago

:thinking: Also using RKE?

We can probably make shutdown-guard ignore these kinds of errors, but I want to make sure it is active in as many cases as possible, as it is a very useful feature...

VadimkP commented 6 months ago

No RKE, just a simple k8s cluster of three nodes for internal tests, with control-plane and worker roles combined.

danoh commented 6 months ago

Hello, exactly the same situation on Debian 12 + k8s + CRI-O. Client Version: v1.29.2, Kustomize Version: v5.0.4-0.20230601165947-6ce0bf390ce3, Server Version: v1.29.2. Fresh install on a clean machine.

The workaround from comment https://github.com/piraeusdatastore/piraeus-operator/issues/426#issuecomment-1465934727 fixed it.