kubernetes / node-problem-detector

This is a place for various problem detectors running on the Kubernetes nodes.
Apache License 2.0

health-checker-containerd fails on bottlerocketOS #707

Closed michalschott closed 7 months ago

michalschott commented 2 years ago
# /home/kubernetes/bin/health-checker --component=cri
I1005 12:01:11.493394      43 health_checker.go:172] command /usr/bin/crictl --runtime-endpoint=unix:///var/run/containerd/containerd.sock pods --latest failed: fork/exec /usr/bin/crictl: no such file or directory, []
I1005 12:01:11.496574      43 health_checker.go:172] command /bin/systemctl show containerd --property=InactiveExitTimestamp failed: exit status 1, []
I1005 12:01:11.496740      43 health_checker.go:86] error in getting uptime for cri: exit status 1
cri:containerd was found unhealthy; repair flag : true

# /usr/bin/crictl --runtime-endpoint=unix:///var/run/containerd/containerd.sock pods --latest
sh: 18: /usr/bin/crictl: not found

# /bin/systemctl show containerd --property=InactiveExitTimestamp
System has not been booted with systemd as init system (PID 1). Can't operate.
Failed to connect to bus: Host is down
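
Both failures above come from the container environment rather than from containerd itself: the crictl binary is absent, and systemctl cannot see a running systemd. A minimal preflight sketch follows. It assumes that systemd's usual "booted" convention (the directory /run/systemd/system exists) is what systemctl checks; `check_prereqs` is a hypothetical helper and not part of node-problem-detector.

```shell
#!/bin/sh
# Hypothetical helper: report which prerequisites of
# `health-checker --component=cri` are missing in this container.
# Both paths can be overridden as arguments, which makes the function testable.
check_prereqs() {
    crictl_path="${1:-/usr/bin/crictl}"
    systemd_dir="${2:-/run/systemd/system}"
    status=0
    if [ ! -x "$crictl_path" ]; then
        echo "missing: $crictl_path"
        status=1
    fi
    # Assumption: this directory exists only when systemd is the init system
    # visible from this mount namespace.
    if [ ! -d "$systemd_dir" ]; then
        echo "no systemd: $systemd_dir not found"
        status=1
    fi
    return "$status"
}

check_prereqs "$@" || echo "health-checker --component=cri is likely to fail here"
```

Run inside the NPD container, this prints one line per missing prerequisite and stays silent when both are present.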
michalschott commented 2 years ago

Based on https://bytemeta.vip/repo/kubernetes/node-problem-detector/issues/683 I've managed to solve it with a custom NPD container.

Dockerfile:

FROM registry.k8s.io/node-problem-detector/node-problem-detector:v0.8.12 as builder

# Install crictl
ARG TARGETOS
ARG TARGETARCH
# `BUILDX_ARCH` is used in the crictl download URL below.
# The required format is `TARGETOS-TARGETARCH`.
# It defaults to linux-amd64 so that the Dockerfile
# works with or without BuildKit.
ENV BUILDX_ARCH="${TARGETOS:-linux}-${TARGETARCH:-amd64}"

ARG VERSION="v1.25.0"
RUN apt-get -qq update && \
    DEBIAN_FRONTEND=noninteractive apt-get install -qq -y curl unzip < /dev/null > /dev/null && \
    rm -rf /var/cache/apt/* && \
    curl -sLO https://github.com/kubernetes-sigs/cri-tools/releases/download/$VERSION/crictl-${VERSION}-${BUILDX_ARCH}.tar.gz && \
    tar zxvf crictl-$VERSION-${BUILDX_ARCH}.tar.gz -C /usr/bin && \
    rm -f crictl-$VERSION-${BUILDX_ARCH}.tar.gz && \
    apt-get -qq autoremove curl unzip
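
To see which archive the RUN step above downloads, the URL construction can be sketched in plain shell. The function name `crictl_url` is hypothetical; the URL pattern and the linux/amd64 defaults are taken from the Dockerfile.

```shell
#!/bin/sh
# Hypothetical helper mirroring the Dockerfile's URL construction.
# Arguments: version, then optional OS and architecture, which default to
# linux and amd64 just as BUILDX_ARCH does when BuildKit does not set them.
crictl_url() {
    version="$1"
    os="${2:-linux}"
    arch="${3:-amd64}"
    echo "https://github.com/kubernetes-sigs/cri-tools/releases/download/${version}/crictl-${version}-${os}-${arch}.tar.gz"
}

crictl_url v1.25.0
# → https://github.com/kubernetes-sigs/cri-tools/releases/download/v1.25.0/crictl-v1.25.0-linux-amd64.tar.gz
crictl_url v1.25.0 linux arm64
# → https://github.com/kubernetes-sigs/cri-tools/releases/download/v1.25.0/crictl-v1.25.0-linux-arm64.tar.gz
```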

Update daemonset manifest with:

spec.template.spec.containers.0.volumeMounts:
  - mountPath: /var/run/containerd/containerd.sock
    name: containerd

spec.template.spec.volumes:
  - name: containerd
    hostPath:
      path: /run/dockershim.sock
      type: Socket
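
Once the pod is running with this mount, a quick check from inside the container can confirm that the path really is a socket before blaming crictl. This is only a sketch: `socket_state` and the `SOCK` variable are hypothetical names, and the crictl invocation is the same one health-checker logged above.

```shell
#!/bin/sh
# Path the health checker expects inside the container (from the log above).
SOCK="${SOCK:-/var/run/containerd/containerd.sock}"

# Hypothetical helper: classify a path as a socket, some other file, or absent.
socket_state() {
    if [ -S "$1" ]; then
        echo "socket"
    elif [ -e "$1" ]; then
        echo "not-a-socket"
    else
        echo "absent"
    fi
}

case "$(socket_state "$SOCK")" in
    socket)
        # Same command health-checker runs; report instead of aborting on failure.
        crictl --runtime-endpoint="unix://$SOCK" pods --latest \
            || echo "crictl call failed" >&2
        ;;
    *)
        echo "$SOCK is $(socket_state "$SOCK"); check volumeMounts and hostPath" >&2
        ;;
esac
```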

It would still be handy to have crictl installed in the NPD image out of the box.

k8s-triage-robot commented 1 year ago

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

You can:

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot commented 1 year ago

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

You can:

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

michalschott commented 1 year ago

/remove-lifecycle stale

btiernay commented 1 year ago

@michalschott Did you ever make any progress with this?

# /bin/systemctl show containerd --property=InactiveExitTimestamp
System has not been booted with systemd as init system (PID 1). Can't operate.
Failed to connect to bus: Host is down
michalschott commented 1 year ago

@btiernay I never had that problem once I built my own container and updated the manifest. Make sure you have updated the mountPaths.

btiernay commented 1 year ago

@michalschott I'm curious how you got around the SELinux constraints in Bottlerocket with systemctl. I had to install a couple of new packages:

# curl is added to the install list because the download step below needs it
RUN apt-get -qq update && \
    DEBIAN_FRONTEND=noninteractive apt-get install -qq -y --allow-change-held-packages curl libcap2 systemd strace < /dev/null > /dev/null && \
    rm -rf /var/cache/apt/* && \
    curl -sLO https://github.com/kubernetes-sigs/cri-tools/releases/download/$VERSION/crictl-${VERSION}-${BUILDX_ARCH}.tar.gz && \
    tar zxvf crictl-$VERSION-${BUILDX_ARCH}.tar.gz -C /usr/bin && \
    rm -f crictl-$VERSION-${BUILDX_ARCH}.tar.gz && \
    apt-get -qq autoremove curl

I then ran it with SYSTEMD_IGNORE_CHROOT=1 in the environment. Even so, I hit MAC (mandatory access control) issues after configuring SELinux labels in my DaemonSet:

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: node-problem-detector
  namespace: kube-system
spec:
  template:
    spec:
      hostPID: true
      hostIPC: true
      hostNetwork: true
      containers:
        - name: node-problem-detector
          securityContext:
            privileged: true
            seLinuxOptions:
              user: system_u
              role: system_r
              type: super_t
              level: s0

Curious how you got around that.

FYI - continuing the discussion with the Bottlerocket community here: https://github.com/bottlerocket-os/bottlerocket/discussions/3156

And for the overall future of the integration here: https://github.com/bottlerocket-os/bottlerocket/discussions/3156

Please chime in if you are so inclined!

btiernay commented 1 year ago

FYI: Was able to get this to work per https://github.com/bottlerocket-os/bottlerocket/discussions/3156. The key point was removing privileged: true.

michalschott commented 1 year ago

@btiernay I do not set the seLinuxOptions key, but I'm glad you sorted this out.

k8s-triage-robot commented 9 months ago

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

You can:

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot commented 8 months ago

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

You can:

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

k8s-triage-robot commented 7 months ago

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

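This bot triages issues according to the following rules:

- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

- Reopen this issue with /reopen
- Mark this issue as fresh with /remove-lifecycle rotten
- Offer to help out with Issue Triage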

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

k8s-ci-robot commented 7 months ago

@k8s-triage-robot: Closing this issue, marking it as "Not Planned".

In response to [this](https://github.com/kubernetes/node-problem-detector/issues/707#issuecomment-2013418758):

>The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.
>
>This bot triages issues according to the following rules:
>- After 90d of inactivity, `lifecycle/stale` is applied
>- After 30d of inactivity since `lifecycle/stale` was applied, `lifecycle/rotten` is applied
>- After 30d of inactivity since `lifecycle/rotten` was applied, the issue is closed
>
>You can:
>- Reopen this issue with `/reopen`
>- Mark this issue as fresh with `/remove-lifecycle rotten`
>- Offer to help out with [Issue Triage][1]
>
>Please send feedback to sig-contributor-experience at [kubernetes/community](https://github.com/kubernetes/community).
>
>/close not-planned
>
>[1]: https://www.kubernetes.dev/docs/guide/issue-triage/

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes/test-infra](https://github.com/kubernetes/test-infra/issues/new?title=Prow%20issue:) repository.