[Open] fbsb opened this issue 3 years ago
/sig node
We could act on the `FailedToMakePodDataDirectories` and `FailedMountVolume` events, https://github.com/kubernetes/kubernetes/blob/master/pkg/kubelet/kubelet.go#L1696-L1710, and update the node ReadyCondition so pods will not be scheduled. But this should only be for a finite amount of time, otherwise the node will not be allowed to recover.
Not sure if my assessment is on the right track, but if any work needs to be done here, I'd be happy to take it up :)
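For illustration, here is a minimal sketch of that idea in Go (the function names and callbacks are hypothetical, not the real kubelet API): periodically probe whether the kubelet root directory is writable, mark the node while the probe fails, and clear the condition as soon as a write succeeds, which also addresses the recovery concern without a hard-coded timeout.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"time"
)

// probeWritable reports whether rootDir currently accepts writes by
// creating and removing a small probe file.
func probeWritable(rootDir string) error {
	probe := filepath.Join(rootDir, ".kubelet-write-probe")
	f, err := os.Create(probe)
	if err != nil {
		return fmt.Errorf("root dir %q not writable: %w", rootDir, err)
	}
	f.Close()
	return os.Remove(probe)
}

// runProbe flips a node condition (via injected callbacks standing in for
// a node-status update) while writes fail, and clears it once a probe
// succeeds, so the node is allowed to recover.
func runProbe(rootDir string, interval time.Duration, markNotReady func(reason string), markReady func()) {
	for range time.Tick(interval) {
		if err := probeWritable(rootDir); err != nil {
			markNotReady(err.Error())
		} else {
			markReady()
		}
	}
}

func main() {
	runProbe("/var/lib/kubelet", 10*time.Second,
		func(reason string) { fmt.Println("would mark node NotReady:", reason) },
		func() { fmt.Println("filesystem writable; node can be Ready") },
	)
}
```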
Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
If this issue is safe to close now please do so with /close.
Send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
/remove-lifecycle stale
/triage accepted
/area kubelet
/priority important-longterm
> We could act on the `FailedToMakePodDataDirectories` and `FailedMountVolume` events, https://github.com/kubernetes/kubernetes/blob/master/pkg/kubelet/kubelet.go#L1696-L1710, and update the node ReadyCondition so pods will not be scheduled. But this should only be for a finite amount of time, otherwise the node will not be allowed to recover.
> Not sure if my assessment is on the right track, but if any work needs to be done here, I'd be happy to take it up :)

/assign @lyzs90
Your approach seems reasonable to me. Please go ahead with the implementation.
The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.
This bot triages issues and PRs according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:
- Mark this issue or PR as fresh with /remove-lifecycle stale
- Mark this issue or PR as rotten with /lifecycle rotten
- Close this issue or PR with /close

Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.
This bot triages issues and PRs according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:
- Mark this issue or PR as fresh with /remove-lifecycle rotten
- Close this issue or PR with /close

Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle rotten
/remove-lifecycle rotten
The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.
This bot triages issues and PRs according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:
- Mark this issue or PR as fresh with /remove-lifecycle stale
- Mark this issue or PR as rotten with /lifecycle rotten
- Close this issue or PR with /close

Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
/remove-lifecycle stale
The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.
This bot triages issues and PRs according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:
- Mark this issue or PR as fresh with /remove-lifecycle stale
- Mark this issue or PR as rotten with /lifecycle rotten
- Close this issue or PR with /close

Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
/remove-lifecycle stale
The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.
This bot triages issues and PRs according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:
- Mark this issue or PR as fresh with /remove-lifecycle stale
- Mark this issue or PR as rotten with /lifecycle rotten
- Close this issue or PR with /close

Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
/remove-lifecycle stale
The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.
This bot triages issues and PRs according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:
- Mark this issue or PR as fresh with /remove-lifecycle stale
- Mark this issue or PR as rotten with /lifecycle rotten
- Close this issue or PR with /close

Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.
This bot triages issues and PRs according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:
- Mark this issue or PR as fresh with /remove-lifecycle rotten
- Close this issue or PR with /close

Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle rotten
/remove-lifecycle rotten
I just noticed that node-problem-detector can detect read-only filesystems as well. It appears to just look for the message "Remounting filesystem read-only" in the kernel log, which isn't necessarily coming from a relevant filesystem. It seems more useful to me to detect it directly in kubelet and mark the node as NotReady there.
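For reference, a much-simplified sketch of that style of check (this is not NPD's actual code; only the matched message comes from the observation above): scan kernel log records for the pattern and raise a condition, with nothing tying the match to the filesystem kubelet actually depends on.

```go
package main

import (
	"bufio"
	"fmt"
	"os"
	"regexp"
)

// The kernel-log message mentioned above; matching it alone cannot tell
// which filesystem was remounted.
var readonlyPattern = regexp.MustCompile(`Remounting filesystem read-only`)

func main() {
	// /dev/kmsg yields one kernel log record per line.
	f, err := os.Open("/dev/kmsg")
	if err != nil {
		panic(err)
	}
	defer f.Close()

	sc := bufio.NewScanner(f)
	for sc.Scan() {
		line := sc.Text()
		if readonlyPattern.MatchString(line) {
			// A real monitor would set a node condition here; this is
			// where false positives come from, since the message may
			// concern an unrelated filesystem.
			fmt.Println("would set ReadonlyFilesystem condition:", line)
		}
	}
}
```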
This issue has not been updated in over 1 year, and should be re-triaged.
You can:
- Confirm that this issue is still relevant with /triage accepted (org members only)
- Close this issue with /close

For more details on the triage process, see https://www.kubernetes.dev/docs/guide/issue-triage/
/remove-triage accepted
The Kubernetes project currently lacks enough contributors to adequately respond to all issues.
This bot triages un-triaged issues according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:
- Mark this issue as fresh with /remove-lifecycle stale
- Close this issue with /close

Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.
This bot triages un-triaged issues according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:
- Mark this issue as fresh with /remove-lifecycle rotten
- Close this issue with /close

Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle rotten
/remove-lifecycle rotten
I still think something like this is necessary. If the guidance is to use NPD for this, that's fine with me; then one could focus on improving the detection logic in NPD instead. Any opinions here?
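One possible direction for that improved logic, sketched under two assumptions (the kernel message names the device, e.g. `EXT4-fs (sda1): Remounting filesystem read-only`, and `/proc/self/mounts` can map devices to mount points): only treat the event as node-breaking when the remounted device backs kubelet's root directory.

```go
package main

import (
	"bufio"
	"fmt"
	"os"
	"regexp"
	"strings"
)

// Assumes ext4-style messages like
// "EXT4-fs (sda1): Remounting filesystem read-only".
var devPattern = regexp.MustCompile(`\(([^)]+)\): Remounting filesystem read-only`)

// mountPointsOf returns the mount points of the given device, read from
// /proc/self/mounts (fields: device mountpoint fstype options ...).
func mountPointsOf(dev string) ([]string, error) {
	f, err := os.Open("/proc/self/mounts")
	if err != nil {
		return nil, err
	}
	defer f.Close()

	var points []string
	sc := bufio.NewScanner(f)
	for sc.Scan() {
		fields := strings.Fields(sc.Text())
		if len(fields) >= 2 && strings.HasSuffix(fields[0], dev) {
			points = append(points, fields[1])
		}
	}
	return points, sc.Err()
}

// relevantToKubelet reports whether a read-only remount message concerns a
// filesystem that kubelet's root dir lives on.
func relevantToKubelet(kernelMsg, kubeletRoot string) bool {
	m := devPattern.FindStringSubmatch(kernelMsg)
	if m == nil {
		return false // no device in the message; cannot decide
	}
	points, err := mountPointsOf(m[1])
	if err != nil {
		return false
	}
	for _, p := range points {
		if strings.HasPrefix(kubeletRoot, p) {
			return true
		}
	}
	return false
}

func main() {
	msg := "EXT4-fs (sda1): Remounting filesystem read-only"
	fmt.Println(relevantToKubelet(msg, "/var/lib/kubelet"))
}
```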
The Kubernetes project currently lacks enough contributors to adequately respond to all issues.
This bot triages un-triaged issues according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:
- Mark this issue as fresh with /remove-lifecycle stale
- Close this issue with /close

Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
What happened:
We noticed on our bare-metal clusters that nodes do not become NotReady after a read-only remount by the kernel, e.g. due to a filesystem corruption. This causes pods to get scheduled on the node but fail to start, as the kubelet cannot create directories for the pod.

What you expected to happen:
Kubelet should notice that it cannot write to the filesystem and prevent further pods from being scheduled on the node.
How to reproduce it (as minimally and precisely as possible):
With the filesystem backing /var/lib/kubelet on a node (here minikube-m02) remounted read-only, schedule new pods:

```console
$ kubectl get pod
NAME                       READY   STATUS              RESTARTS   AGE
hello-1-657cb9b9f5-brbf4   1/1     Running             0          8m41s
hello-2-7ddff58f66-6mgbm   0/1     ContainerCreating   0          16s

$ kubectl describe pod hello-2-7ddff58f66-6mgbm
...
Events:
  Type     Reason       Age               From               Message
  ----     ------       ---               ----               -------
  Normal   Scheduled    33s               default-scheduler  Successfully assigned default/hello-2-7ddff58f66-6mgbm to minikube-m02
  Warning  Failed       9s (x3 over 33s)  kubelet            error making pod data directories: mkdir /var/lib/kubelet/pods/b7d540b3-c949-4fad-becc-76743a654467: read-only file system
  Warning  FailedMount  1s (x7 over 33s)  kubelet            MountVolume.SetUp failed for volume "default-token-5fjs5" : mkdir /var/lib/kubelet/pods/b7d540b3-c949-4fad-becc-76743a654467: read-only file system
```