kubernetes-sigs / azurelustre-csi-driver

Apache License 2.0

CrashLoopBackOff in some csi-azurelustre-node pods #141

Closed varameshgithub closed 7 months ago

varameshgithub commented 12 months ago

What happened:

We are facing intermittent CrashLoopBackOff in some of the csi-azurelustre-node pods on AKS worker nodes. The issue does not occur often, but it appears intermittently, usually when we scale the cluster up to a large number of nodes (say from 4 to 300) and sometimes while onboarding a new AKS cluster. While the new nodes are being created, some of the csi-azurelustre-node pods go into CrashLoopBackOff and restart multiple times with the same error.

```
2023-10-22T07:30:46.79512136Z stderr F + [[ yes == \y\e\s ]]
2023-10-22T07:30:46.795371005Z stderr F ++ uname -r
2023-10-22T07:30:46.795924737Z stderr F + kernelVersion=5.15.0-1049-azure
2023-10-22T07:30:46.796250124Z stderr F ++ date -u
2023-10-22T07:30:46.796921159Z stdout F Sun Oct 22 07:30:46 UTC 2023 Installing Lustre client packages for OS=jammy, kernel=5.15.0-1049-azure
2023-10-22T07:30:46.796930232Z stderr F + echo 'Sun Oct 22 07:30:46 UTC 2023 Installing Lustre client packages for OS=jammy, kernel=5.15.0-1049-azure '
2023-10-22T07:30:46.796940876Z stderr F + '[' '!' -f /etc/apt/sources.list.d/amlfs.list ']'
2023-10-22T07:30:46.79728655Z stderr F ++ date -u
2023-10-22T07:30:46.798187775Z stdout F Sun Oct 22 07:30:46 UTC 2023 Installing Lustre client modules: amlfs-lustre-client-2.15.1-29-gbae0abe=5.15.0-1049-azure
2023-10-22T07:30:46.798192133Z stderr F + echo 'Sun Oct 22 07:30:46 UTC 2023 Installing Lustre client modules: amlfs-lustre-client-2.15.1-29-gbae0abe=5.15.0-1049-azure'
2023-10-22T07:30:46.798198395Z stderr F + DEBIANFRONTEND=noninteractive
2023-10-22T07:30:46.798202182Z stderr F + apt install -y --no-install-recommends -o DPkg::options::=--force-confdef -o DPkg::options::=--force-confold amlfs-lustre-client-2.15.1-29-gbae0abe=5.15.0-1049-azure
2023-10-22T07:30:46.801695387Z stderr F
2023-10-22T07:30:46.801703462Z stderr F WARNING: apt does not have a stable CLI interface. Use with caution in scripts.
2023-10-22T07:30:46.801706187Z stderr F
2023-10-22T07:30:46.960770307Z stdout F Reading package lists...
2023-10-22T07:30:47.094345613Z stdout F Building dependency tree...
2023-10-22T07:30:47.094834214Z stdout F Reading state information...
2023-10-22T07:30:47.108013087Z stderr F E: Unable to locate package amlfs-lustre-client-2.15.1-29-gbae0abe
2023-10-22T07:30:47.108022515Z stderr F E: Couldn't find any package by glob 'amlfs-lustre-client-2.15.1-29-gbae0abe'
```

What you expected to happen:

The csi-azurelustre-node* pods should ideally be in a healthy state.

How to reproduce it:

The issue is intermittent and does not occur in the happy-path scenario. It occurs while the csi-azurelustre-node container runs entrypoint.sh, the Docker entrypoint referenced in the Dockerfile. The failure logs from a crash-looping csi-azurelustre-node pod are attached above.

Anything else we need to know?:

Here is what we suspect happens during csi-azurelustre-node pod start. When we checked the failure logs of multiple pods, we saw the same error shown in the logs above: during execution of the command apt install -y --no-install-recommends -o DPkg::options::="--force-confdef" -o DPkg::options::="--force-confold" ${pkgName}=${kernelVersion}, apt reports "Unable to locate package amlfs-lustre-client-2.15.1-29-gbae0abe".
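
For reference, the failing step from the log boils down to roughly the following; the package name and kernel version are taken from the log above, while the variable names are illustrative rather than the driver's exact script:

```sh
# Illustrative reconstruction of the failing step (values from the log above).
pkgName="amlfs-lustre-client-2.15.1-29-gbae0abe"
kernelVersion="$(uname -r)"   # resolves to 5.15.0-1049-azure on the affected nodes

# apt can only resolve this package if the amlfs repository metadata has been
# fetched; otherwise it fails with "Unable to locate package ...".
DEBIAN_FRONTEND=noninteractive apt install -y --no-install-recommends \
  -o DPkg::options::="--force-confdef" \
  -o DPkg::options::="--force-confold" \
  "${pkgName}=${kernelVersion}"
```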

In entrypoint.sh, a repository file is added under /etc/apt/sources.list.d/amlfs.list if it is not already present, followed by an apt-get update. We suspect one of the following: the pod restarts during the apt-get update, the apt-get update itself fails, or another pod performs an update at the same time and conflicts with apt, since csi-azurelustre-node has /etc mounted from the host node during the first pod start. The result is that amlfs.list exists but apt-get update never completed, and because of the guard condition in entrypoint.sh the update is not re-triggered on subsequent restarts, so the node ends up in an unrecoverable state. A minimal sketch of the suspected pattern and a restart-safe variant is shown below.
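
The sketch below illustrates the suspected guard and one way to make it restart-safe, assuming the entrypoint follows roughly this pattern; the repository line and REPO_URL are placeholders, not the driver's actual configuration:

```sh
# Suspected pattern: the repo file is written and apt-get update runs only once.
if [ ! -f /etc/apt/sources.list.d/amlfs.list ]; then
    # REPO_URL is a placeholder for the actual AMLFS package repository.
    echo "deb [arch=amd64] ${REPO_URL} jammy main" > /etc/apt/sources.list.d/amlfs.list
    apt-get update
fi
# If the pod restarts (or the update fails) after amlfs.list is written but
# before apt-get update completes, every later restart skips the update and
# apt reports "Unable to locate package amlfs-lustre-client-...".

# Restart-safe variant: always refresh the package metadata before installing.
if [ ! -f /etc/apt/sources.list.d/amlfs.list ]; then
    echo "deb [arch=amd64] ${REPO_URL} jammy main" > /etc/apt/sources.list.d/amlfs.list
fi
apt-get update
```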

Environment:

jofergus commented 11 months ago

Are there more details you can add? Are the VMs crashing, then rebooting and returning to operation? Do you have an example of the crash message?

varameshgithub commented 11 months ago

@jofergus Sometimes the pods become operational after crashing and restarting. However, we usually need to intervene to get a pod running, and it typically crash-loops multiple times first. When we check the logs of most of these pods, we see the same error shown above. In the attached k9s screenshot you can see multiple restarts of the pods when we scale the number of nodes up from 4 to 300.

As for why a pod fails during its first start, we don't have a crash message, since the pod has already restarted and failed multiple times by the time we try to fetch the logs.

[Attached screenshot: lustre-failures]

dabradley commented 11 months ago

The PR that was submitted for this issue was tested and looks good. We have merged it, and we're working on testing a new release that will also include other fixes. Thanks for the PR!

k8s-triage-robot commented 8 months ago

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

- Mark this issue as fresh with /remove-lifecycle stale
- Close this issue with /close
- Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

dabradley commented 7 months ago

This PR has been included in the driver release.