kubernetes / node-problem-detector

This is a place for various problem detectors running on the Kubernetes nodes.
Apache License 2.0

Feature Request: GPU Support #833

Open ZongqiangZhang opened 8 months ago

ZongqiangZhang commented 8 months ago

This feature request aims to enhance the Node Problem Detector with the ability to monitor GPUs on nodes and detect issues.

Currently, NPD has no direct visibility into GPUs. However, many workloads are GPU-accelerated, which makes GPU health an important part of node health. For example, GPUs are widely used in machine learning training and inference, especially LLM training, which may use tens of thousands of GPU cards. If any one of the GPUs in the cluster goes bad, the entire training cluster has to be restarted from the previous checkpoint.

This feature request adds the following capabilities:

Specifically, this feature request includes:

Looking forward to your feedback!

k8s-triage-robot commented 5 months ago

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

You can:

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot commented 4 months ago

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

You can:

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

zz913922 commented 4 months ago

We are in exactly the same situation. Did you resolve this issue already?

stmcginnis commented 3 months ago

/remove-lifecycle stale

stmcginnis commented 3 months ago

/remove-lifecycle rotten

AllenXu93 commented 3 months ago

I think NPD should support different configs for different devices or runtimes. For example, I have both GPU and non-GPU worker nodes in one cluster, or containerd and Docker nodes in one cluster. Currently we need to deploy two NPD DaemonSets for the different node types.

AllenXu93 commented 3 months ago

BTW, for GPUs, we don't need to install more dependencies. We just add the env vars NVIDIA_VISIBLE_DEVICES and NVIDIA_DRIVER_CAPABILITIES and use the nvidia-smi command to check GPU state.
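A minimal sketch of what such a check could look like as an NPD custom plugin script, assuming nvidia-smi is reachable inside the NPD container. The exit codes follow the custom plugin monitor convention (0 = OK, 1 = NonOK, 2 = Unknown). The function names and the choice to treat a timeout as NonOK are illustrative assumptions, not part of NPD or of the setup described above.

```python
import shutil
import subprocess

# Exit codes interpreted by NPD's custom plugin monitor.
OK, NON_OK, UNKNOWN = 0, 1, 2


def classify(returncode, stdout):
    """Map the result of `nvidia-smi -L` to (exit_code, message)."""
    if returncode != 0:
        return NON_OK, f"nvidia-smi exited with status {returncode}"
    # `nvidia-smi -L` prints one line per device, each starting with "GPU <index>:".
    gpus = [line for line in stdout.splitlines() if line.startswith("GPU ")]
    if not gpus:
        return NON_OK, "nvidia-smi reported no GPUs"
    return OK, f"{len(gpus)} GPU(s) visible"


def main():
    """Run nvidia-smi and return the exit code the plugin should use."""
    if shutil.which("nvidia-smi") is None:
        print("nvidia-smi not found in PATH")
        return UNKNOWN
    try:
        proc = subprocess.run(
            ["nvidia-smi", "-L"], capture_output=True, text=True, timeout=30
        )
    except subprocess.TimeoutExpired:
        # Assumption: a hung nvidia-smi is treated as a problem, not as unknown.
        print("nvidia-smi timed out")
        return NON_OK
    code, message = classify(proc.returncode, proc.stdout)
    print(message)
    return code

# A plugin script would finish with: raise SystemExit(main())
```

Keeping `classify` separate from the subprocess call makes the decision logic testable on nodes without a GPU.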

wangzhen127 commented 3 months ago

Thanks for filing the feature request! I think this totally makes sense. Do you have a more concrete proposal?

/cc @SergeyKanzhelev

SergeyKanzhelev commented 3 months ago

Yes, accelerator health is important functionality, and it would be great to have it in NPD.

It needs to be designed carefully, though. There is already some health checking in a device plugin (like https://github.com/NVIDIA/k8s-device-plugin/blob/bf58cc405af03d864b1502f147815d4c2271ab9a/internal/rm/health.go#L39) that we need to work nicely with. Even simple detection of device plugin health is a good starting point here.

@AllenXu93 @ZongqiangZhang do you want to work on a more detailed design? I would definitely be interested in joining the effort.

wangzhen127 commented 3 months ago

/kind feature

AllenXu93 commented 2 months ago

Of course. In our case, we use nvidia-smi to check the GPU remapped-row pending and failure flags (according to https://docs.nvidia.com/deploy/a100-gpu-mem-error-mgmt/#error-recovery-and-response-flags ) and mark a condition on the node. When this occurs, we create a job that drains the node and executes a GPU reset. So we need NPD to check GPU health and mark the node condition.
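The remapped-row check described here could be sketched roughly as follows. This is an illustration under stated assumptions, not the commenter's actual script: the query fields come from the NVIDIA document linked above, the exact value format of the flag columns (0/1 versus No/Yes) is an assumption that the parser tolerates both ways, and `evaluate` is a hypothetical helper name. The return codes follow NPD's custom plugin convention (0 = OK, 1 = NonOK, 2 = Unknown).

```python
import csv
import io

# Exit codes interpreted by NPD's custom plugin monitor.
OK, NON_OK, UNKNOWN = 0, 1, 2

# Command whose CSV output `evaluate` expects (one row per GPU).
QUERY_CMD = [
    "nvidia-smi",
    "--query-remapped-rows=gpu_bus_id,remapped_rows.pending,remapped_rows.failure",
    "--format=csv,noheader",
]


def _flag(value):
    """Parse a flag column; returns None when the value is not recognized."""
    v = value.strip().lower()
    if v in ("1", "yes", "true"):
        return True
    if v in ("0", "no", "false"):
        return False
    return None  # e.g. "[N/A]" on GPUs without row remapping


def evaluate(csv_text):
    """Map nvidia-smi remapped-row CSV output to (exit_code, message)."""
    failed, pending, checked = [], [], 0
    for row in csv.reader(io.StringIO(csv_text), skipinitialspace=True):
        if len(row) != 3:
            continue  # skip blank or malformed lines
        bus_id = row[0].strip()
        is_pending, has_failure = _flag(row[1]), _flag(row[2])
        if is_pending is None or has_failure is None:
            continue
        checked += 1
        if has_failure:
            failed.append(bus_id)
        elif is_pending:
            pending.append(bus_id)
    if failed:
        return NON_OK, "row remap failure: " + ",".join(failed)
    if pending:
        return NON_OK, "row remap pending: " + ",".join(pending)
    if checked == 0:
        return UNKNOWN, "no remapped-row data"
    return OK, "no pending or failed row remaps"
```

A wrapper would run `QUERY_CMD` with `subprocess.run`, pass its stdout to `evaluate`, print the message, and exit with the returned code, so that NPD can set the node condition that triggers the drain-and-reset job.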

xuchenCN commented 1 month ago

LGTM + 1

AllenXu93 commented 1 month ago

Hello, I have received your email and will reply as soon as possible.