mej / nhc

LBNL Node Health Check

NVVS (part of NVIDIA DCGM) has replaced nv-healthmon. NHC will fail on new GPUs w/o code mods #29

Open jrcoombs opened 7 years ago

jrcoombs commented 7 years ago

NVVS (part of NVIDIA's DCGM: Data Center GPU Manager) is the replacement for nv-healthmon, which is deprecated and unsupported for new and future NVIDIA hardware. Health checking for Pascal microarchitecture (P100/P4/P40 and later) NVIDIA GPUs installed on clusters using NHC will fail without appropriate modifications to NHC.

DCGM link: http://www.nvidia.com/object/data-center-gpu-manager.html

I can put you in direct contact with the DCGM engineering team at NVIDIA and get you the appropriate GPUs for your development and testing. If you are interested, just send me an email.

John Coombs, Tesla BU Alliance Management, NVIDIA, jcoombs@nvidia.com

jmcculloch4 commented 6 years ago

How would I acquire the Release Candidate referenced in NVIDIA document DU-07862-001_v1.3, page 25? We are testing a GPU cluster and are looking for more verbose output from 'dcgmi diag -r 3', which only returns "PCIe Fail - All". That is too vague to be helpful.
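
For readers following along, here is a minimal shell sketch of how a node check might run the diagnostic mentioned above and surface only the failing rows (such as "PCIe Fail - All"). It is an illustration, not part of NHC or DCGM: the function name `diag_failures` and the `DIAG_CMD` variable are hypothetical, and matching on the word "Fail" is an assumption based on the output quoted in this thread. It does not make the report more verbose; it only filters what the command already prints.

```shell
#!/bin/bash
# Hypothetical helper (not part of NHC or DCGM).
# DIAG_CMD is an assumed, overridable command line; the default is the
# invocation quoted in this thread.
DIAG_CMD=(dcgmi diag -r 3)

# Print every report line that mentions "fail" (case-insensitive).
# Returns 0 if no such line was found, 1 if at least one was found,
# and 2 if the diagnostic command is not available on this node.
diag_failures() {
    local output failures

    if ! command -v "${DIAG_CMD[0]}" >/dev/null 2>&1; then
        echo "diag_failures: ${DIAG_CMD[0]} not found" >&2
        return 2
    fi

    # Capture stdout and stderr. The exit status is deliberately ignored:
    # this sketch relies only on the text of the report.
    output=$("${DIAG_CMD[@]}" 2>&1) || true
    failures=$(printf '%s\n' "$output" | grep -i 'fail') || true

    if [ -n "$failures" ]; then
        printf '%s\n' "$failures"
        return 1
    fi
    return 0
}
```

A site could call `diag_failures` from its own check script and treat a non-zero return as an unhealthy node. The real report layout may differ between DCGM versions, so the pattern would need checking against actual output.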

bstollenvidia commented 6 years ago

All DCGM packages and docs can be obtained here: https://developer.nvidia.com/data-center-gpu-manager-dcgm

mick-t commented 1 year ago

FYI, nvvs doesn't seem to work with MIG-enabled GPUs:

/usr/share/nvidia-validation-suite/nvvs

DCGM GPU Diagnostic (version 418)

GPU 0's MIG configuration is incompatible with the diagnostic because it prevents access to the entire GPU.

mick-t commented 1 year ago

If you need help testing any new tools to check NVIDIA cards, I can help.

jrcoombs commented 1 year ago

I am no longer with NVIDIA. (I retired in 2020.) Duncan (copied) can tell you who to be in touch with there.

John

On May 24, 2023, at 16:30, Mick T. wrote: If you need help testing any new tools to check on nvidia cards I can help.
