mej / nhc

LBNL Node Health Check
Other
213 stars 78 forks source link

Cannot find check_nvsmi_healthmon() in 1.4.3 #113

Open OleHolmNielsen opened 2 years ago

OleHolmNielsen commented 2 years ago

This new feature in 1.4.3:

check_nvsmi_healthmon(): New check from CSC for GPU health monitoring via nvidia-smi

doesn't seem to be present in the release RPM file lbnl_nv.nhc. How does one use this new check? Thanks, Ole

mej commented 2 years ago

Hey Ole!

The check_nvsmi_healthmon() check is defined at scripts/csc_nvidia_smi.nhc:27; it's not in lbnl_nv.nhc like the original nVidia checks. Do you see the same thing as that link shows, or do you have something different?

OleHolmNielsen commented 2 years ago

The file scripts/csc_nvidia_smi.nhc is absent from the 1.4.3 RPM package. Could you add it? Thanks, Ole

mej commented 2 years ago

Hey @OleHolmNielsen!

The file scripts/csc_nvidia_smi.nhc is absent from the 1.4.3 RPM package. Could you add it?

Well, yes...and no. :-)

I do see exactly what the problem is. When I merged #5, I failed to notice that the PR created the file with the check in the correct location and with the correct name but did not also add the file to the list of packaged script files in Makefile.am. All that needs to happen is for csc_nvidia_smi.nhc to be added to the nobase_dist_conf_DATA variable at that location.

Having said that... The Fine Folks ("Feyn Folks?" 🤣 🤦) at the Feynman Center for Innovation are actively reviewing my disclosure filing and my request to open-source and publish our future DOE/LANL/Triad-owned contributions directly here on GitHub. They've never had a situation quite like this before (since most of NHC is already DOE/LBNL/UC-owned and was disclosed to DOE several years ago), so to avoid making it any more complicated than it already is, I've promised not to work on NHC publicly while they're trying to get all this figured out. And while I didn't anticipate this, I'm confident that it will totally be worth it in the long run!

So I can see 2 possible solutions here:

  1. You (or someone else?) can submit a Pull Request that makes the above-described change, and I can merge it.
  2. I can apply the above fix locally myself, but I won't be able to push it to GitHub or Bitbucket until FCI completes their review/approval process.

Thoughts?

OleHolmNielsen commented 2 years ago

Hi Michael,

I've created a PR with the file scripts/csc_nvidia_smi.nhc and Makefile.am. I tried to reach out to the original author Johan Guldmyr, but he's no longer reachable. I hope this PR was made correctly?

Thanks, Ole

On 3/18/22 19:42, Michael Jennings wrote:

Hey @OleHolmNielsen https://github.com/OleHolmNielsen!

The file |scripts/csc_nvidia_smi.nhc| is absent from the 1.4.3 RPM
package. Could you add it?

Well, yes...and no. :-)

I do see exactly what the problem is. When I merged #5 https://github.com/mej/nhc/pull/5, I failed to notice that the PR created the file with the check in the correct location and with the correct name but /did not/ also add the file to the list of packaged script files https://github.com/mej/nhc/blob/d534d41db4237b018f18a3063c4ceb86b91fbe42/Makefile.am#L12-L17 in Makefile.am https://github.com/mej/nhc/blob/d534d41db4237b018f18a3063c4ceb86b91fbe42/Makefile.am. All that needs to happen is for |csc_nvidia_smi.nhc| https://github.com/mej/nhc/blob/d534d41db4237b018f18a3063c4ceb86b91fbe42/scripts/csc_nvidia_smi.nhc to be added to the |nobase_dist_conf_DATA| variable at that location.

Having said that... The Fine Folks ("Feyn Folks?" 🤣 🤦) at the Feynman Center for Innovation https://www.lanl.gov/projects/feynman-center/ are actively reviewing my disclosure filing and my request to open-source and publish our future DOE/LANL/Triad-owned contributions directly here on GitHub. They've never had a situation quite like this before (since most of NHC is already DOE/LBNL/UC-owned and was disclosed to DOE several years ago), so to avoid making it any more complicated than it already is, I've promised not to work on NHC publicly while they're trying to get all this figured out. And while I didn't anticipate this, I'm confident that it will totally be worth it in the long run!

So I can see 2 possible solutions here:

  1. You (or someone else?) can submit a Pull Request <../pulls/> that makes the above-described change, and I can merge it.
  2. I can apply the above fix locally myself, but I won't be able to push it to GitHub or Bitbucket until FCI completes their review/approval process.

Thoughts?

— Reply to this email directly, view it on GitHub https://github.com/mej/nhc/issues/113#issuecomment-1072699281, or unsubscribe https://github.com/notifications/unsubscribe-auth/AE32EABSL65G4XGPUWWYNCTVATFBBANCNFSM5QKQ5V4Q. Triage notifications on the go with GitHub Mobile for iOS https://apps.apple.com/app/apple-store/id1477376905?ct=notification-email&mt=8&pt=524675 or Android https://play.google.com/store/apps/details?id=com.github.android&referrer=utm_campaign%3Dnotification-email%26utm_medium%3Demail%26utm_source%3Dgithub.

You are receiving this because you were mentioned.Message ID: @.***>

-- Ole Holm Nielsen PhD, Senior HPC Officer Department of Physics, Technical University of Denmark, Fysikvej Building 309, DK-2800 Kongens Lyngby, Denmark E-mail: @.*** Homepage: http://dcwww.fysik.dtu.dk/~ohnielse/ Mobile: (+45) 5180 1620