mej / nhc

LBNL Node Health Check

Question for faster execution: Seeing cpu_info add 10 secs to execution #141

Open jebbaxley opened 10 months ago

jebbaxley commented 10 months ago

Strangely, when I add this cpu_info check, the script takes 10 seconds longer to execute.

Am I adding this incorrectly? Also, how can I be sure NHC is running the checks in parallel for faster execution? I'm trying to minimize the time spent on health checking.

Time with the check: real 0m11.548s, user 0m0.246s, sys 0m10.159s

Time without the check: real 0m0.119s, user 0m0.062s, sys 0m0.018s

mej commented 10 months ago

Hey Jeb! Great to hear from you again! 😃

Not sure how I missed seeing this before... Good thing I checked the Pulse page. 😖

What version of NHC is it that you're running? For this specific check, I'd strongly recommend using the NHC 1.5 code currently in the dev branch; while 1.5 hasn't been released yet, the dev branch has a fix for this exact issue -- #121 (commit 7e2a8c6a). (At least I think that's what you're seeing.)

Feedback on the fix is definitely welcome!

You might also be able to get away with just dropping in the scripts/lbnl_hw.nhc from the dev branch. I've never tried this myself, exactly, but it should be pretty self-contained. Of course, you'd also need test/test_lbnl_hw.nhc dropped in too if you wanted to run the unit tests for the new module. Feedback on this method is also welcome, if you decide to try it.

Of course, if it would make things easier on you, I'm happy to provide snapshot tarballs and/or RPMs; just let me know!

jebbaxley commented 10 months ago

Thanks for getting back to me! I'm currently trying to incorporate this with a new workload manager. Is there a simple way to provide scripts that drain and undrain?

mej commented 10 months ago

> Thanks for getting back to me! I'm currently trying to incorporate this with a new workload manager. Is there a simple way to provide scripts that drain and undrain?

In the default configuration, the scripts that handle draining/offlining and undraining/onlining nodes are node-mark-offline and node-mark-online, respectively. By default, they get installed into /usr/libexec/nhc/ (or /usr/lib/nhc/ on Debian). Modifying those scripts is one option -- and if you're considering contributing your support for this other WLM to the upstream project, this would definitely be the way to go! -- since the handling of the different RM/WLM products is pretty straightforward. Another option would be to change the values of the OFFLINE_NODE and ONLINE_NODE config variables; those control what commands NHC will use to drain or resume a node.
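As a rough sketch of the second option, a site-specific pair of helpers could look something like the following. It assumes NHC calls the offline helper with the hostname followed by a free-text note, and the online helper with just the hostname; that is worth verifying against the node-mark-offline and node-mark-online scripts in your install. `mywlm-admin`, its `drain`/`resume` subcommands and flags, and the `WLM_CMD` and `DRY_RUN` variables are all hypothetical placeholders, not part of NHC.

```shell
# Hypothetical helpers for an in-house WLM; "mywlm-admin" is a placeholder CLI.
# Assumption: the offline helper is called as <helper> <hostname> <note...>,
# and the online helper as <helper> <hostname>.

WLM_CMD="${WLM_CMD:-mywlm-admin}"

# Run the WLM command, or only print it when DRY_RUN=1.
run_wlm() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$WLM_CMD $*"
    else
        "$WLM_CMD" "$@"
    fi
}

# Drain a node, passing along whatever note the caller supplied.
drain_node() {
    if [ -z "${1:-}" ]; then
        echo "usage: drain_node <hostname> [note...]" >&2
        return 2
    fi
    host="$1"
    shift
    note="${*:-NHC: node failed health check}"
    run_wlm drain --node "$host" --reason "$note"
}

# Return a node to service.
resume_node() {
    if [ -z "${1:-}" ]; then
        echo "usage: resume_node <hostname>" >&2
        return 2
    fi
    run_wlm resume --node "$1"
}
```

Each function could then be wrapped in a small executable script and referenced from OFFLINE_NODE and ONLINE_NODE. Running with `DRY_RUN=1` first shows the command that would be issued without touching the scheduler.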

jebbaxley commented 10 months ago

Thanks! I had found those as well. I'll ask if the team wants to push it upstream, but I doubt they'll want to, as the WLM was built in-house for their specific workload. I saw Frontier was released. How's the new cluster doing? And how's the team? Hope the crazy on-call has calmed down.