openhpc / submissions

OpenHPC Component Submissions Project
LBNL NHC #24

Closed: novosirj closed this issue 6 years ago

novosirj commented 6 years ago

Software Name

LBNL NHC


Public URL

https://github.com/mej/nhc


Technical Overview

NHC was dropped from OpenHPC when it was separated from the Warewulf project; I'm told the removal happened in OpenHPC 1.3. Adding LBNL NHC would restore this missing functionality to OpenHPC.


Latest stable version number

1.4.2


Open-source license type

DOE/LBNL IPO? I'm not sure how to interpret this, so here is the license: https://github.com/mej/nhc/blob/master/LICENSE


Relationship to component?

If other, please describe:


Build system

If other, please describe:

Does the current build system support staged path installations? For example: make install DESTDIR=/tmp/foo (or equivalent)


Does component run in user space or are administrative credentials required?


Does component require post-installation configuration?

If yes, please describe briefly:

Nodes need to have their hardware configuration appropriately detailed, though as of version 1.4.1 there is a configuration generator binary (nhc-genconf) that can be run on a node to generate the parameters.


If component is selected, are you willing and able to collaborate with OpenHPC maintainers during the integration process?


Does the component include test collateral (e.g. regression/verification tests) in the publicly shipped source?

If yes, please briefly describe the intent and location of the tests.

Per the documentation: "NOTE: The 'make test' step is optional but recommended. This will run NHC's built-in unit test suite to make sure everything is functioning properly!"


Does the component have additional software dependencies (beyond compilers/MPI) that are not part of standard Linux distributions?

If yes, please list the dependencies and associated licenses.

The NVIDIA tests require nvidia-healthmon, and integration with a scheduler requires that scheduler's utilities. Some other software (like dmidecode) is part of a standard Linux installation but might not be present in a stripped-down compute node image.


Does the component include online or installable documentation?

If available online, please provide URL.

https://github.com/mej/nhc/blob/master/README.md


[Optional]: Would you like to receive additional review feedback by email?

- [x] yes - [ ] no

scottaltair commented 6 years ago

@novosirj - Thank you for your submission. I read the referenced LBNL NHC GitHub page, and it mentions SLURM, TORQUE, and Grid Engine, but there is no reference to PBS Professional. Have you tried using NHC with PBS Professional open source (www.pbspro.org, which has been part of OpenHPC since v1.2)?

I am asking this question with the expectation of this component supporting multiple resource managers/job schedulers that are available within OpenHPC.

novosirj commented 6 years ago

I'm not 100% sure, though the sample config file mentions PBS Pro, and you can see in that file which checks are TORQUE/PBS Pro specific: https://github.com/mej/nhc/blob/master/nhc.conf I don't know if there were any changes in PBS Pro that would matter; I've never run it (only TORQUE). As far as I know, NHC only runs PBS commands for those checks, so it shouldn't be a problem so long as the command names haven't changed.

scottaltair commented 6 years ago

Thanks, @novosirj. I know that when the Adaptive folks enhanced some of the PBS commands in TORQUE (a PBS variant), some of those commands ended up no longer compatible with PBS Professional. Not necessarily a showstopper, but a gap.

When you registered the component with OHPC, you identified yourself as a user of the component. Do you think that the developer(s) of the component would be open to extending the "PBS" support to include PBS Professional open source? This would allow for PBS Professional open source and commercial customers to use this useful tool.

novosirj commented 6 years ago

I will reach out to the developer. I suspect what they will say is that it already works with PBS Pro. If that is true, perhaps they can edit their site to make that clearer.

Most of what NHC does is check things unrelated to the scheduler and tell the scheduler to offline the node if one of those checks fails. There are a couple of scheduler-related checks, but the rest cover other conditions (free memory, CPU load, NFS mounts, the MCE log, or arbitrary commands with regexes applied to their output). The two main scheduler-specific pieces are that the scheduler can spawn a health check periodically (though it can also be run by cron or as a daemon), and that there is a command for offlining a node (which I believe is user definable but defaults to pbsnodes -o).
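
To make that flow concrete, here is a minimal Python sketch of the pattern described above: run a list of node-local checks and, if one fails, hand the node name to an offline command. This is not NHC's actual code (NHC is driven by its own nhc.conf); the function names, thresholds, and the stop-at-first-failure behavior are assumptions made for illustration, and the default offline command simply mirrors the pbsnodes -o default mentioned above.

```python
import os
import subprocess
from typing import Callable, Optional, Sequence, Tuple

# A check is a (name, callable) pair; the callable returns True when healthy.
Check = Tuple[str, Callable[[], bool]]


def load_ok(max_load: float = 64.0) -> bool:
    """Hypothetical CPU load check: 1-minute load average under a threshold."""
    return os.getloadavg()[0] <= max_load


def mount_ok(path: str = "/") -> bool:
    """Hypothetical mount check: the given path is an active mount point."""
    return os.path.ismount(path)


def first_failure(checks: Sequence[Check]) -> Optional[str]:
    """Return the name of the first failing check, or None if all pass."""
    for name, check in checks:
        if not check():
            return name
    return None


def run_health_check(
    checks: Sequence[Check],
    node: str,
    offline_cmd: Sequence[str] = ("pbsnodes", "-o"),  # assumed default
    runner: Callable[..., object] = subprocess.run,
) -> bool:
    """Run the checks; on failure, ask the scheduler to offline the node.

    The scheduler is only involved in the failure path, which is why the
    offline command is the main piece that differs between schedulers.
    """
    failed = first_failure(checks)
    if failed is None:
        return True
    # Append the node name to the (user-definable) offline command.
    runner([*offline_cmd, node], check=False)
    return False
```

A scheduler, cron job, or daemon could call something like run_health_check([("load", load_ok), ("root mount", mount_ok)], node="n01") periodically; supporting another scheduler would then come down to passing a different offline_cmd.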

Anyway, I’ll reach out today and see what they say.

koomie commented 6 years ago

Thank you for the submission and the additional responses to follow-up questions. The TSC has recommended acceptance of NHC.