Closed tazend closed 8 months ago
@tazend thank you for this PR! We are testing it for our use case. Would you be so kind as to also add some instructions to README.md on how to use the parametrisations you have added, and how to provide those config_overrides when needed?
@tazend would you mind updating the README in a follow-up PR? Thanks!
@tazend I'm happy to give an unfamiliar-eyes review for a README update. I'm excited to try out config_overrides :)
Hi @giovtorres,
yeah, I will try to update the README soon.
Version updates
`openssl`, `tini` and `python`
Multiple node support
Compiles Slurm with `--enable-multiple-slurmd`. Without this option, you cannot correctly operate multiple nodes on a single host. Multiple nodes were previously configured in the slurm.conf, but this always produced lots of errors, even though it appeared to work when requesting the nodes. Running something like `srun` in a multi-node job would always fail, which now works correctly with `--enable-multiple-slurmd`.
Therefore, `supervisord.conf` has also been adapted to start 3 instances of `slurmd` by default.
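
As a rough illustration of the multi-node setup described here, the sketch below prints the kind of slurm.conf node lines and supervisord program sections that three `slurmd` instances on one host could use. It is not taken from this PR: the node names (`c1` to `c3`), the port range, the `CPUs=1` value and the program names are all hypothetical. Note that Slurm's configure script spells the build option `--enable-multiple-slurmd`.

```shell
#!/bin/sh
# Hypothetical sketch: names, ports and sizes are illustrative only.

# Print one slurm.conf node line per slurmd instance on the same host.
# Each node needs a unique name and a unique port.
node_lines() {
    count=$1
    base_port=$2
    i=1
    while [ "$i" -le "$count" ]; do
        echo "NodeName=c$i NodeHostname=localhost Port=$((base_port + i)) CPUs=1"
        i=$((i + 1))
    done
}

# Print one supervisord program section per slurmd instance.
# -D keeps slurmd in the foreground, -N selects the node name it runs as.
supervisor_programs() {
    count=$1
    i=1
    while [ "$i" -le "$count" ]; do
        printf '[program:slurmd-c%s]\ncommand=slurmd -D -N c%s\n\n' "$i" "$i"
        i=$((i + 1))
    done
}

node_lines 3 6000
supervisor_programs 3
```

With several daemons on one host, per-daemon paths in slurm.conf such as `SlurmdPidFile`, `SlurmdSpoolDir` and `SlurmdLogFile` usually need the `%n` node-name placeholder so the instances do not overwrite each other's files.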
Changes in the entrypoint file
`supervisorctl start` already reports whether the service has started or failed, so there is no need to repeatedly poll the state.
Other changes
`gres.conf` and the GPUs attached to the nodes have been removed for now, for the same reason as with the multiple nodes above: you can request a GPU, but it will probably yield errors, since no GPU device file is available. This may need more testing and could perhaps be added back later.
`cgroup.conf` has been added, explicitly setting the cgroup version to v1. Without it, Slurm did not want to start.
Also enable `config_overrides` in the slurm.conf for the slurmd, so we can declare an arbitrary amount of resources for the node, e.g. 56 CPUs and 512 GB of RAM, even though it isn't physically available.
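
A minimal sketch of what the `cgroup.conf` and `config_overrides` changes could look like, written as a script that prints both fragments. This is an assumption about the shape of the configuration, not a copy of the files in this PR: the node name `c1` is hypothetical, and the exact lines used in the image may differ.

```shell
#!/bin/sh
# Hypothetical sketch: the node name "c1" is illustrative only.

# cgroup.conf fragment pinning Slurm to cgroup v1.
cgroup_conf() {
    echo "CgroupPlugin=cgroup/v1"
}

# slurm.conf fragment declaring 56 CPUs and 512 GB of RAM for one node.
# RealMemory is given in megabytes; config_overrides tells Slurm to use the
# configured values even if the hardware reports less.
overrides_conf() {
    mem_mb=$((512 * 1024))
    echo "SlurmdParameters=config_overrides"
    echo "NodeName=c1 CPUs=56 RealMemory=$mem_mb"
}

cgroup_conf
overrides_conf
```

Without `config_overrides`, a node that registers with fewer resources than slurm.conf declares would normally be marked invalid, which is why the option matters when the declared 56 CPUs and 512 GB do not physically exist.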