Open LourensVeen opened 2 years ago
Looking at the Slurm source code, it seems that the check for none was last touched in 2016, and it's been in there for longer. Digging deeper...
Ah, so it seems that Slurm 17.02 only supports --cpu_bind (with an underscore) and not --cpu-bind (with a hyphen). So it's not about the value, it's about the whole argument.
Recent versions of Slurm still support --cpu_bind, apparently because OpenMPI had it hardcoded for a long time. There's a complaint in the code about them having to continue to support it :-). So I guess QCGPJ could use the underscore version for better compatibility, although there's a chance they'll remove it in the future of course...
Thanks Lourens for debugging the problem - oh how I love the little changes in Slurm and their compatibility with previous versions. I will try to fix this.
I have an idea how to fix this (by checking the Slurm version), but as we have to prepare the release today I would rather not make deep changes at the last minute, so I will prepare the fix after the release.
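For illustration, a version check along those lines could look roughly like the sketch below. This is not the code from the actual fix. The function names are hypothetical, the version text is assumed to look like the output of `srun --version` (for example `slurm 17.02.11`), and the 17.11 cut-off for the hyphenated spelling is an assumption that should be verified against the Slurm release notes.

```python
import re

# Assumption: the hyphenated --cpu-bind spelling is accepted from Slurm 17.11
# onwards. Check the Slurm release notes before relying on this threshold.
HYPHEN_SINCE = (17, 11)


def parse_slurm_version(version_output):
    """Extract (major, minor) from text such as 'slurm 17.02.11'."""
    match = re.search(r"(\d+)\.(\d+)", version_output)
    if match is None:
        raise ValueError("no version number in %r" % version_output)
    return int(match.group(1)), int(match.group(2))


def cpu_bind_flag(version_output, hyphen_since=HYPHEN_SINCE):
    """Pick the spelling of the CPU binding option for a given Slurm version."""
    try:
        version = parse_slurm_version(version_output)
    except ValueError:
        # Unknown version: fall back to the older underscore spelling,
        # which recent Slurm versions reportedly still accept.
        return "--cpu_bind"
    return "--cpu-bind" if version >= hyphen_since else "--cpu_bind"
```

In practice the version text would probably come from something like `subprocess.run(["srun", "--version"], capture_output=True, text=True).stdout`, and the result would be combined with the binding value, e.g. `cpu_bind_flag(text) + "=none"`.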
That's fine, good luck with the release!
I've prepared the fix in #150.
I've discovered that recent versions of OpenMPI set the CPU_BIND environment variable instead of passing command line arguments. That may be an option in the future if --cpu_bind gets removed from Slurm.
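A minimal sketch of the environment-variable route, in case it is ever needed. It assumes the variable in question is `SLURM_CPU_BIND`, which the srun documentation lists as the input variable equivalent to the CPU binding option. The helper name is hypothetical and this is not how QCG-PilotJob currently launches srun.

```python
import os


def srun_env_with_cpu_bind(value="none", base_env=None):
    """Return a copy of the environment with the CPU binding set for srun.

    Assumption: srun reads SLURM_CPU_BIND as the equivalent of the
    command line CPU binding option.
    """
    env = dict(os.environ if base_env is None else base_env)
    env["SLURM_CPU_BIND"] = value
    return env
```

The returned dictionary could then be passed as the `env` argument of `subprocess.Popen` or `subprocess.run` when starting srun, so no binding option has to appear on the command line at all.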
Enabling debug logging like so:
produced a file nl-node053-start-agent-stderr.log in the run directory containing the error:
This is a somewhat antiquated cluster. Is Slurm 17.02 still supported by QCG PilotJob?