psnc-qcg / QCG-PilotJob

The QCG Pilot Job service for execution of many computing tasks inside one allocation
Apache License 2.0

Node agent fails to start on Slurm 17.02 #146

Open LourensVeen opened 2 years ago

LourensVeen commented 2 years ago
from qcg.pilotjob.api.manager import LocalManager as QCGManager
qcg_manager = QCGManager()
# wait for 10 minutes
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/lveen/qcgpj_venv/lib/python3.6/site-packages/qcg/pilotjob/api/manager.py", line 741, in __init__
    raise errors.ServiceError('Service not started')
qcg.pilotjob.api.errors.ServiceError: Service not started

Enabling debug logging like so:

from qcg.pilotjob.api.manager import LocalManager as QCGManager
qcg_manager = QCGManager(['--log', 'debug'])

produced a file nl-node053-start-agent-stderr.log in the run directory containing the error:

/cm/shared/apps/slurm/17.02.2/bin/srun: unrecognized option '--cpu-bind=none'
Try "srun --help" for more information

This is a somewhat antiquated cluster. Is Slurm 17.02 still supported by QCG PilotJob?

LourensVeen commented 2 years ago

Looking at the Slurm source code, it seems that the check for none was last touched in 2016, and it's been in there for longer. Digging deeper...

LourensVeen commented 2 years ago

Ah, so it seems that Slurm 17.02 only supports --cpu_bind (with an underscore) and not --cpu-bind (with a hyphen). So the problem is not the value none, it's the spelling of the option itself.
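One way to cope with both spellings without knowing the Slurm version is to ask srun itself. The following is a minimal sketch, assuming that `srun --help` lists the option under the spelling that release accepts. The function names are hypothetical and not part of QCG-PilotJob.

```python
import re
import subprocess


def cpu_bind_flag_from_help(help_text):
    """Return the CPU-binding option spelling found in srun's help text.

    Prefers the hyphenated spelling and falls back to the underscore one.
    Returns None if neither spelling appears.
    """
    if re.search(r"--cpu-bind\b", help_text):
        return "--cpu-bind"
    if re.search(r"--cpu_bind\b", help_text):
        return "--cpu_bind"
    return None


def probe_srun(srun="srun"):
    """Run `srun --help` and report which spelling it advertises."""
    result = subprocess.run(
        [srun, "--help"],
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
        universal_newlines=True,  # spelled this way for Python 3.6
    )
    return cpu_bind_flag_from_help(result.stdout)
```

On the cluster from this report, `probe_srun("/cm/shared/apps/slurm/17.02.2/bin/srun")` would be expected to return the underscore spelling, provided the help text behaves as assumed.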

LourensVeen commented 2 years ago

Recent versions of Slurm still support --cpu_bind, apparently because OpenMPI had it hardcoded for a long time. There's a complaint in the code about them having to continue to support it :-). So I guess QCGPJ could use the underscore version for better compatibility, although there's a chance they'll remove it in the future of course...

pkopta commented 2 years ago

Thanks Lourens for debugging the problem - oh how I love the little changes in Slurm and their compatibility with previous versions. I will try to fix this.

pkopta commented 2 years ago

I have an idea how to fix this (by checking the Slurm version), but as we have to prepare the release today I would rather not make such deep changes at the last minute, so I will prepare the fix after the release.
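A version check along those lines could look roughly like the sketch below. It is not necessarily what the eventual fix does. It assumes that `srun --version` prints something like `slurm 17.02.2`, and that the hyphenated spelling first appeared in 17.11. That cutoff is an assumption not confirmed in this thread, so it is kept as a parameter. All names are hypothetical.

```python
import re

# Assumption: first release accepting --cpu-bind (hyphen). Verify before relying on it.
ASSUMED_FIRST_HYPHEN_RELEASE = (17, 11)


def parse_slurm_version(version_output):
    """Extract (major, minor) from output such as 'slurm 17.02.2'."""
    match = re.search(r"(\d+)\.(\d+)", version_output)
    if match is None:
        raise ValueError("cannot parse Slurm version from %r" % version_output)
    return int(match.group(1)), int(match.group(2))


def cpu_bind_option(version_output, first_hyphen_release=ASSUMED_FIRST_HYPHEN_RELEASE):
    """Pick the srun argument that disables CPU binding for this Slurm version."""
    if parse_slurm_version(version_output) >= first_hyphen_release:
        return "--cpu-bind=none"
    return "--cpu_bind=none"
```

For the cluster in this report, `cpu_bind_option("slurm 17.02.2")` selects the underscore spelling.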

LourensVeen commented 2 years ago

That's fine, good luck with the release!

pkopta commented 2 years ago

I've prepared the fix in #150.

LourensVeen commented 2 years ago

I've discovered that recent versions of OpenMPI set the SLURM_CPU_BIND environment variable instead of passing command line arguments. That may be an option in the future if --cpu_bind gets removed from Slurm.
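A minimal sketch of that environment-variable approach, assuming srun reads the binding setting from `SLURM_CPU_BIND` (the name the srun man page gives for its input variable) and that the value `none` has the same effect as the command line option. The helper name is hypothetical and not part of QCG-PilotJob.

```python
import os


def srun_env_without_binding(base_env=None):
    """Return a copy of the environment that asks srun not to bind CPUs.

    The caller's mapping (or os.environ) is left unmodified.
    """
    env = dict(os.environ if base_env is None else base_env)
    env["SLURM_CPU_BIND"] = "none"  # assumed equivalent to --cpu_bind=none
    return env
```

The resulting mapping would be passed as `env=` to `subprocess.Popen` when launching srun, so no binding option needs to appear on the command line at all.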