multiscale / muscle3

The third major version of the MUltiScale Coupling Library and Environment
Apache License 2.0

ValueError: invalid literal for int() with base 10: '0,40' received when submitting to a batch node (via SLURM) #250

Open DavidPCoster opened 1 year ago

DavidPCoster commented 1 year ago

I received the following error message when I tried submitting my workflow to a SLURM batch node:

Process QCGPJProcessManager:
Traceback (most recent call last):
  File "/mpcdf/soft/SLE_12/packages/x86_64/anaconda/3/2023.03/lib/python3.10/multiprocessing/process.py", line 314, in _bootstrap
    self.run()
  File "/cobra/u/dpc/GIT/ets_paf/UQ/muscle3_venv/lib/python3.10/site-packages/libmuscle/manager/qcgpj_instantiator.py", line 131, in run
    self._send_resources()
  File "/cobra/u/dpc/GIT/ets_paf/UQ/muscle3_venv/lib/python3.10/site-packages/libmuscle/manager/qcgpj_instantiator.py", line 211, in _send_resources
    resources.cores[node.name] = set(map(int, node.free_ids))
ValueError: invalid literal for int() with base 10: '0,40'
slurmstepd: error: *** JOB 6907863 ON co6601 CANCELLED AT 2023-06-30T15:45:27 DUE TO TIME LIMIT ***
slurmstepd: error: *** STEP 6907863.0 ON co6601 CANCELLED AT 2023-06-30T15:45:27 DUE TO TIME LIMIT ***
srun: Job step aborted: Waiting up to 302 seconds for job step to finish.

Since the path points to libmuscle, I assume this is the right starting point. I realize that this might end up being a QCG-PJ issue, but I thought I would start here.
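For reference, a minimal sketch of why that call fails, assuming node.free_ids comes back with comma-joined hyperthread groups rather than single core ids (the values below are illustrative, not taken from the actual job):

    # Illustrative reproduction of the failing expression in qcgpj_instantiator.py.
    # If QCG-PJ describes a node in terms of hyperthread groups, free_ids can
    # contain strings like '0,40' instead of plain single-core ids like '0'.
    free_ids = ['0,40', '1,41']
    cores = set(map(int, free_ids))
    # ValueError: invalid literal for int() with base 10: '0,40'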

LourensVeen commented 1 year ago

I'm using QCG-PJ in a rather non-standard way, so I'd very much prefer any issues to be reported here. If it does turn out to be a QCG-PJ issue, then I'll go and file an issue there.

This does look like we're getting an odd resource description from QCG-PJ, though it may only be odd in the sense that I'm not expecting it, even though it actually makes sense.

Could you try this with --log-level DEBUG as an additional option to muscle_manager? That should give a lot of log output from QCG-PJ about how it's interpreting the SLURM environment variables. If it's a lot of output and/or you don't want to share it here publicly then feel free to email it to me.

Browsing through the QCG-PJ code, it seems like this will print a line starting with 'cpu list' for each node; that would be of interest.
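For example, assuming the manager is started from the batch script with --start-all and a single yMMSL file (the file name here is a placeholder for whatever the workflow actually uses):

    # Add --log-level DEBUG to the existing muscle_manager invocation;
    # workflow.ymmsl stands in for the real configuration file.
    muscle_manager --log-level DEBUG --start-all workflow.ymmsl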

DavidPCoster commented 1 year ago

Attached please find the muscle3 manager log after setting the log level to debug:

muscle3_manager.log

Thanks, Dave.

LourensVeen commented 1 year ago

After deciphering the QCG-PJ source a bit, I think that this is related to hyperthreading. I'm expecting to get a string with a single int containing a core id, but I'm getting '0,40', designating two hyperthreads on a single core. Hyperthreads, cores, and MPI tasks are all getting a bit mixed up here, and it's hard to see what the intent of the code is. Also, I need to think about what the intent should be, given that we may not be doing MPI and may or may not want to do hyperthreading on a per-submodel basis. And then how that relates to the resource requirements in the yMMSL file.

Meanwhile, are you starting with --ntasks-per-node=1? Could you try with --ntasks-per-node=40? You've got 40 cores after all, and may as well use them, plus that should nudge QCG-PJ down another code path and may just avoid the problem. I've run MUSCLE3 on machines with hyperthreading enabled with --ntasks-per-node set to the number of physical cores, and it worked there.
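For example, the relevant sbatch directives would change roughly as follows, assuming 40 physical cores per node (as suggested by the '0,40' hyperthread pairing) and leaving the rest of the script untouched:

    # before: a single task that owns all cores on the node
    ##SBATCH --ntasks-per-node=1
    ##SBATCH --cpus-per-task=40

    # suggested: one task (i.e. one slot) per physical core
    #SBATCH --ntasks-per-node=40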

DavidPCoster commented 1 year ago

I changed --ntasks-per-node=1 to --ntasks-per-node=40, and commented out the --cpus-per-task=40.

We get some different errors ...

muscle3_manager.log fusion_qmc.err.6914444.txt

LourensVeen commented 1 year ago

I think we have 40 instances of the manager being started now. That's overdoing it slightly :smile:. Are you starting muscle_manager through srun? Could you post your batch script maybe?

DavidPCoster commented 1 year ago

Yes & yes. fusion_qmmc.slurm.txt

(.txt added so that I can upload the file)

LourensVeen commented 1 year ago

Okay, yes, removing srun from in front of muscle_manager should help. Then you'll have one instance of muscle_manager, with an environment that specifies that we want 40 tasks at most in total. QCG-PJ will pick that up as having 40 cores (slots) available, and MUSCLE3 will then suballocate from there to place the instances.

MUSCLE3 will set OMP_NUM_THREADS for submodels according to their resource requirements, so there's no need to set that in the script.
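Roughly, the batch script would then look something like this (job name, wall time, paths and the workflow file name are placeholders, not the actual script attached above):

    #!/bin/bash
    #SBATCH --job-name=fusion_uq
    #SBATCH --nodes=1
    #SBATCH --ntasks-per-node=40
    #SBATCH --time=08:00:00

    source muscle3_venv/bin/activate

    # Start the manager directly, without srun: a single manager process,
    # which then suballocates the 40 slots among the submodel instances.
    muscle_manager --start-all workflow.ymmsl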

To do:

- Improve the documentation on writing SLURM batch scripts for MUSCLE3, in particular that muscle_manager should not be started through srun.

DavidPCoster commented 1 year ago

That seems to have fixed the problem -- thanks!

LourensVeen commented 1 year ago

Okay, nice! Let's leave this issue open for the documentation improvement; it should really be easier than this.

LourensVeen commented 4 months ago

It turns out that on some machines, hyperthreading is just always enabled. So we should probably also update MUSCLE3 to accept hyperthreaded resource descriptions from QCG-PJ. Just using the first logical core in the thread group and ignoring the rest should work well enough for what we're doing.
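A rough sketch of that tolerant parsing, assuming free_ids entries are either single core ids or comma-separated hyperthread groups (the helper name is made up for illustration):

    def first_logical_core(free_id: str) -> int:
        """Map a QCG-PJ core id like '3' or '0,40' to a single core id,
        keeping only the first logical core of a hyperthread group."""
        return int(free_id.split(',')[0])

    # Illustrative values; the real ids come from QCG-PJ's node description.
    free_ids = ['0,40', '1,41', '2,42']
    cores = set(map(first_logical_core, free_ids))   # {0, 1, 2}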