DavidPCoster opened this issue 1 year ago
I'm using QCG-PJ in a rather non-standard way, so I very much prefer that any issues be reported here first; if it does turn out to be a QCG-PJ issue then I'll go file an issue there.
This does look like we're getting a funny resource description from QCG-PJ, but it may only be funny in the sense that I'm not expecting it, even though it actually makes sense.
Could you try this with `--log-level DEBUG` as an additional option to `muscle_manager`? That should give a lot of log output from QCG-PJ about how it's interpreting the SLURM environment variables. If it's a lot of output and/or you don't want to share it here publicly then feel free to email it to me.
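For reference, a minimal sketch of what that invocation could look like in the batch script; the yMMSL file name and the `--start-all` option are assumptions about how the workflow is normally launched:

```bash
# Hypothetical example: run the manager with debug logging enabled.
# The configuration file name is illustrative, and --start-all is assumed
# to match the usual way this workflow is started.
muscle_manager --start-all --log-level DEBUG fusion_qmmc.ymmsl
```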
Browsing through the QCG-PJ code, it seems like this will print a line starting with `cpu list` for each node; those lines would be of interest.
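As an aside, a quick way to pull those lines out of a large log, assuming the manager writes its log to `muscle3_manager.log` in the run directory (the file name may differ on your setup):

```bash
# Hypothetical: extract the QCG-PJ resource lines from the manager log.
grep -i "cpu list" muscle3_manager.log
```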
Attached please find the muscle3 manager log after setting the log level to debug:
Thanks, Dave.
After deciphering the QCG-PJ source a bit I think that this is related to hyperthreading. I'm expecting to get a string with a single int containing a core id, but I'm getting `0, 40`, designating two hyperthreads on a single core. Hyperthreads, cores, and MPI tasks are all getting a bit mixed up here, and it's hard to see what the intent of the code is. I also need to think about what the intent should be, given that we may not be doing MPI and may or may not want to do hyperthreading on a per-submodel basis, and then how that relates to the resource requirements in the yMMSL file.
Meanwhile, are you starting with `--ntasks-per-node=1`? Could you try with `--ntasks-per-node=40`? You've got 40 cores after all, and may as well use them, plus that should nudge QCG-PJ down another code path and may just avoid the problem. I've run MUSCLE3 on machines with hyperthreading enabled with `--ntasks-per-node` set to the number of physical cores, and it worked there.
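In batch-script terms that would mean something like the following sbatch directives; the node count is just an example, the point is the tasks-per-node setting:

```bash
#SBATCH --nodes=1              # example node count
#SBATCH --ntasks-per-node=40   # one task slot per physical core
```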
I changed `--ntasks-per-node=1` to `--ntasks-per-node=40`, and commented out the `--cpus-per-task=40`.
We now get some different errors ...
I think we have 40 instances of the manager being started now. That's overdoing it slightly :smile:. Are you starting `muscle_manager` through `srun`? Could you post your batch script, maybe?
Yes & yes: `fusion_qmmc.slurm.txt` (`.txt` added so that I can upload the file)
Okay, yes, removing `srun` from in front of `muscle_manager` should help. Then you'll have one instance of `muscle_manager`, with an environment that specifies that we want at most 40 tasks in total. QCG-PJ will pick that up as having 40 cores (slots) available, and MUSCLE3 will then suballocate from there to place the instances.

MUSCLE3 will set `OMP_NUM_THREADS` for submodels according to their resource requirements, so there's no need to set that in the script.
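A minimal sketch of the relevant part of such a batch script; the directives, file name, and `--start-all` option are illustrative assumptions and will differ per machine:

```bash
#!/bin/bash
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=40   # 40 physical cores -> 40 slots for QCG-PJ

# (load modules / activate the Python environment here as usual)

# Start the manager directly, not via srun, so that only one instance runs;
# it will then suballocate the 40 slots to place the submodel instances.
muscle_manager --start-all fusion_qmmc.ymmsl
```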
To do:

- Document how to set up the sbatch script and resource options for `muscle_manager`
That seems to have fixed the problem -- thanks!
Okay, nice! Let's leave this issue open for the documentation improvement; it should really be easier than this.
It turns out that on some machines, hyperthreading is just always enabled. So we should probably also update MUSCLE3 to accept hyperthreaded resource descriptions from QCG-PJ. Just using the first logical core in the thread group and ignoring the rest should work well enough for what we're doing.
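As a rough illustration of that idea, outside of the actual MUSCLE3 code: given a hyperthreaded cpu list like `0,40`, keeping only the first logical core could look like this:

```bash
# Hypothetical: reduce a hyperthread group to its first logical core.
echo "0,40" | cut -d, -f1   # prints: 0
```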
I received the following error message when I tried submitting my workflow to a SLURM batch node:

Since the path is to libmuscle I assume this is the right starting point. I realize that this might end up being a QCG-PJ issue, but I thought I would start here.