sxs-collaboration / spectre

SpECTRE is a code for multi-scale, multi-physics problems in astrophysics and gravitational physics.
https://spectre-code.org

Scheduler requires mpi use on head node #6183

Open guilara opened 3 months ago

guilara commented 3 months ago

Bug reports:

The scheduler command (spectre schedule) checks on the head node that the executable can parse the input file correctly, which requires running MPI on the head node. Once the check passes, the submit script is generated and the job is submitted to the queue.
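The validate-then-submit order described above can be sketched roughly as follows. This is only an illustration, not the code in Schedule.py: the function names `validate_input_file`, `schedule` and `run_subprocess`, the `--check-options` invocation, and the injectable `run` callable are all assumptions made for the example.

```python
import subprocess
from typing import Callable, List, Sequence

# A runner takes a command line and returns its exit code.
Runner = Callable[[Sequence[str]], int]


def run_subprocess(command: Sequence[str]) -> int:
    """Default runner: execute the command and return its exit code."""
    return subprocess.run(list(command)).returncode


def validate_input_file(executable: str, input_file: str, run: Runner) -> None:
    """Ask the executable to parse the input file; raise if that fails."""
    # Assumed invocation; the command the scheduler really uses may differ.
    command = [executable, "--check-options", "--input-file", input_file]
    if run(command) != 0:
        raise RuntimeError(f"Input file '{input_file}' failed validation")


def schedule(
    executable: str,
    input_file: str,
    submit_script: str,
    run: Runner = run_subprocess,
) -> List[str]:
    """Validate on the head node first, then hand the job to the queue."""
    steps = []
    # Step 1: runs the executable on the head node (this is what needs MPI).
    validate_input_file(executable, input_file, run)
    steps.append("validated")
    # Step 2: only reached if validation succeeded.
    if run(["sbatch", submit_script]) != 0:
        raise RuntimeError("sbatch refused the submission")
    steps.append("submitted")
    return steps
```

Both steps run in the same shell environment on the head node, so a module that is needed for step 1 but blocks step 2 makes the whole sequence fail.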

On our cluster Urania, however, we need to load an interactive MPI module to run the executable on the head node (and thus to let the scheduler validate the input file). This module in turn prevents sbatch from submitting jobs, which is the next step:

(env) guilara@urania02:/urania/.../RunDir> sbatch Submit.sh 
You cannot use 'sbatch' while the module impi-interactive is loaded!

We are not sure if other clusters will have similar problems.

I suggest removing the lines where the input file is validated (https://github.com/sxs-collaboration/spectre/blob/a6a8ee404306bec9d92da8ab89f636b037aefc25/support/Python/Schedule.py#L544). Or perhaps the validation could run directly on the compute nodes before execution?

nilsvu commented 3 months ago

As a quick workaround, let's add a --no-validate flag to skip validation. You can even use a config file to always pass this flag to the CLI automatically on the cluster (see spectre --help).
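A minimal sketch of how such a flag could gate the validation step is shown below. It uses the standard-library argparse only to stay self-contained; the real CLI may be built differently, and the names `build_parser`, `plan_steps` and the program name `schedule-sketch` are hypothetical.

```python
import argparse
from typing import List, Sequence


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="schedule-sketch")
    parser.add_argument("input_file", help="Path to the input file")
    # BooleanOptionalAction (Python 3.9+) creates both --validate and
    # --no-validate; validation stays on unless it is explicitly disabled.
    parser.add_argument(
        "--validate",
        action=argparse.BooleanOptionalAction,
        default=True,
        help="Check the input file on the head node before submitting",
    )
    return parser


def plan_steps(argv: Sequence[str]) -> List[str]:
    """Return the steps the scheduler would perform for these arguments."""
    args = build_parser().parse_args(list(argv))
    steps = []
    if args.validate:
        steps.append("validate")
    steps.append("submit")
    return steps


if __name__ == "__main__":
    print(plan_steps(["Input.yaml"]))
    print(plan_steps(["Input.yaml", "--no-validate"]))
```

Keeping validation as the default means only clusters that cannot run the executable on the head node need to opt out.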

Running the validation on the compute node defeats its purpose, because you want to validate the input file at job submission so you don't have to wait until the job has gone through the queue only to have it fail with a syntax error in the input file. If you can't run an executable and sbatch with the same modules on the head node, I'm not sure what to do. Can you run just the executable without mpirun on the head node?

guilara commented 3 months ago

@nilsvu I like the idea of the flag. Unfortunately, the executable doesn't run on the head node even without mpirun (unless I load the interactive module ofc). Right, I understand that it defeats the purpose of validation if it's done on the compute node. But I'm not sure at the moment what another fix could be.

nilsvu commented 3 months ago

Ok, can you add the --no-validate flag to work around this then? You can add it to Schedule.py and the scheduler_options.

guilara commented 3 months ago

Will do