optimas-org / optimas

Optimization at scale, powered by libEnsemble
https://optimas.readthedocs.io

Exec format error when running Optimas with HiPACE++ on Maxwell #235

Closed · lboult closed this 2 weeks ago

lboult commented 4 months ago

Hi all,

I have been trying to do some simple grid scans with Optimas and HiPACE++ on Maxwell and am encountering an error that seems to be related to the submission script that Optimas (or libEnsemble?) generates, e.g.:

2024-07-24 11:06:59,278 libensemble.executors.mpi_executor (WARNING): task libe_task_sim_worker2_0 submit command failed on try 2 with error [Errno 8] Exec format error: './libe_task_sim_worker2_0_run.sh'

I checked, and this also seems to happen with the examples provided in the docs.

Here is an example of one of the submission scripts that are generated:

source /etc/profile.d/modules.sh     # make sure the module command is available

module purge
module load maxwell gcc/9.3 openmpi/4
module load maxwell cuda/11.8
module load hdf5/1.10.6
# pick correct GPU setting (this may differ for V100 nodes)
export GPUS_PER_SOCKET=2
export GPUS_PER_NODE=4
# optimize CUDA compilation for A100
export AMREX_CUDA_ARCH=8.0

export OMP_NUM_THREADS=1

mpirun -hosts max-mpag002 -np 2 --ppn 2 /home/lboulton/src/hipace/build/bin/hipace template_simulation_script

I tried submitting the same script as a single batch job, which also fails, but the error then seems to suggest that --ppn isn't a valid parameter... is this a bug, or perhaps something to do with having the wrong version of Open MPI? Or maybe even a peculiarity of Maxwell...

Any advice is appreciated!

Cheers, Lewis

shuds13 commented 4 months ago

@lboult

I think most likely libEnsemble is detecting MPICH, while the subprocess is actually using Open MPI (assuming you are using an env_script). The quickest way to force it is probably to specify openmpi explicitly.

In libEnsemble this would be either:

exctr = MPIExecutor(custom_info={"mpi_runner": "openmpi"})

or as a platform spec:

libE_specs["platform_specs"] = {
    "mpi_runner": "openmpi",
}

In Optimas, if your exploration object is called exp, you can probably do:

exp.libE_specs["platform_specs"] = {
    "mpi_runner": "openmpi",
}
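
For context, here is a minimal sketch of where the first option would sit in a plain libEnsemble calling script. The hipace path is taken from your log; the app name and the register_app step are just the usual pattern, so adapt as needed.

# Sketch: force Open MPI-style run lines instead of auto-detection,
# which may otherwise pick up MPICH (and its --ppn flag).
from libensemble.executors.mpi_executor import MPIExecutor

exctr = MPIExecutor(custom_info={"mpi_runner": "openmpi"})

# Register the application the workers will launch (placeholder path).
exctr.register_app(
    full_path="/home/lboulton/src/hipace/build/bin/hipace",
    app_name="hipace",
)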
shuds13 commented 4 months ago

Hang on, I just realised Optimas has a more direct option to set this in your calling script.

ev = TemplateEvaluator(
    env_mpi='openmpi',
)

If you already have this and it's still failing, let me know.
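
For reference, here is a sketch of how a full grid-scan calling script might look with env_mpi set, loosely following the docs example. analyze_simulation, the parameter ranges, n_steps, and the env_script path are placeholders, and the exact generator/evaluator arguments may differ in your Optimas version.

from optimas.core import VaryingParameter, Objective
from optimas.generators import GridSamplingGenerator
from optimas.evaluators import TemplateEvaluator
from optimas.explorations import Exploration


def analyze_simulation(simulation_directory, output_params):
    # Placeholder: read the HiPACE++ output and fill in the objective.
    output_params["f"] = 0.0
    return output_params


gen = GridSamplingGenerator(
    varying_parameters=[VaryingParameter("x0", 0.0, 1.0)],  # placeholder range
    objectives=[Objective("f", minimize=True)],
    n_steps=[10],
)

ev = TemplateEvaluator(
    sim_template="template_simulation_script",
    analysis_func=analyze_simulation,
    executable="/home/lboulton/src/hipace/build/bin/hipace",
    env_script="/path/to/env_script.sh",  # placeholder: your module setup
    env_mpi="openmpi",  # forces Open MPI-style mpirun options
)

exp = Exploration(generator=gen, evaluator=ev, max_evals=10, sim_workers=2)
exp.run()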

lboult commented 4 months ago

Hey,

Thanks for the fast response. So I think you're right that the env_mpi='openmpi' argument is necessary. I do this now, but there still seems to be some issue that throws the same 'Exec format error'.

Here's what the submission file that Optimas generates looks like now:

source /etc/profile.d/modules.sh     # make sure the module command is available

module purge
module load maxwell gcc/9.3 openmpi/4
module load maxwell cuda/11.8
module load hdf5/1.10.6
# pick correct GPU setting (this may differ for V100 nodes)
export GPUS_PER_SOCKET=2
export GPUS_PER_NODE=4
# optimize CUDA compilation for A100
export AMREX_CUDA_ARCH=8.0

export OMP_NUM_THREADS=1

mpirun -machinefile machinefile_autogen_for_worker_2_task_0 -np 2 -npernode 2 /home/lboulton/src/hipace/build/bin/hipace template_simulation_script

The machinefile that it refers to also seems to be generated okay (I think?).

Unfortunately there's no indication in the .err or .out files about what the exact issue might be now... Let me know if I can provide more details.

Cheers, Lewis

shuds13 commented 4 months ago

What happens now if you run:

./libe_task_sim_worker2_0_run.sh

by itself?

You could try without the machinefile and/or in an interactive session, and set the node name in the machinefile to the node you're on. I wonder if on your system something is needed, like starting the file with:

#!/bin/bash

I would see if you can get it to run that file on your system. Let me know what it gives you.

You could also see if sourcing the file makes a difference.

lboult commented 4 months ago

Hey,

Sorry for the delayed reply, I was doing some investigating. So submitting the file as a batch job works fine (I submit specifically to the node mentioned in the machinefile as well). But when I run it interactively I get these errors:

/home/lboulton/src/hipace/build/bin/hipace: /lib64/libstdc++.so.6: version `GLIBCXX_3.4.20' not found (required by /home/lboulton/src/hipace/build/bin/hipace)
/home/lboulton/src/hipace/build/bin/hipace: /lib64/libstdc++.so.6: version `CXXABI_1.3.8' not found (required by /home/lboulton/src/hipace/build/bin/hipace)
/home/lboulton/src/hipace/build/bin/hipace: /lib64/libstdc++.so.6: version `CXXABI_1.3.9' not found (required by /home/lboulton/src/hipace/build/bin/hipace)
/home/lboulton/src/hipace/build/bin/hipace: /lib64/libstdc++.so.6: version `GLIBCXX_3.4.26' not found (required by /home/lboulton/src/hipace/build/bin/hipace)
/home/lboulton/src/hipace/build/bin/hipace: /lib64/libstdc++.so.6: version `GLIBCXX_3.4.22' not found (required by /home/lboulton/src/hipace/build/bin/hipace)
/home/lboulton/src/hipace/build/bin/hipace: /lib64/libstdc++.so.6: version `GLIBCXX_3.4.21' not found (required by /home/lboulton/src/hipace/build/bin/hipace)
/home/lboulton/src/hipace/build/bin/hipace: /lib64/libstdc++.so.6: version `CXXABI_1.3.8' not found (required by /software/gcc/9.3.0/openmpi/4.0.4/lib/libmpi_cxx.so.40)
/home/lboulton/src/hipace/build/bin/hipace: /lib64/libstdc++.so.6: version `CXXABI_1.3.9' not found (required by /software/gcc/9.3.0/openmpi/4.0.4/lib/libmpi_cxx.so.40)

So it seems like something dodgy is going on with the environments that I'm yet to understand?

shuds13 commented 4 months ago

On some systems there are differences between running interactively and in batch, such as whether your .bashrc is run. This varies from one system to another. It seems hipace is picking up the wrong standard C++ library.

You could echo your LD_LIBRARY_PATH in batch and try to replicate it.

You said you had your original issue with the libEnsemble docs example script. Perhaps check if that works now, to see if libEnsemble is still an issue.

shuds13 commented 4 months ago

Should be fixed by https://github.com/Libensemble/libensemble/pull/1392, which runs user scripts in a shell.
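
For anyone hitting this later, here is a rough illustration (not the actual libEnsemble code) of why the error occurs and why running through a shell avoids it:

import subprocess

# If the generated run script has no "#!" shebang line, exec'ing it directly
# makes the kernel return ENOEXEC, which Python surfaces as
# "[Errno 8] Exec format error".
# subprocess.run(["./libe_task_sim_worker2_0_run.sh"])  # OSError: [Errno 8]

# Invoking it through a shell sidesteps this, because bash interprets the
# file itself rather than asking the kernel to execute it.
subprocess.run(["/bin/bash", "./libe_task_sim_worker2_0_run.sh"], check=True)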

lboult commented 4 months ago

Sorry for the slow reply, I was away for the last half of last week.

I've just now installed the version of libEnsemble from the git branch you referenced and can confirm that this seems to have solved my issue.

Thanks very much for your help :)