Closed: n01r closed this issue 5 months ago.
I tried running the main_simulation_script.py you gave me a few months ago with the latest Optimas on main. I had to make a couple of changes in addition to setting libe_comms='mpi' in the Exploration options.
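For reference, a minimal sketch of what that looks like (gen, ev, and the numeric values are placeholders standing in for the script's actual generator, evaluator, and settings):

```python
# Sketch only: gen and ev stand in for the script's actual
# generator and evaluator objects; values are placeholders.
from optimas.explorations import Exploration

exp = Exploration(
    generator=gen,
    evaluator=ev,
    max_evals=100,
    sim_workers=8,
    libe_comms='mpi',  # run libEnsemble with MPI comms instead of local comms
)
exp.run()
```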
In explorations/base.py I put error handling around the makedirs call (to prevent a race condition when multiple ranks hit it at once):

```python
try:
    os.makedirs(main_dir)
except FileExistsError:
    # another rank already created the directory
    pass
```
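A terser, race-safe alternative with the same effect would be os.makedirs(main_dir, exist_ok=True).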
I also had to update to pydantic 2.
pip install -U pydantic
Then, on two nodes, this runs:
srun -N 2 -n 8 python main_simulation_script.py
This successfully ran exp.run(), then errored in h = exp.history, because all workers were again calling something that should only be done on the manager. So really, when using MPI comms, explorations/base.py should extract an is_manager value and guard these operations with it; for now, you could also just put a try/except around h = exp.history (see the sketch below).
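A minimal sketch of such a guard, assuming MPI comms where libEnsemble places the manager on rank 0 (is_manager here is a name I am introducing, not existing Optimas API):

```python
from mpi4py import MPI

# With libe_comms='mpi', libEnsemble runs the manager on rank 0.
is_manager = MPI.COMM_WORLD.Get_rank() == 0

exp.run()
if is_manager:
    h = exp.history  # only the manager should collect the history
```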
Edit: I should have run srun -N 2 -n 9 ..., so that there is one extra process for the manager (8 workers + 1 manager)!
I got Python from "module load conda". These are my modules:
shuds@nid004089:try_waket_mpi$ module list
Currently Loaded Modules:
  1) craype-x86-milan
  2) libfabric/1.15.2.0
  3) craype-network-ofi
  4) xpmem/2.6.2-2.5_2.38__gd067c3f.shasta
  5) PrgEnv-gnu/8.5.0
  6) cray-dsmml/0.2.2
  7) cray-libsci/23.12.5
  8) cray-mpich/8.1.28
  9) craype/2.7.30
 10) gcc-native/12.3
 11) perftools-base/23.12.0
 12) cpe/23.12
 13) cudatoolkit/12.2
 14) craype-accel-nvidia80
 15) gpu/1.0
 16) conda/Miniconda3-py311_23.11.0-2
I got nodes as follows:
shuds@login14:try_waket_mpi$ salloc -N 2 -t 30 -C cpu -q interactive -A m4272
Hi @shuds13, I finally got back to trying this out.
Unfortunately, I can only make it run interactively. So far, I have not yet achieved the goal of having this run in an unsupervised job.
I also needed to add the following lines:
# somehow necessary to get MPI to work
# will otherwise complain due to missing libmpi.so.12
export LD_LIBRARY_PATH=/opt/cray/pe/mpich/8.1.28/ofi/gnu/12.3/lib-abi-mpich:$LD_LIBRARY_PATH
export LD_LIBRARY_PATH=/opt/cray/pe/mpich/8.1.28/ofi/nvidia/23.3/lib-abi-mpich:$LD_LIBRARY_PATH
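As a quick diagnostic (my own sketch, not part of the original setup), one can check whether the loader now resolves the MPICH ABI library:

```python
# Diagnostic sketch: raises OSError if libmpi.so.12 is still not
# findable via LD_LIBRARY_PATH.
import ctypes

ctypes.CDLL("libmpi.so.12")
print("libmpi.so.12 resolved")
```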
But in that case I am still getting
MPICH ERROR [Rank 0] [job id 26693115.0] [Tue Jun 11 14:18:18 2024] [nid008212] - Abort(-1) (rank 0 in comm 0): MPIDI_CRAY_init: GPU_SUPPORT_ENABLED is requested, but GTL library is not linked
(Other MPI error)
aborting job:
MPIDI_CRAY_init: GPU_SUPPORT_ENABLED is requested, but GTL library is not linked
srun: error: nid008681: tasks 97-117,119-128: Segmentation fault
srun: Terminating StepId=26693115.0
slurmstepd: error: *** STEP 26693115.0 ON nid008212 CANCELLED AT 2024-06-11T21:18:18 ***
srun: error: nid008212: tasks 1-13,15-21,23-27,29-32: Segmentation fault
...
Edit: I am not sure anymore whether I actually managed to run with export MPICH_GPU_SUPPORT_ENABLED=1 interactively yesterday. Today I only got a simple test program to work with export MPICH_GPU_SUPPORT_ENABLED=0.
GPU support is not critical as long as I am only running Wake-T, but both the Ax generator and other codes that I might run will need GPU support.
A simple test program reproduces the error.
test_mpi.py
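(The attachment is not inlined here; judging from the traceback below, it presumably starts with the mpi4py import, e.g. a minimal reproducer along these lines:)

```python
# Hypothetical reconstruction of test_mpi.py: the traceback shows that
# line 1 is the mpi4py import; the rest is a guess at a minimal test.
from mpi4py import MPI

comm = MPI.COMM_WORLD
print(f"Rank {comm.Get_rank()} of {comm.Get_size()} says hello")
```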
This was submitted via debug_MPI_test.sbatch. The errors received, preceded by the list of loaded modules:
Currently Loaded Modules:
1) craype-x86-milan
2) libfabric/1.15.2.0
3) craype-network-ofi
4) xpmem/2.6.2-2.5_2.38__gd067c3f.shasta
5) PrgEnv-gnu/8.5.0
6) cray-dsmml/0.2.2
7) cray-libsci/23.12.5
8) cray-mpich/8.1.28
9) craype/2.7.30
10) gcc-native/12.3
11) perftools-base/23.12.0
12) cpe/23.12
13) cudatoolkit/12.2
14) craype-accel-nvidia80
15) gpu/1.0
16) conda/Miniconda3-py311_23.11.0-2
Traceback (most recent call last):
  File "/pscratch/sd/m/mgarten/electron_multistaging/wake-t/075_like_70_500k_particles/075_070_A_like_48_scan_Carlos_ramp_w_res_1x_8ppc_1_GeV/test_mpi.py", line 1, in <module>
    from mpi4py import MPI
ImportError: libnvf.so: cannot open shared object file: No such file or directory
(the same traceback is printed by each of the four ranks)
srun: error: nid001200: tasks 0-3: Exited with exit code 1
srun: Terminating StepId=26695216.0
Okay, I stole some installation instructions from our ImpactX installation on Perlmutter.
This did it for the small test script:
I loaded cray-python, since our ImpactX profile file says:
# optional: for Python bindings or libEnsemble
module load cray-python/3.11.5
Then I reinstalled mpi4py:
python3 -m pip uninstall -qqq -y mpi4py 2>/dev/null || true
python3 -m pip install --upgrade pip
python3 -m pip install --upgrade build
python3 -m pip install --upgrade packaging
python3 -m pip install --upgrade wheel
python3 -m pip install --upgrade setuptools
MPICC="cc -target-accel=nvidia80 -shared" python3 -m pip install --upgrade mpi4py --no-cache-dir --no-build-isolation --no-binary mpi4py
and I added
# necessary to use CUDA-Aware MPI and run a job
export CRAY_ACCEL_TARGET=nvidia80
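As a sanity check (my own sketch, not part of the original recipe), one can confirm that the rebuilt mpi4py is linked against Cray MPICH:

```python
# Sanity-check sketch: print which MPI library mpi4py was built against;
# on this setup it should report Cray MPICH 8.1.28.
from mpi4py import MPI

print(MPI.Get_library_version())
```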
Now let's see if that will work in an unsupervised way, and if it does, then also for optimas.
EDIT: Nope :disappointed: I tried it fresh, logged out and back in, got a node, same error messages.
Okay, I think I fixed it now. I redid my whole optimas + Wake-T installation:
module load cray-python/3.11.5
python3 -m pip install --user --upgrade pip
python3 -m pip install --user virtualenv
python3 -m pip cache purge
python3 -m venv /global/cfs/cdirs/m4272/mgarten/sw/perlmutter/gpu/venvs/optimas-wake-t/
source /global/cfs/cdirs/m4272/mgarten/sw/perlmutter/gpu/venvs/optimas-wake-t/bin/activate
python3 -m pip uninstall -qqq -y mpi4py 2>/dev/null || true
python3 -m pip install --upgrade pip
python3 -m pip install --upgrade build
python3 -m pip install --upgrade packaging
python3 -m pip install --upgrade wheel
python3 -m pip install --upgrade setuptools
python3 -m pip install --upgrade numpy
python3 -m pip install --upgrade pandas
MPICC="cc -target-accel=nvidia80 -shared" python3 -m pip install --upgrade mpi4py --no-cache-dir --no-build-isolation --no-binary mpi4py
python3 -m pip install --upgrade openpmd-api
python3 -m pip install --upgrade matplotlib
python3 -m pip install "optimas[all] @ git+https://github.com/optimas-org/optimas.git"
python3 -m pip install --upgrade wake-t
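A quick import smoke test in the activated venv (my own sketch) to confirm the key packages are usable:

```python
# Smoke-test sketch: the key packages of the new venv import cleanly.
import mpi4py   # noqa: F401
import optimas  # noqa: F401
import wake_t   # noqa: F401

print("imports OK")
```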
EDIT: Indeed, a job is running unsupervised with optimas and Wake-T. Cool :+1:
Good to hear it's running. It works for me using module load python instead of module load conda. I think when I tried before, I had activated an environment that switched to the correct Python.
Right, I played around with that in my previous environment, too. But I could not get it to work reliably.
Hi,
I have been trying to run optimas and Wake-T on Perlmutter (NERSC) with MPI, but I could not make it work so far. Does anyone have a working setup for Perlmutter? I wonder if specific modules need to be loaded, and if certain environment variables have to be set before installing optimas and on execution. The way specified in the docs is sufficient for single-node runs (https://optimas.readthedocs.io/en/latest/user_guide/installation_perlmutter.html), but once I add libe_comms='mpi' to the Exploration object and try to prepend, e.g., srun -N 1 -n 8 on an interactive node, I am getting errors about missing MPI shared object files. When I load mpich, which unloads the pre-loaded cray-mpich module, it asks for GCC 12 shared object files. After loading a GCC, it asks for the CUDA runtime.