Currently, I'm working on the branch 512_sarus to address this issue.
At the moment, I'm able to create and run Sarus or Singularity containers on CSCS with a more or less functional Karabo environment.
A Sarus image can easily be created by running:
module load daint-gpu
module load sarus
sarus pull ghcr.io/i4ds/karabo-pipeline:latest
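To verify that the image is available locally before starting a job, a quick check could look like the following (a minimal sketch; the smoke test assumes a shell and echo are available in the image):
sarus images                                                   # list locally available images
sarus run ghcr.io/i4ds/karabo-pipeline:latest echo "image ok"  # quick smoke test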
For testing purposes, I start an interactive SLURM job, as recommended by CSCS:
srun -A sk05 -C gpu --pty bash
MPI
In the interactive SLURM job, I try to run Karabo in a Sarus container. To replace the MPI inside the container with the host's native MPI, I tried the native MPI hook as follows:
sarus run --tty --mpi --mount=type=bind,source=/users/lgehrig/Karabo-Pipeline,destination=/workspace/Karabo-Pipeline ghcr.io/i4ds/karabo-pipeline:0.19.6 bash
This, however, fails because no MPI is found inside the container. MPI is actually present, but it lives in /opt/conda. I'm currently not exactly sure how to solve this issue, and I have therefore asked Victor Holanda for support (still pending).
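To pin down where MPI actually lives inside the image, a quick inspection from within a running container (without the hook) might look like this; the /opt/conda paths are assumptions based on the observation above:
find /opt/conda -name "libmpi*.so*" 2>/dev/null   # MPI libraries shipped with the conda environment
/opt/conda/bin/mpirun --version                   # which MPI implementation conda provides
ldconfig -p | grep -i libmpi                      # libraries visible to the dynamic linker (what the hook replaces)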
pytest
A Sarus container starts successfully if the MPI hook is simply left out of the command above. I made sure to load daint-gpu beforehand. Inside the running Sarus container, the environment seems almost fine. However, when I run:
pytest /opt/conda/lib/python3.9/site-packages/karabo/test
Currently, 4 tests are failing. The reasons are:
RuntimeError: oskar_interferometer_check_init() failed with code 46 (CUDA-capable device(s) is/are busy or unavailable)
This seems odd to me, because a GPU is definitely available (nvidia-smi works inside the Sarus container).
FileNotFoundError: [Errno 2] No such file or directory: '/users/lgehrig/miniconda3/etc/pinocchio_params.conf'
For some reason, $CONDA_PREFIX is the same as it was outside the container and doesn't point to /opt/conda.
The GPU-related issues don't occur the second time the tests are called using pytest --lf. It seems that maybe the GPU isn't released fast enough between tests? I'm not sure about that.
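As a possible workaround for the $CONDA_PREFIX problem, pointing the variable at the container's conda installation before running the tests should at least let pinocchio find its config file (an untested sketch):
export CONDA_PREFIX=/opt/conda
pytest /opt/conda/lib/python3.9/site-packages/karabo/test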
There are some updates on this matter:
I had a discussion with CSCS support about the MPI hook. Unfortunately, it didn't get me to a point where all issues were solved. However, the support stated that we need mpich instead of openmpi, and that it has to be in a standard location for the native MPI hook of Sarus containers to work.
This task seems to be very difficult because our dependencies install openmpi. As far as I've seen, this is because we ourselves set openmpi as a dependency instead of mpich in the following feedstock builds:
From a quick walk-through of the packages, I didn't see anything that speaks for openmpi or against mpich. However, @fschramka claimed in build_base.yml that pinocchio needs the exact openmpi build, which makes me unsure whether I've looked through the above-mentioned repositories (links are in the feedstock meta.yaml files) properly. Thus, as long as we have openmpi in our builds, conda will always install openmpi instead of mpich.
As soon as we have mpich as our dependency instead of openmpi, the task still remains that it has to reside in a standard location and not in a conda environment. The only way I see this working is to install mpich in a standard location, force-remove mpich from the conda environment, and reinstall the mpich dummy packages in the environment (according to the conda-forge docs). Maybe pre-installing the dummies is possible, so that the existing mpich doesn't have to be force-removed (to test that, I need a test script which makes use of MPI to see whether it works). An example of how to install mpich in a standard location in a Docker container can be seen here.
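A rough sketch of how this could look in a Dockerfile, assuming we build MPICH from source into /usr/local and then swap the conda package for the conda-forge external_* dummy build (the MPICH version and pin are placeholders, not tested):
# build MPICH into a standard location so the Sarus native MPI hook can replace it
wget https://www.mpich.org/static/downloads/4.1.2/mpich-4.1.2.tar.gz
tar xzf mpich-4.1.2.tar.gz && cd mpich-4.1.2
./configure --prefix=/usr/local && make -j"$(nproc)" && make install && ldconfig
# make conda use the system MPICH via the conda-forge dummy ("external") build
conda install -y "mpich=4.1.*=external_*"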
Something which is also worth mentioning:
To me it seems like the karabo package doesn't care about installing MPI-compatible dependencies. An example can be seen here:
casacore, h5py and hdf5 are all installed as nompi builds, even though e.g. h5py has both openmpi and mpich builds on anaconda.org. Currently, I'm not exactly sure what this means and how to solve it. It looks like these packages contain MPI code paths that can't be used at all because we pull in the nompi builds.
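A quick way to check which variant is actually installed is to look at the conda build strings, roughly:
conda list | grep -E "casacore|h5py|hdf5"   # the build string column shows nompi_* vs. mpi_mpich_* / mpi_openmpi_*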
@Lukas113 during the time of compiling, it was the only option, more should be available now - take whatever MPI package you like and recompile the whole pinocchio branch with it :) Just check that you've got MPI binaries bundled - everything marked with "extern" does not hold them
Small update on my comment above.
Integrating mpich-enabled builds of h5py and hdf5 seems to be easy: just replace h5py with h5py=*=mpi*. When we have mpich as a dependency, the corresponding mpich build will be chosen.
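Concretely, selecting the MPI-enabled builds could look roughly like this (build-string selectors as used above; exact pins untested):
conda install "mpich" "h5py=*=mpi*" "hdf5=*=mpi*"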
Sadly, casacore (an oskar dependency) doesn't have any mpich builds, just nompi or openmpi ones. Therefore, we can't integrate an MPI-enabled casacore into Karabo.
So, a lot of this issue is done with PR #526. However, I can't really check any of the checkpoints mentioned at the beginning of the issue, for several reasons. The reasons, in order of the checkpoints, are as follows:
The MPI hook is now enabled with PR #526.
@kenfus Therefore, I suggest we close this issue and open a new one for the second point. Do you agree?
If yes, I suggest that you write the issue for the second checkpoint, because you're the person who has already done some work with dask and parallelization on CSCS and can therefore write a proper issue.
https://user.cscs.ch/tools/containers/sarus/
https://sarus.readthedocs.io/en/latest/quickstart/quickstart.html
https://sarus.readthedocs.io/en/latest/user/custom-cuda-images.html
https://sarus.readthedocs.io/en/latest/user/abi_compatibility.html