i4Ds / Karabo-Pipeline

The Karabo Pipeline can be used as a Digital Twin for SKA
https://i4ds.github.io/Karabo-Pipeline/
MIT License

Docker Image is usable with Sarus on the CSCS cluster #512

Closed: kenfus closed this issue 4 months ago

kenfus commented 1 year ago

https://user.cscs.ch/tools/containers/sarus/
https://sarus.readthedocs.io/en/latest/quickstart/quickstart.html
https://sarus.readthedocs.io/en/latest/user/custom-cuda-images.html
https://sarus.readthedocs.io/en/latest/user/abi_compatibility.html

Lukas113 commented 1 year ago

Currently, I'm working on the branch 512_sarus to address this issue.

At the moment, I'm able to create and run Sarus or Singularity containers on CSCS with a more or less functional Karabo environment.

A Sarus image can be created by simply running:

module load daint-gpu
module load sarus
sarus pull ghcr.io/i4ds/karabo-pipeline:latest
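
To verify the pull, the locally available images can be listed; sarus images is part of the standard Sarus CLI, and the expected output noted in the comment is an assumption on my part:

sarus images   # should list ghcr.io/i4ds/karabo-pipeline with tag latest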

For testing purposes, I start an interactive SLURM job (as recommended by CSCS):

srun -A sk05 -C gpu --pty bash

MPI

Inside the interactive SLURM job, I try to get Karabo running in a Sarus container. To replace the MPI inside the container with the host's MPI, I tried to use the native MPI hook as follows:

sarus run --tty --mpi --mount=type=bind,source=/users/lgehrig/Karabo-Pipeline,destination=/workspace/Karabo-Pipeline ghcr.io/i4ds/karabo-pipeline:0.19.6 bash

This fails, however, because no MPI is found inside the container; it is actually there, but under /opt/conda rather than in a standard location. I'm currently not sure how to solve this, so I have asked Victor Holanda for support (still pending).
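
For reference, a quick way to check where MPI libraries actually live inside the running container; this is a plain-shell sketch and assumes a glibc-based image:

# MPI libraries visible to the dynamic linker, i.e. in standard system locations
ldconfig -p | grep -i libmpi
# The conda-provided MPI that actually ships with the image
ls /opt/conda/lib/libmpi*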


pytest

A Sarus container starts successfully if the MPI hook is simply left out of the command above (I made sure to load daint-gpu beforehand). I then end up inside a running Sarus container where the environment seems almost fine. However, when I run:

pytest /opt/conda/lib/python3.9/site-packages/karabo/test


Currently, 4 tests are failing. The reasons are:

Lukas113 commented 1 year ago

GPU-related issues don't occur the second time the tests are called using pytest --lf. It seems that maybe the GPU isn't released fast enough between tests? I'm not sure about that.

Lukas113 commented 1 year ago

There are some updates on this matter:

I had a discussion with CSCS support about the MPI hook. Unfortunately, it did not lead to a state where all issues are solved. However, the support stated that we need mpich instead of openmpi, and that it needs to reside in a standard location for the native MPI hook of Sarus containers to work.

This task seems to be very difficult because our dependencies install openmpi. As far as I've seen, this is because we ourselves set openmpi as a dependency instead of mpich in the following feedstock builds:

From a quick walk-through of the packages, I didn't see anything that speaks for openmpi or against mpich. However, @fschramka claimed in build_base.yml that pinocchio needs the exact openmpi build, which makes me unsure whether I've looked through the above-mentioned repositories (links are in the feedstock meta.yaml files) properly. Thus, as long as we have openmpi in our builds, conda will always install openmpi instead of mpich.

Even once we have mpich as our dependency instead of openmpi, it still has to reside in a standard location and not in a conda environment. The only way I see this working is to install mpich in a standard location, force-remove mpich from the conda environment, and reinstall the mpich dummy packages in the environment (according to the conda-forge documentation). Maybe a pre-installation of the dummies is possible so that the existing mpich doesn't have to be force-removed (to test that, I need a test script which uses MPI and check whether it still works). An example of how to install mpich in a standard location in a Docker container can be seen here; a rough sketch of the approach is shown below.
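
The following is a minimal, untested sketch of that approach for a glibc-based image; the MPICH version (3.1.4), the /usr/local prefix and the external_* pin are assumptions and must stay ABI-compatible with the host's Cray MPICH:

# Build MPICH from source so that its libraries end up in a standard location
wget -q https://www.mpich.org/static/downloads/3.1.4/mpich-3.1.4.tar.gz
tar xf mpich-3.1.4.tar.gz && cd mpich-3.1.4
./configure --prefix=/usr/local --disable-fortran
make -j"$(nproc)" && make install && ldconfig
cd .. && rm -rf mpich-3.1.4 mpich-3.1.4.tar.gz

# Swap the conda-provided mpich for the conda-forge "external" dummy build so
# that MPI-enabled conda packages resolve against the system MPICH above
conda remove --force -y mpich
conda install -y "mpich=*=external_*"   # pin the version to match the build above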

Lukas113 commented 1 year ago

Something which is also worth mentioning:

It seems to me that the Karabo wheel doesn't take care to install MPI-compatible dependencies. For example:


casacore, h5py and hdf5 are all nompi wheels, even though e.g. h5py has both openmpi and mpich wheels on anaconda.org. Currently, I'm not exactly sure what this means in practice or how to solve it. It seems to me that these packages have MPI capabilities which can't be used at all because we install the nompi wheels.

fschramka commented 1 year ago

@Lukas113 at the time of compiling, it was the only option; more should be available now. Take whatever MPI package you like and recompile the whole pinocchio branch with it :) Just check that you've got the MPI binaries bundled; everything marked with "extern" does not hold them.

Lukas113 commented 1 year ago

Small update on my comment above.

Integrating mpich-enabled h5py and hdf5 wheels seems to be easy: simply replace h5py with h5py=*=mpi*, and once we have mpich dependencies, the corresponding mpich wheel will be chosen (see the sketch below).
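
For illustration, a hedged sketch of such a pin as a conda install command; the mpi_mpich build-string pattern follows the usual conda-forge naming and may need adjusting:

# Select the MPI-enabled (MPICH) variants instead of the nompi builds
conda install -c conda-forge "h5py=*=mpi_mpich*" "hdf5=*=mpi_mpich*"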

Sadly, casacore (an OSKAR dependency) doesn't have any mpich builds, only nompi or openmpi ones. Therefore, we can't integrate an MPI-enabled casacore into Karabo.

Lukas113 commented 11 months ago

A lot of this issue is now done with PR #526. However, I can't really check off any of the checkpoints mentioned at the beginning of the issue, for several reasons. The reasons, in order of the checkpoints, are as follows:

Lukas113 commented 10 months ago

The MPI hook is now enabled with PR #526. A rough sketch of how a run with the hook could be verified is shown below.
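
This is only a minimal sketch, assuming python and mpi4py are available inside the image (not confirmed here); the account and modules are taken from the commands earlier in this issue:

module load daint-gpu sarus
# --mpi activates the native MPI hook, so the container uses the host MPICH
srun -A sk05 -C gpu -N 2 -n 2 sarus run --mpi ghcr.io/i4ds/karabo-pipeline:latest \
    python -c "from mpi4py import MPI; c = MPI.COMM_WORLD; print(c.Get_rank(), c.Get_size())"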

@kenfus I therefore suggest that we close this issue and open a new one for the second point. Do you agree?

If yes, I suggest that you write the issue for the second checkpoint, because you're the person who has already done some work with Dask and parallelization on CSCS and can therefore write a proper issue.