nv-legate / cunumeric

An Aspiring Drop-In Replacement for NumPy at Scale
https://docs.nvidia.com/cunumeric/24.06/
Apache License 2.0

installation of OpenMPI from Conda results in incompatibilities with system MPI on many supercomputers #630

Closed rohany closed 1 year ago

rohany commented 2 years ago

Installing OpenMPI from Conda (or any package that overrides the system compilers, see https://github.com/nv-legate/cunumeric/issues/629) results in various hard-to-track-down build issues on supercomputers like Lassen and Piz Daint, where the software stack is carefully managed. Additionally, on these systems, the CMake build does not properly reference the MPI compilers (mpicc and mpicxx) when building Legion's GASNet or legate/cuNumeric, resulting in errors such as "cannot compile MPI programs" during configuration, or <mpi.h> not found.

It seems like we should have separate conda files for this kind of use case, where more important packages (compilers, MPI etc.) are provided by the platform.

manopapad commented 2 years ago

> It seems like we should have separate conda files for this kind of use case, where more important packages (compilers, MPI etc.) are provided by the platform.

https://github.com/nv-legate/legate.core/pull/367 should be taking care of this.

> Additionally, on these systems, the CMake build does not properly reference the MPI compilers (mpicc and mpicxx) when building Legion's GASNet or legate/cuNumeric

It would be good to understand what is going wrong here; the embedded gasnet build should be using mpicc directly.

I would actually expect legate.core to also be using mpicc (since we're compiling code that uses MPI, core/comm/comm.cc in particular), but weirdly I don't see mpicc being used at all, nor any explicit linking against -lmpi. Out of curiosity, @jjwilke @trxcllnt, do you know how this is working?

manopapad commented 1 year ago

All the mpicc wrapper does is call the underlying compiler with some extra flags needed by code that uses MPI. It looks like CMake doesn't use mpicc directly, but instead adds the required flags to its own compiler invocations.
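To make the "wrapper is just a compiler plus flags" point concrete, here is a minimal Python sketch that splits a wrapper's printed command line into the pieces a build system would need. Most MPI wrappers can print the command they would run (Open MPI's wrappers accept `--showme`, MPICH's accept `-show`). The sample line and the function name `split_wrapper_flags` below are hypothetical illustrations, not output from any of the systems mentioned in this thread, and this is not how CMake itself is implemented.

```python
import shlex


def split_wrapper_flags(command_line):
    """Split a compiler-wrapper command line into compiler, -I, -L, -l and other flags."""
    tokens = shlex.split(command_line)
    result = {
        "compiler": tokens[0] if tokens else None,
        "include_dirs": [],
        "library_dirs": [],
        "libraries": [],
        "other": [],
    }
    for tok in tokens[1:]:
        if tok.startswith("-I") and len(tok) > 2:
            result["include_dirs"].append(tok[2:])
        elif tok.startswith("-L") and len(tok) > 2:
            result["library_dirs"].append(tok[2:])
        elif tok.startswith("-l") and len(tok) > 2:
            result["libraries"].append(tok[2:])
        else:
            # Anything else (e.g. -pthread, -Wl,... linker options) is kept verbatim.
            result["other"].append(tok)
    return result


# Hypothetical wrapper output, made up for illustration only.
SAMPLE = "gcc -I/opt/mpi/include -pthread -L/opt/mpi/lib -Wl,-rpath,/opt/mpi/lib -lmpi"

if __name__ == "__main__":
    for key, value in split_wrapper_flags(SAMPLE).items():
        print(key, value)
```

Comparing the include and library directories reported by the system wrapper against those CMake ends up using is one way to spot a mismatch between a Conda-provided MPI and the platform MPI.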

In any case, it looks like this works as expected on clusters if we don't install the "compilers" packages from conda.
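A quick way to check whether an active Conda environment is shadowing the platform's compilers or MPI wrappers is to see where the tools on `PATH` resolve. The sketch below assumes the environment was activated so that `CONDA_PREFIX` is set. The helper names and the default tool list are hypothetical and should be adjusted to the cluster in question.

```python
import os
import shutil


def is_inside_prefix(tool_path, prefix):
    """Return True if tool_path is located under the directory prefix."""
    if not tool_path or not prefix:
        return False
    tool = os.path.abspath(tool_path)
    root = os.path.abspath(prefix)
    return os.path.commonpath([tool, root]) == root


def conda_provided_tools(names=("mpicc", "mpicxx", "cc", "c++"), prefix=None):
    """Map each tool name to True if the copy found on PATH lives inside the Conda prefix."""
    if prefix is None:
        # CONDA_PREFIX is set by `conda activate`; it is absent outside an environment.
        prefix = os.environ.get("CONDA_PREFIX")
    return {name: is_inside_prefix(shutil.which(name), prefix) for name in names}


if __name__ == "__main__":
    for name, from_conda in conda_provided_tools().items():
        origin = "Conda environment" if from_conda else "system (or not found)"
        print(f"{name}: {origin}")
```

If any entry reports the Conda environment on a machine where the platform is supposed to provide compilers and MPI, that environment likely includes packages such as the "compilers" ones mentioned above.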