grantcurell / SIMULATeQCD

SIMULATeQCD is a multi-GPU Lattice QCD framework that makes it simple and easy for physicists to implement lattice QCD formulas while still providing the best possible performance.
https://latticeqcd.github.io/SIMULATeQCD/
GNU General Public License v3.0

running on supercomputer: general #3

Open clarkedavida opened 1 year ago

clarkedavida commented 1 year ago

We are trying a couple of strategies to run on a supercomputer.

Strategy 1: Compile on the supercomputer. I am tabulating here which clusters come with podman and which don't. Ideally, nobody should have to contact a sysadmin in order to run on a supercomputer.

PODMAN: notchpeak, houston

NO PODMAN: bielefeld, juwels
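
A quick way to fill in this table from a login node is to check whether podman is on the PATH. The snippet below is only a sketch: podman_status is a hypothetical helper name, and it assumes that being on the PATH of the current node is a good enough test (some clusters may only expose podman after loading a module or on compute nodes).

```shell
# Hypothetical helper (not part of SIMULATeQCD).
# Prints "PODMAN" if the given command (default: podman) is found on the
# PATH of the current node, and "NO PODMAN" otherwise.
podman_status() {
    if command -v "${1:-podman}" >/dev/null 2>&1; then
        echo "PODMAN"
    else
        echo "NO PODMAN"
    fi
}
```

Running podman_status with no argument on each cluster gives the label to record above.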

Strategy 2: Compile locally, then run on supercomputer.

This doesn't seem to work in the general case. For example, when I compile GenerateQuenched locally and then run it on bielefeld, I get the error:

mpiexec --oversubscribe -np 1 ./GenerateQuenched GenerateQuenched.param 
./GenerateQuenched: error while loading shared libraries: libmpi_cxx.so.40: cannot open shared object file: No such file or directory

This mpiexec --oversubscribe -np 1 command is what I use to run interactively, and it works when I compile the traditional way.
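
To see which shared libraries a binary cannot resolve before launching it, one can ask the dynamic loader directly. The following is a minimal sketch, assuming a Linux node with ldd and awk available; parse_missing and missing_libs are hypothetical helper names, not part of SIMULATeQCD.

```shell
# Hypothetical helpers (not part of SIMULATeQCD); assumes Linux with ldd and awk.

# Read ldd-style output on stdin and print the name of every library
# that the dynamic loader reports as "not found".
parse_missing() {
    awk '/=> not found/ { print $1 }'
}

# Run ldd on the binary given as $1 and list its unresolved libraries.
missing_libs() {
    ldd "$1" | parse_missing
}
```

On the cluster, missing_libs ./GenerateQuenched should list libmpi_cxx.so.40 among the unresolved libraries in the failing case above. An empty result means every dependency was found.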

clarkedavida commented 1 year ago

If the binary needs to access some library, then I think strategy 2 will not work, right?

clarkedavida commented 1 year ago

In case getting this to work on supercomputers turns out to be a significant undertaking that neither of us has time for:

Maybe we can recommend the container_build for local use only, which is still beneficial to some of us. Then for the supercomputer build we can use the already existing instructions, or possibly streamline them with bash scripts? I already have such scripts for a couple of clusters.

Just a thought.

grantcurell commented 1 year ago

It's telling you that you compiled it against libmpi_cxx.so and that the supercomputer you're running on doesn't have that shared library on its library search path. That tells me the supercomputer in question is using a different MPI library. You need to either compile against that MPI library or make sure that libmpi_cxx.so is on the library load path (for example via LD_LIBRARY_PATH).
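
For the second option, here is a rough sketch of putting a library directory on the load path. It assumes the cluster actually has an MPI installation that provides libmpi_cxx.so.40 somewhere; prepend_lib_dir is a hypothetical helper name, and the directory you pass to it is site-specific.

```shell
# Hypothetical helper (not part of SIMULATeQCD).
# Prepend directory $1 to LD_LIBRARY_PATH unless it is already listed,
# then export the variable so that child processes inherit it.
prepend_lib_dir() {
    dir=$1
    case ":${LD_LIBRARY_PATH:-}:" in
        *":$dir:"*) ;;  # already present, nothing to do
        *) LD_LIBRARY_PATH="$dir${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}" ;;
    esac
    export LD_LIBRARY_PATH
}
```

After calling prepend_lib_dir with the directory that contains libmpi_cxx.so.40, rerun the same mpiexec command. Even if the loader then finds the library, a binary built against a different MPI version may still fail at run time, so compiling against the cluster's own MPI is likely the more robust fix.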