Open gduclaux opened 1 year ago
After digging further into the Pawsey doco I found this https://pawsey.org.au/technical-newsletter/ (see 13 March 2023 entry):
Parallel IO within Containers
Currently there are issues running MPI-enabled software that makes use of parallel IO from within a container being run by the Singularity container engine. The error message seen will be similar to:
Example of error message
Assertion failed in file ../../../../src/mpi/romio/adio/ad_cray/ad_cray_adio_open.c at line 520: liblustreapi != NULL
/opt/cray/pe/mpich/default/ofi/gnu/9.1/lib-abi-mpich/libmpi.so.12(MPL_backtrace_show+0x26) [0x14ac6c37cc4b]
/opt/cray/pe/mpich/default/ofi/gnu/9.1/lib-abi-mpich/libmpi.so.12(+0x1ff3684) [0x14ac6bd2e684]
/opt/cray/pe/mpich/default/ofi/gnu/9.1/lib-abi-mpich/libmpi.so.12(+0x2672775) [0x14ac6c3ad775]
/opt/cray/pe/mpich/default/ofi/gnu/9.1/lib-abi-mpich/libmpi.so.12(+0x26ae1c1) [0x14ac6c3e91c1]
/opt/cray/pe/mpich/default/ofi/gnu/9.1/lib-abi-mpich/libmpi.so.12(MPI_File_open+0x205) [0x14ac6c38e625]
Currently it is unclear exactly what is causing this issue. Investigations are ongoing.
Workaround:
There is no workaround that does not require a change in the workflow. Either the container needs to be rebuilt to not make use of parallel IO libraries (e.g. the container was built using parallel HDF5) or if that is not possible, the software stack must be built “bare-metal” on Setonix itself (see How to Install Software).
I guess I'm about to install UW2 from source on Setonix... Would you have a step-by-step recipe at hand for this specific Cray machine? I found the one you put together for Magnus a few years back.
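For anyone triaging job logs for this failure, here is a minimal Python sketch that checks whether a log contains the assertion quoted in the newsletter entry. The function name `find_lustre_assertion` and the sample text are illustrative only. The match is based solely on the message quoted above, so other Cray MPICH versions may word it differently.

```python
import re

# Pattern built from the assertion text quoted in the Pawsey newsletter entry.
# Assumption: the message keeps the same wording across Cray MPICH versions.
_ASSERTION = re.compile(
    r"Assertion failed in file (\S+) at line (\d+): liblustreapi != NULL"
)


def find_lustre_assertion(log_text):
    """Return (source_file, line_number) if the log contains the
    liblustreapi assertion raised from ad_cray_adio_open.c, else None."""
    for match in _ASSERTION.finditer(log_text):
        source_file = match.group(1)
        line_number = int(match.group(2))
        if source_file.endswith("ad_cray_adio_open.c"):
            return source_file, line_number
    return None


if __name__ == "__main__":
    # Hypothetical sample: the first line of the trace quoted above.
    sample = (
        "Assertion failed in file "
        "../../../../src/mpi/romio/adio/ad_cray/ad_cray_adio_open.c "
        "at line 520: liblustreapi != NULL"
    )
    print(find_lustre_assertion(sample))
```

A `None` result only means this particular signature was not found; it says nothing about other HDF5 or MPI-IO errors.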
Hey Gilly,
Yeah, this is an ongoing issue that we have raised with Setonix on several occasions. For now we are stuck with bare-metal builds on Setonix. I will upload some instructions for it later today.
Hey Gilly, to update you on this: Setonix's permission setup means I can't install things for a project I'm not a user in. So I'm trying to put together bare-metal instructions for you that make things as smooth as possible from your end. I'm testing some instructions I have put together this afternoon, and if things work out I'll send them through later.
Hi Jules,
I have been off grid for the past couple of weeks and am back in the office now. If you have a recipe at hand for the install I would love to give it a go! Cheers Gilly
Hi Gilly,
https://support.pawsey.org.au/documentation/display/US/Containers+changes
I'm going to rebuild the Docker image and try Singularity again on Setonix. I'll keep you posted. Cheers, J
Hello guys,
I've installed the latest UW2 container on Setonix (Pawsey Center) using Singularity and it went quite smoothly 👍
There are 2 versions of Singularity available on Setonix: 1) singularity/3.8.6-nompi and 2) singularity/3.8.6-mpi
I first ran a test job in serial using the singularity/3.8.6-nompi module and all went well. But when I try to run the same test job in parallel using the singularity/3.8.6-mpi module, I get an error message (related to HDF5, AFAICT) when the code tries to write the step 0 outputs (whether on one or on multiple ranks).
Below is the stdout returned when running the singularity/3.8.6-mpi version on a single core:
I suspect this is a Singularity problem and not a UW2 problem... Are you familiar with this type of error? I can report it to the Pawsey Center Helpdesk if you confirm this is a Singularity problem.
Cheers
Guillaume