underworldcode / underworld2

underworld2: A parallel, particle-in-cell, finite element code for Geodynamics.
http://www.underworldcode.org/

HDF5: infinite loop error on Setonix (using singularity/3.8.6-mpi) #668

Open gduclaux opened 1 year ago

gduclaux commented 1 year ago

Hello guys,

I've installed the latest UW2 container on Setonix (Pawsey Centre) using Singularity and it went quite smoothly 👍

There are 2 versions of Singularity available on Setonix: 1) singularity/3.8.6-nompi 2) singularity/3.8.6-mpi

I first ran a test job in serial using the singularity/3.8.6-nompi module and all went well.

But when I try to run the same test job in parallel using the singularity/3.8.6-mpi module, I get an error (HDF5-related AFAICT) when the code tries to write the step 0 outputs, whether on one rank or on multiple ranks.
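For reference, the job is submitted with a SLURM script along the lines of the sketch below (the account, image name, script name and resources are placeholders rather than my exact setup):

#!/bin/bash -l
#SBATCH --account=myproject      # placeholder project code
#SBATCH --nodes=1
#SBATCH --ntasks=4
#SBATCH --time=01:00:00

module load singularity/3.8.6-mpi

# One MPI rank per SLURM task; underworld2.sif and model.py are placeholder names.
srun -n ${SLURM_NTASKS} singularity exec underworld2.sif python3 model.py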

Below is the stdout returned when running the singularity/3.8.6-mpi version on a single core:

loaded rc file /opt/venv/lib/python3.10/site-packages/underworld/UWGeodynamics/uwgeo-data/uwgeodynamicsrc
    Global element size: 256x256
    Local offset of rank 0: 0x0
    Local range of rank 0: 256x256
In func WeightsCalculator_CalculateAll(): for swarm "UTTBHS5P__swarm"
    done 33% (21846 cells)...
    done 67% (43691 cells)...
    done 100% (65536 cells)...
WeightsCalculator_CalculateAll(): finished update of weights for swarm "UTTBHS5P__swarm"
/opt/venv/lib/python3.10/site-packages/underworld/UWGeodynamics/_model.py:1582: UserWarning: Skipping the steady state calculation: No diffusivity variable defined on Model
  warnings.warn("Skipping the steady state calculation: No diffusivity variable defined on Model")
Assertion failed in file ../../../../src/mpi/romio/adio/ad_cray/ad_cray_adio_open.c at line 520: liblustreapi != NULL
/opt/cray/pe/mpich/default/ofi/gnu/9.1/lib-abi-mpich/libmpi.so.12(MPL_backtrace_show+0x26) [0x14dcd6441c4b]
/opt/cray/pe/mpich/default/ofi/gnu/9.1/lib-abi-mpich/libmpi.so.12(+0x1ff3684) [0x14dcd5df3684]
/opt/cray/pe/mpich/default/ofi/gnu/9.1/lib-abi-mpich/libmpi.so.12(+0x2672775) [0x14dcd6472775]
/opt/cray/pe/mpich/default/ofi/gnu/9.1/lib-abi-mpich/libmpi.so.12(+0x26ae1c1) [0x14dcd64ae1c1]
/opt/cray/pe/mpich/default/ofi/gnu/9.1/lib-abi-mpich/libmpi.so.12(MPI_File_open+0x205) [0x14dcd6453625]
/opt/venv/lib/python3.10/site-packages/h5py/defs.cpython-310-x86_64-linux-gnu.so(+0x30c6bd) [0x14dcce6646bd]
/opt/venv/lib/python3.10/site-packages/h5py/defs.cpython-310-x86_64-linux-gnu.so(H5FD_open+0x13c) [0x14dcce457f1c]
/opt/venv/lib/python3.10/site-packages/h5py/defs.cpython-310-x86_64-linux-gnu.so(H5F_open+0x494) [0x14dcce449b94]
/opt/venv/lib/python3.10/site-packages/h5py/defs.cpython-310-x86_64-linux-gnu.so(H5VL__native_file_create+0x1a) [0x14dcce62e2ba]
/opt/venv/lib/python3.10/site-packages/h5py/defs.cpython-310-x86_64-linux-gnu.so(H5VL_file_create+0xcd) [0x14dcce6192cd]
/opt/venv/lib/python3.10/site-packages/h5py/defs.cpython-310-x86_64-linux-gnu.so(H5Fcreate+0x12c) [0x14dcce43d5bc]
/opt/venv/lib/python3.10/site-packages/h5py/defs.cpython-310-x86_64-linux-gnu.so(+0x66e02) [0x14dcce3bee02]
/opt/venv/lib/python3.10/site-packages/h5py/h5f.cpython-310-x86_64-linux-gnu.so(+0x4c7bf) [0x14dcccf377bf]
/opt/venv/bin/python3(+0x15c8de) [0x5628ca12a8de]
/opt/venv/lib/python3.10/site-packages/h5py/_objects.cpython-310-x86_64-linux-gnu.so(+0xc13b) [0x14dcdb6ce13b]
/opt/venv/bin/python3(_PyObject_MakeTpCall+0x25b) [0x5628ca1213bb]
/opt/venv/bin/python3(_PyEval_EvalFrameDefault+0x73b3) [0x5628ca11a583]
/opt/venv/bin/python3(_PyFunction_Vectorcall+0x7c) [0x5628ca12b12c]
/opt/venv/bin/python3(_PyEval_EvalFrameDefault+0x1a31) [0x5628ca114c01]
/opt/venv/bin/python3(_PyFunction_Vectorcall+0x7c) [0x5628ca12b12c]
/opt/venv/bin/python3(_PyObject_FastCallDictTstate+0x16d) [0x5628ca1205fd]
/opt/venv/bin/python3(+0x166d74) [0x5628ca134d74]
/opt/venv/bin/python3(+0x15376b) [0x5628ca12176b]
/opt/venv/bin/python3(PyObject_Call+0xbb) [0x5628ca13975b]
/opt/venv/bin/python3(_PyEval_EvalFrameDefault+0x2955) [0x5628ca115b25]
/opt/venv/bin/python3(+0x16ad71) [0x5628ca138d71]
/opt/venv/bin/python3(_PyEval_EvalFrameDefault+0x26c5) [0x5628ca115895]
/opt/venv/bin/python3(+0x16ab11) [0x5628ca138b11]
/opt/venv/bin/python3(_PyEval_EvalFrameDefault+0x1a31) [0x5628ca114c01]
/opt/venv/bin/python3(_PyFunction_Vectorcall+0x7c) [0x5628ca12b12c]
/opt/venv/bin/python3(_PyEval_EvalFrameDefault+0x816) [0x5628ca1139e6]
/opt/venv/bin/python3(_PyFunction_Vectorcall+0x7c) [0x5628ca12b12c]
MPICH ERROR [Rank 0] [job id 3488434.0] [Thu Jul 27 06:23:51 2023] [nid002309] - Abort(1): Internal error

HDF5: infinite loop closing library
      L,T_top,P,P,Z,FD,VL,VL,PL,E,SL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL
srun: error: nid002309: task 0: Exited with exit code 1
srun: launch/slurm: _step_signal: Terminating StepId=3488434.0

I suspect this is a Singularity problem and not a UW2 problem... are you familiar with this type of error? I can report it to the Pawsey Centre helpdesk if you confirm this is a Singularity problem.

Cheers

Guillaume

gduclaux commented 1 year ago

After digging further into the Pawsey doco I found this at https://pawsey.org.au/technical-newsletter/ (see the 13 March 2023 entry):

Parallel IO within Containers

Currently there are issues running MPI-enabled software that makes use of parallel IO from within a container being run by the Singularity container engine. The error message seen will be similar to:

Example of error message

Assertion failed in file ../../../../src/mpi/romio/adio/ad_cray/ad_cray_adio_open.c at line 520: liblustreapi != NULL
/opt/cray/pe/mpich/default/ofi/gnu/9.1/lib-abi-mpich/libmpi.so.12(MPL_backtrace_show+0x26) [0x14ac6c37cc4b]
/opt/cray/pe/mpich/default/ofi/gnu/9.1/lib-abi-mpich/libmpi.so.12(+0x1ff3684) [0x14ac6bd2e684]
/opt/cray/pe/mpich/default/ofi/gnu/9.1/lib-abi-mpich/libmpi.so.12(+0x2672775) [0x14ac6c3ad775]
/opt/cray/pe/mpich/default/ofi/gnu/9.1/lib-abi-mpich/libmpi.so.12(+0x26ae1c1) [0x14ac6c3e91c1]
/opt/cray/pe/mpich/default/ofi/gnu/9.1/lib-abi-mpich/libmpi.so.12(MPI_File_open+0x205) [0x14ac6c38e625]

Currently it is unclear exactly what is causing this issue. Investigations are ongoing.

Workaround:

There is no workaround that does not require a change in the workflow. Either the container needs to be rebuilt to not make use of parallel IO libraries (e.g. the container was built using parallel HDF5) or if that is not possible, the software stack must be built “bare-metal” on Setonix itself (see How to Install Software).
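The assertion about liblustreapi in my backtrace suggests the Cray MPI-IO layer is looking for the Lustre client library, which presumably isn't available inside the container. The trace also confirms the container is using parallel HDF5 (H5Fcreate ends up in MPI_File_open). A quick way to check that from inside the image (image name is a placeholder):

# Prints True if h5py was built against an MPI-enabled (parallel) HDF5.
singularity exec underworld2.sif python3 -c "import h5py; print(h5py.get_config().mpi)"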

I guess I'm about to install UW2 from source on Setonix... Would you have any step-by-step recipe at hand for this specific Cray machine? I found the one you put together for Magnus a few years back.
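In the meantime, my guess at the Cray-side starting point is sketched below; the module names and versions are assumptions on my part, so please correct them in your recipe:

# Rough guess only; module names/versions on Setonix may differ.
module load PrgEnv-gnu
module load cray-python
module load cray-hdf5-parallel

# Build Python dependencies in a venv, pointing mpi4py at the Cray compiler wrapper.
python3 -m venv uw2-env
source uw2-env/bin/activate
MPICC=cc pip install --no-binary=mpi4py mpi4py

# The UW2 build itself (PETSc etc.) would then follow whatever instructions you provide.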

julesghub commented 1 year ago

Hey Gilly,

Yeah, this is an ongoing issue we have raised with Setonix support on several occasions. For now we are stuck with bare-metal builds on Setonix. I will upload some instructions for it later today.

julesghub commented 1 year ago

Hey Gilly, to update you on this: Setonix's permission setup means I can't install things for a project I'm not a user in, so I'm trying to put together bare-metal instructions for you that make things as smooth as possible from your end. I'm testing some instructions I have put together this afternoon and if things work out I'll send them through later.

gduclaux commented 1 year ago

Hi Jules,

I have been off-grid for the past couple of weeks and am back in the office now. If you have a recipe at hand for the install I would love to give it a go! Cheers, Gilly

julesghub commented 1 year ago

Hi Gilly,

https://support.pawsey.org.au/documentation/display/US/Containers+changes

I'm going to rebuild the Docker image and try Singularity again on Setonix. I'll keep you posted. Cheers, J