nextsimhub / nextsimdg

neXtSIM_DG : next generation sea-ice model with DG
https://nextsim-dg.readthedocs.io/en/latest/?badge=latest
Apache License 2.0

single column test fails with MPI build on #534

Open TomMelt opened 2 months ago

TomMelt commented 2 months ago

If you build nextsim with MPI=ON and run the ThermoIntegration_test.py integration test, it fails with the following error (on Linux):

Error Output

```
$ python ThermoIntegration_test.py
HDF5-DIAG: Error detected in HDF5 (1.8.21) thread 0:
  #000: /tmp/melt/spack-stage/spack-stage-hdf5-1.8.21-dfgfamgva4yyx72fmjpldk33s3e6uap6/spack-src/src/H5A.c line 1638 in H5Aexists(): not a location
    major: Invalid arguments to routine
    minor: Inappropriate type
  #001: /tmp/melt/spack-stage/spack-stage-hdf5-1.8.21-dfgfamgva4yyx72fmjpldk33s3e6uap6/spack-src/src/H5Gloc.c line 193 in H5G_loc(): invalid group ID
    major: Invalid arguments to routine
    minor: Bad value
HDF5-DIAG: Error detected in HDF5 (1.8.21) thread 0:
  #000: /tmp/melt/spack-stage/spack-stage-hdf5-1.8.21-dfgfamgva4yyx72fmjpldk33s3e6uap6/spack-src/src/H5Adeprec.c line 176 in H5Acreate1(): not a location
    major: Invalid arguments to routine
    minor: Inappropriate type
  #001: /tmp/melt/spack-stage/spack-stage-hdf5-1.8.21-dfgfamgva4yyx72fmjpldk33s3e6uap6/spack-src/src/H5Gloc.c line 193 in H5G_loc(): invalid group ID
    major: Invalid arguments to routine
    minor: Bad value
terminate called after throwing an instance of 'netCDF::exceptions::NcFileMeta'
  what(): NetCDF: Can't add HDF5 file metadata
file: ncFile.cpp line:33
[lenny:23295] *** Process received signal ***
[lenny:23295] Signal: Aborted (6)
[lenny:23295] Signal code: (-6)
[lenny:23295] [ 0] /lib/x86_64-linux-gnu/libc.so.6(+0x42520)[0x7d9550bec520]
[lenny:23295] [ 1] /lib/x86_64-linux-gnu/libc.so.6(pthread_kill+0x12c)[0x7d9550c409fc]
[lenny:23295] [ 2] /lib/x86_64-linux-gnu/libc.so.6(raise+0x16)[0x7d9550bec476]
[lenny:23295] [ 3] /lib/x86_64-linux-gnu/libc.so.6(abort+0xd3)[0x7d9550bd27f3]
[lenny:23295] [ 4] /lib/x86_64-linux-gnu/libstdc++.so.6(+0xa2b9e)[0x7d9550e95b9e]
[lenny:23295] [ 5] /lib/x86_64-linux-gnu/libstdc++.so.6(+0xae20c)[0x7d9550ea120c]
[lenny:23295] [ 6] /lib/x86_64-linux-gnu/libstdc++.so.6(+0xae277)[0x7d9550ea1277]
[lenny:23295] [ 7] /lib/x86_64-linux-gnu/libstdc++.so.6(+0xae4d8)[0x7d9550ea14d8]
[lenny:23295] [ 8] /software/spack/opt/spack/linux-ubuntu22.04-skylake/gcc-11.4.0/netcdf-cxx4-4.3.1-kwvtkd7klice2xvellynedzj63eraxly/lib/libnetcdf_c++4.so.1(+0x26a2a)[0x7d9550aa3a2a]
[lenny:23295] [ 9] /software/spack/opt/spack/linux-ubuntu22.04-skylake/gcc-11.4.0/netcdf-cxx4-4.3.1-kwvtkd7klice2xvellynedzj63eraxly/lib/libnetcdf_c++4.so.1(_ZN6netCDF6NcFile5closeEv+0x33)[0x7d9550aab123]
[lenny:23295] [10] /home/melt/sync/cambridge/projects/current/sasip/nextsimdg/build-mpi/libnextsimlib.so(_ZN7Nextsim10ParaGridIO5closeERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEE+0x4e)[0x7d955276f43e]
[lenny:23295] [11] /home/melt/sync/cambridge/projects/current/sasip/nextsimdg/build-mpi/libnextsimlib.so(_ZN7Nextsim10ParaGridIO13closeAllFilesEv+0x9f)[0x7d955276f50d]
[lenny:23295] [12] /lib/x86_64-linux-gnu/libc.so.6(+0x45495)[0x7d9550bef495]
[lenny:23295] [13] /lib/x86_64-linux-gnu/libc.so.6(on_exit+0x0)[0x7d9550bef610]
[lenny:23295] [14] /lib/x86_64-linux-gnu/libc.so.6(+0x29d97)[0x7d9550bd3d97]
[lenny:23295] [15] /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x80)[0x7d9550bd3e40]
[lenny:23295] [16] ../nextsim(+0xc5f5)[0x5a2b7ab945f5]
[lenny:23295] *** End of error message ***
Aborted (core dumped)
E
======================================================================
ERROR: setUpClass (__main__.SingleColumnThermo)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/melt/sync/cambridge/projects/current/sasip/nextsimdg/test/ThermoIntegration_test.py", line 38, in setUpClass
    subprocess.run(cls.executable + " --config-file " + cls.config_file, shell=True, check=True)
  File "/home/melt/miniconda3/envs/nextsim/lib/python3.10/subprocess.py", line 526, in run
    raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '../nextsim --config-file ThermoIntegration.cfg' returned non-zero exit status 134.

----------------------------------------------------------------------
Ran 0 tests in 45.091s

FAILED (errors=1)
```
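
For reading the last lines of that output: the exit status 134 reported by `subprocess` matches the `Signal: Aborted (6)` line, since a POSIX shell reports a child killed by signal N as 128 + N. A minimal sketch of that decoding follows; the helper name `describe_exit_status` is made up for illustration and is not part of nextsimdg or its test suite.

```python
import signal


def describe_exit_status(status: int) -> str:
    """Describe an exit status as reported by a POSIX shell.

    Hypothetical helper for illustration only.
    """
    if status > 128:
        # Shell convention: 128 + N means the child was killed by signal N.
        try:
            return f"killed by {signal.Signals(status - 128).name}"
        except ValueError:
            # No signal with that number on this platform.
            return f"exit status {status} (no matching signal)"
    return f"exit status {status}"


# 134 is the status from the CalledProcessError in the log above.
print(describe_exit_status(134))
```

So the test harness is not failing on its own; it only reports that the nextsim executable aborted.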

~I am investigating why exactly this happens.~

The test fails because it relies on Paragrid, which has not yet been parallelized; in the backtrace, the abort is raised from ParaGridIO::closeAllFiles when the NetCDF file is closed. This should be fixed when the MPI parallelization of Paragrid (#495) is finished.

The single column test runs successfully when MPI=OFF.
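
Until #495 is finished, one possible stopgap would be to skip the single column test on MPI builds. The sketch below is only an assumption about how that could look: the `NEXTSIM_MPI_BUILD` environment variable and the `mpi_build_enabled` helper are hypothetical and do not exist in nextsimdg, and the test body is a placeholder rather than the real `SingleColumnThermo` test.

```python
import os
import unittest


def mpi_build_enabled(env=None):
    """Return True when the hypothetical NEXTSIM_MPI_BUILD variable is ON."""
    env = os.environ if env is None else env
    return env.get("NEXTSIM_MPI_BUILD", "OFF").upper() == "ON"


# A class-level skip also prevents setUpClass from running, so the
# executable is never launched on an MPI build.
@unittest.skipIf(mpi_build_enabled(), "Paragrid is not yet MPI-parallelized (#495)")
class SingleColumnThermo(unittest.TestCase):
    @classmethod
    def setUpClass(cls):
        # The real test runs the model here; omitted in this sketch.
        pass

    def test_placeholder(self):
        # Stand-in for the real assertions on the model output.
        self.assertTrue(True)
```

With the variable unset the class runs as usual; with `NEXTSIM_MPI_BUILD=ON` unittest reports it as skipped instead of erroring in `setUpClass`.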