nextsimhub / nextsimdg

neXtSIM_DG : next generation sea-ice model with DG
https://nextsim-dg.readthedocs.io/en/latest/?badge=latest
Apache License 2.0

Core dump for parallel run with RectGrid #490

Closed MarionBWeinzierl closed 5 months ago

MarionBWeinzierl commented 5 months ago

The script run/run_simple_example.sh runs through in serial mode.

However, if built with ENABLE_MPI=ON, the following error message occurs:

terminate called after throwing an instance of 'netCDF::exceptions::NcException'
what():  NetCDF: Parallel operation on file opened for non-parallel access
file: /home/nextsimdg/core/src/ParallelNetcdfFile.cpp  line:31

The same does not happen with, for example, the column example, which uses the ParaGrid as opposed to the RectGrid. This is the opposite of what one would expect, since MPI functionality is implemented for RectGrid (#331) but not for ParaGrid.

TomMelt commented 5 months ago

~So we need to find a better solution for this (my suggestion is to add the parallel metadata filename e.g., partition.nc to the config) but~ here are the steps:

  1. run the domain decomposition tool decomp
    mpirun -n 1 decomp --grid init_rect30x30.nc
  2. rename the generated metadata
    mv partition_metadata_1.nc partition.nc
  3. run the simple example using mpirun
    mpirun -n 1 ./nextsim --config-file config_simple_example.cfg

Disclaimer: this works for me. Let me know if it also runs on your machine :+1:
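The three steps above can be sketched as a small Python driver. This is only an illustration, not part of the repository: the function names `partition_steps` and `run_steps` are hypothetical, and the output name `partition_metadata_<nprocs>.nc` is an assumption extrapolated from the single-process case shown in the steps. By default it only prints the commands.

```python
import shlex
import subprocess


def partition_steps(grid="init_rect30x30.nc",
                    config="config_simple_example.cfg",
                    nprocs=1):
    """Return the three steps listed above as argument lists."""
    # Assumption: decomp writes partition_metadata_<nprocs>.nc.
    # The thread only shows the nprocs=1 case.
    metadata = f"partition_metadata_{nprocs}.nc"
    return [
        # 1. run the domain decomposition tool
        ["mpirun", "-n", str(nprocs), "decomp", "--grid", grid],
        # 2. rename the generated metadata
        ["mv", metadata, "partition.nc"],
        # 3. run the simple example using mpirun
        ["mpirun", "-n", str(nprocs), "./nextsim", "--config-file", config],
    ]


def run_steps(steps, dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    for args in steps:
        print(shlex.join(args))
        if not dry_run:
            # check=True stops at the first failing step
            subprocess.run(args, check=True)


if __name__ == "__main__":
    run_steps(partition_steps())
```

Calling `run_steps(partition_steps(), dry_run=False)` would actually execute the commands, which requires `mpirun`, `decomp` and `./nextsim` to be available in the working directory or on the PATH.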

Here is the current config (config_simple_example.cfg):

[model]
init_file = init_rect30x30.nc
start = 2010-04-01T00:00:00Z
[...]

~Here is my proposed solution:~ (see below for correct config)

[model]
init_file = init_rect30x30.nc
partition_file = partition.nc
start = 2010-04-01T00:00:00Z
[...]

TomMelt commented 5 months ago

Scrap that. I did some digging and found that it is already an option in the config. It is set to a default value here:

https://github.com/nextsimhub/nextsimdg/blob/27fed3768d68485c8194d8d09cbe5f282697e6e1/core/src/Model.cpp#L121

But users can override using this key:

https://github.com/nextsimhub/nextsimdg/blob/27fed3768d68485c8194d8d09cbe5f282697e6e1/core/src/Model.cpp#L41

So we just need to change the config to this:

$ cat config_simple_example.cfg 
[model]
init_file = init_rect30x30.nc
partition_file=partition_metadata_1.nc
start = 2010-04-01T00:00:00Z
stop = 2010-04-02T00:00:00Z
time_step = P0-0T0:10:0
missing_value = -3.40282346638e+38

We can decide what the best option is for the partition metadata file. I just realised that if we hard-code it to the output of decomp, the name will change depending on the number of procs. So perhaps it's a good idea to leave it as partition.nc.
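For illustration, the config change above can be applied programmatically. This is a minimal sketch that assumes the config file is plain INI that Python's `configparser` can read; the helper name `with_partition_file` is hypothetical and nothing here reflects how the model itself parses its config.

```python
import configparser
import io


def with_partition_file(config_text, partition_file="partition.nc"):
    """Return config_text with model.partition_file set to partition_file."""
    # Only '=' is treated as a delimiter so that values such as
    # 'P0-0T0:10:0' are never split on ':'. Interpolation is disabled
    # so that values are kept verbatim.
    parser = configparser.ConfigParser(delimiters=("=",), interpolation=None)
    parser.read_string(config_text)
    parser["model"]["partition_file"] = partition_file
    out = io.StringIO()
    parser.write(out)
    return out.getvalue()


if __name__ == "__main__":
    example = (
        "[model]\n"
        "init_file = init_rect30x30.nc\n"
        "start = 2010-04-01T00:00:00Z\n"
        "stop = 2010-04-02T00:00:00Z\n"
        "time_step = P0-0T0:10:0\n"
        "missing_value = -3.40282346638e+38\n"
    )
    print(with_partition_file(example))
```

Passing `partition_file="partition_metadata_1.nc"` reproduces the variant shown in the config listing, while the default keeps the fixed name `partition.nc`.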

TomMelt commented 5 months ago

You can also pass it as a command-line option, e.g.,

mpirun -n 1 ./nextsim --config-file config_simple_example.cfg --model.partition_file partition_metadata_1.nc

MarionBWeinzierl commented 5 months ago

Unfortunately, these steps do not work for me. Specifically, I receive the same error when I try to run the decomp tool, which suggests that the problem lies in the netcdf installation or something similar. I tried installing hdf5 with parallel support and netcdf from source, but this did not change the outcome. The same happens when using the Dockerfile.

MarionBWeinzierl commented 5 months ago

It is a call to nc_open_par that throws the error, so it seems that the netcdf file being read is in the wrong format? And since this also happens with the decomp tool, which reads a Python-generated file, that would mean this might be a problem with the python3-netcdf4 installation? No, probably not: the error does not say anything about a wrong format; it says the file is opened for non-parallel access, in the open function?
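One way to test the installation hypothesis is to ask the netCDF C library how it was built. The following is a minimal sketch, assuming the `nc-config` utility that ships with netcdf-c is on the PATH and understands the `--has-parallel4` and `--has-pnetcdf` flags (recent releases do; older ones may not). The function names are illustrative.

```python
import shutil
import subprocess


def flag_enabled(output):
    """Interpret the 'yes'/'no' text printed by nc-config for a --has-* flag."""
    return output.strip().lower() == "yes"


def netcdf_parallel_support(nc_config="nc-config"):
    """Return {flag: bool} for the parallel build flags, or None if
    the nc-config executable cannot be found."""
    if shutil.which(nc_config) is None:
        return None
    result = {}
    for flag in ("--has-parallel4", "--has-pnetcdf"):
        proc = subprocess.run([nc_config, flag],
                              capture_output=True, text=True)
        # Anything other than a plain "yes" (including a usage message
        # from an older nc-config) is treated as "not enabled".
        result[flag] = flag_enabled(proc.stdout)
    return result


if __name__ == "__main__":
    support = netcdf_parallel_support()
    if support is None:
        print("nc-config not found on PATH")
    else:
        for flag, enabled in support.items():
            print(f"{flag}: {'yes' if enabled else 'no'}")
```

If both flags report "no", the library that `nc-config` describes was built without parallel I/O. Note that this may not be the same library the executable was linked against when several netCDF installations are present.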

monsieuralok commented 5 months ago

I am trying to run nextsim in parallel using MPI, but I cannot find a "decomp" script or binary. I built it using the following:

cmake .. -DCMAKE_C_COMPILER=mpiicc -DCMAKE_CXX_COMPILER=mpiicc -DCMAKE_BUILD_TYPE=Release -DENABLE_MPI=ON
make

Is there something wrong?

MarionBWeinzierl commented 5 months ago

Hi, no, sorry, the decomp tool is in a separate repository: https://github.com/nextsimhub/domain_decomp .

I am going to write up some docs which make that explicit, and am also thinking about putting an example partition file into the run directory to be used with the simple example.

MarionBWeinzierl commented 5 months ago

OK, @TomMelt helped me sort out this problem, which was basically due to a combination of compiler and netcdf versions (funnily enough, it was working when I ran two different cmake commands one after the other without deleting the build directory in between).

@TomMelt's spack install, using all the right versions, sorted this out. Copying it here for info:

spack install boost@1.80.0+log+program_options cmake@3.26.3 eigen@3.4.0 netcdf-c@4.9.2+mpi+parallel-netcdf netcdf-cxx4@4.3.1 openmpi@4.1.2

In #491 I will add information on running with MPI to the docs.