Closed MarionBWeinzierl closed 5 months ago
~So we need to find a better solution for this (my suggestion is to add the parallel metadata filename, e.g. partition.nc, to the config) but~ here are the steps:
decomp
mpirun -n 1 decomp --grid init_rect30x30.nc
mv partition_metadata_1.nc partition.nc
mpirun
mpirun -n 1 ./nextsim --config-file config_simple_example.cfg
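For convenience, the steps above can be chained in a small script. This is only a sketch: it assumes `decomp` is on your PATH and that `./nextsim`, the grid file and the config sit in the current directory. `DRY_RUN`, `run_step` and `run_parallel_example` are names made up for this example, not part of nextsim or domain_decomp.

```shell
# Sketch only: chains the decomp and nextsim steps quoted above.
# DRY_RUN, run_step and run_parallel_example are illustrative names.
DRY_RUN="${DRY_RUN:-1}"   # 1 = only print the commands, 0 = execute them

run_step() {
    # Print the command, and execute it only when DRY_RUN=0.
    echo "+ $*"
    if [ "$DRY_RUN" = "0" ]; then
        "$@"
    fi
}

run_parallel_example() {
    # Stop at the first failing step.
    run_step mpirun -n 1 decomp --grid init_rect30x30.nc &&
    run_step mv partition_metadata_1.nc partition.nc &&
    run_step mpirun -n 1 ./nextsim --config-file config_simple_example.cfg
}
```

Calling `run_parallel_example` prints the three commands; setting `DRY_RUN=0` beforehand actually runs them.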
Disclaimer: this works for me. Let me know if it also runs on your machine :+1:
Here is the current config (config_simple_example.cfg):
[model]
init_file = init_rect30x30.nc
start = 2010-04-01T00:00:00Z
[...]
~Here is my proposed solution:~ (see below for correct config)
[model]
init_file = init_rect30x30.nc
partition_file = partition.nc
start = 2010-04-01T00:00:00Z
[...]
scrap that. I did some digging and I can see it is already an option in the config. It is set to a default value here:
But users can override using this key:
So we just need to change the config to this:
$ cat config_simple_example.cfg
[model]
init_file = init_rect30x30.nc
partition_file=partition_metadata_1.nc
start = 2010-04-01T00:00:00Z
stop = 2010-04-02T00:00:00Z
time_step = P0-0T0:10:0
missing_value = -3.40282346638e+38
We can maybe decide what the best option is for the partition metadata file. I just realised that if we hard-code it to the output of decomp, it will change depending on the number of procs. So perhaps it's a good idea to leave it as partition.nc.
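One way to keep the config independent of the process count would be to copy whatever decomp wrote to a fixed name. A minimal sketch, assuming decomp names its output `partition_metadata_<N>.nc` for N procs (inferred from the `partition_metadata_1.nc` seen with `-n 1`, not checked against the decomp source). Both function names are hypothetical.

```shell
# Assumption: decomp writes partition_metadata_<N>.nc for N procs.
partition_metadata_name() {
    echo "partition_metadata_$1.nc"
}

# Copy the decomp output for N procs to the fixed name partition.nc,
# so the config can always point at the same file.
stage_partition_file() {
    src="$(partition_metadata_name "$1")"
    if [ ! -f "$src" ]; then
        echo "missing $src: run decomp first" >&2
        return 1
    fi
    cp "$src" partition.nc
}
```

For example, `stage_partition_file 4` would copy `partition_metadata_4.nc` to `partition.nc`, or fail with a message if decomp has not been run for 4 procs.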
You can also pass it as a command-line option, e.g.,
mpirun -n 1 ./nextsim --config-file config_simple_example.cfg --model.partition_file partition_metadata_1.nc
Unfortunately, these steps do not work for me. Specifically, I receive the same error when I try to run the decomp tool, which hints that the problem is in the netcdf installation. I tried installing hdf5 with parallel support and netcdf from source, but this did not change the outcome. The same happens when using the Dockerfile.
It is a call to nc_open_par which throws the error, so it seems that the netcdf file which is supposed to be read in is in the wrong format? And that would mean, as it happens with the decomp tool, which reads in a python-generated file, that this might be a problem with the python3-netcdf4 installation? No, probably not: it does not say anything about wrong format, it says it is opened for non-parallel access, in the open function?
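A quick way to narrow this down might be to ask the netCDF tools themselves what they support. The sketch below assumes `nc-config` and `ncdump` from the netcdf-c installation in question are on the PATH. `check_netcdf_parallel` is a hypothetical helper name. It only reports what the tools say and does not prove that this is the cause of the error.

```shell
# Report whether the netcdf-c found on PATH was built with parallel I/O,
# and (optionally) which on-disk format a given file uses.
check_netcdf_parallel() {
    file="$1"
    if ! command -v nc-config >/dev/null 2>&1; then
        echo "nc-config not found on PATH" >&2
        return 2
    fi
    echo "netCDF version: $(nc-config --version)"
    echo "parallel I/O:   $(nc-config --has-parallel)"
    # ncdump -k prints the format kind, e.g. "classic" or "netCDF-4".
    if [ -n "$file" ] && command -v ncdump >/dev/null 2>&1; then
        echo "format of $file: $(ncdump -k "$file")"
    fi
}
```

Running `check_netcdf_parallel init_rect30x30.nc` in the run directory would show both the library capabilities and the format of the python-generated grid file. If several netcdf installations are present, it is worth checking that the `nc-config` found here belongs to the one nextsim was linked against.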
I am going to run parallel nextsim using MPI, but I have not found the "decomp" script or binary. I built it using the following:
cmake .. -DCMAKE_C_COMPILER=mpiicc -DCMAKE_CXX_COMPILER=mpiicc -DCMAKE_BUILD_TYPE=Release -DENABLE_MPI=ON
make
Is there something wrong?
Hi, no, sorry, the decomp tool is in a separate repository: https://github.com/nextsimhub/domain_decomp .
I am going to write up some docs which make that explicit, and am also thinking about putting an example partition file into the run directory to be used with the simple example.
OK, @TomMelt helped me sort out this problem, which was basically due to a combination of compiler and netcdf versions (funnily enough, it was working when I ran two different cmake commands one after the other without deleting the build directory in between).
@TomMelt's spack install using all the right versions sorted this out; copying it here for info:
spack install boost@1.80.0+log+program_options cmake@3.26.3 eigen@3.4.0 netcdf-c@4.9.2+mpi+parallel-netcdf netcdf-cxx4@4.3.1 openmpi@4.1.2
In #491 I will add information on running with MPI to the docs.
The script run/run_simple_example.sh runs through in serial mode.
However, if built with ENABLE_MPI=ON, the following error message occurs:
The same does not happen with, for example, the column example, which uses the ParaGrid as opposed to the RectGrid. This is the opposite of what one would expect, as MPI functionality is implemented for RectGrid (#331) but not for ParaGrid.