mom-ocean / MOM5

The Modular Ocean Model
https://mom-ocean.github.io/
GNU Lesser General Public License v3.0

GFDL-ESM2M piControl does not run #377

Open Jete90 opened 1 year ago

Jete90 commented 1 year ago

Hello,

I downloaded the MOM5 code to the WHOI supercomputer.

After compiling GFDL-ESM2M, I tried to run it but quickly ran into segmentation faults.

I attached the error message below.

It might be due to the modules/compiler versions that I am using.

This is what my environment looks like:

source $MODULESHOME/init/csh
module load intel
module load netcdf/intel/4.6.1
module load openmpi/intel

setenv mpirunCommand "mpirun -np"

Kind regards

Jens


ERROR MESSAGE


[...]

LND(ATMOCNLND)= 0.153673308874230 0.153673308874230 0.153673308871445
NOTE from PE 0: xgrid_mod: reading exchange grid information from mosaic grid file
NOTE from load_xgrid(xgrid_mod): field 'scale' exist in the file INPUT/land_mosaicXocean_mosaic.nc, this field will be read and the exchange grid cell area will be multiplied by scale
Checked data is array of constant 1
LND(LNDOCN)= 0.703873657789463 0.703873657789466 0.703873657789463
OCN(LNDOCN)= 0.703873657789467 0.703873657789463 0.703873657789466

FATAL from PE 31: ==>Error from coupler_types_mod (CT_spawn_1d_3d): Disordered k-dimension index bound list 1 0

FATAL from PE 32: ==>Error from coupler_types_mod (CT_spawn_1d_3d): Disordered k-dimension index bound list 1 0

[.....]

fms_ESM2M.x        0000000000452D04  Unknown            Unknown  Unknown
fms_ESM2M.x        000000000045BD03  Unknown            Unknown  Unknown
fms_ESM2M.x        00000000004556BF  Unknown            Unknown  Unknown
fms_ESM2M.x        000000000040E19E  Unknown            Unknown  Unknown
libc-2.17.so       00002AAAAC544555  __libc_start_main  Unknown  Unknown
fms_ESM2M.x        000000000040E0A9  Unknown            Unknown  Unknown

MPI_ABORT was invoked on rank 30 in communicator MPI_COMM_WORLD with errorcode 1.

NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes. You may or may not see output from other processes, depending on exactly when Open MPI kills them.

forrtl: error (78): process killed (SIGTERM)
Image              PC                Routine            Line     Source
fms_ESM2M.x        0000000002A8FDEE  forsignal_handl    Unknown  Unknown
libpthread-2.17.s  00002AAAAC315630  Unknown            Unknown  Unknown
libpthread-2.17.s  00002AAAAC312573  pthread_spin_lock  Unknown  Unknown
forrtl: error (78): process killed (SIGTERM)
Image              PC                Routine            Line     Source
fms_ESM2M.x        0000000002A8FDEE  forsignal_handl    Unknown  Unknown
libpthread-2.17.s  00002AAAAC315630  Unknown            Unknown  Unknown
libpthread-2.17.s  00002AAAAC312573  pthread_spin_lock  Unknown  Unknown
forrtl: error (78): process killed (SIGTERM)
Image              PC                Routine            Line     Source
fms_ESM2M.x        0000000002A8FDEE  forsignal_handl    Unknown  Unknown
libpthread-2.17.s  00002AAAAC315630  Unknown            Unknown  Unknown
libpthread-2.17.s  00002AAAAC312573  pthread_spin_lock  Unknown  Unknown
forrtl: error (78): process killed (SIGTERM)
Image              PC                Routine            Line     Source
fms_ESM2M.x        0000000002A8FDEE  forsignal_handl    Unknown  Unknown
libpthread-2.17.s  00002AAAAC315630  Unknown            Unknown  Unknown
libpthread-2.17.s  00002AAAAC312573  pthread_spin_lock  Unknown  Unknown
[pn030:263631] Process received signal
[pn030:263631] Signal: Segmentation fault (11)
[pn030:263631] Signal code: Address not mapped (1)
[pn030:263631] Failing at address: 0x28
[pn030:263631] [ 0] /lib64/libpthread.so.0(+0xf630)[0x2aaaabe1d630]
[pn030:263631] [ 1] /vortexfs1/apps/openmpi-3.0.1-intel/lib/openmpi/mca_pmix_pmix2x.so(+0xb2723)[0x2aaab86c1723]
[pn030:263631] [ 2] /vortexfs1/apps/openmpi-3.0.1-intel/lib/openmpi/mca_pmix_pmix2x.so(pmix_ptl_base_recv_handler+0x579)[0x2aaab86c24a9]
[pn030:263631] [ 3] /vortexfs1/apps/openmpi-3.0.1-intel/lib/libopen-pal.so.40(opal_libevent2022_event_base_loop+0xa09)[0x2aaaab021829]
[pn030:263631] [ 4] /vortexfs1/apps/openmpi-3.0.1-intel/lib/openmpi/mca_pmix_pmix2x.so(+0x9d0f2)[0x2aaab86ac0f2]
[pn030:263631] [ 5] /lib64/libpthread.so.0(+0x7ea5)[0x2aaaabe15ea5]
[pn030:263631] [ 6] /lib64/libc.so.6(clone+0x6d)[0x2aaaac128b0d]
[pn030:263631] End of error message
Segmentation fault
ERROR: Model failed to run to completion

russfiedler commented 1 year ago

@Jete90 This bug originates from using an old netCDF version, as documented in https://github.com/NOAA-GFDL/CM4/issues/11 and https://github.com/NOAA-GFDL/icebergs/issues/44

You'll need to update netCDF to 4.7.3 or later.

wienkers commented 1 year ago

As a follow-up to Jens' question: does this mean that many of the .res.nc restart files included in the provided ESM2M piControl test setup are corrupt? I have netCDF v4.7.4, and regardless of whether I compile with the netCDF4 flag on or off, I still get the same error Jens ran into. Thank you in advance for your help! Aaron

russfiedler commented 1 year ago

@wienkers The bug was specific to the iceberg restarts, as far as I remember. It's quite possible there are other problems with non-ocean restarts.

wienkers commented 1 year ago

Thank you for the quick reply, @russfiedler. After a bit more digging, this no longer seems to arise from the netCDF bug. The error:

Error from coupler_types_mod (CT_spawn_1d_3d): Disordered k-dimension index bound list    1    0   

points back to flux_exchange_init, where

call mpp_get_compute_domain( Ice%domain, is, ie, js, je )
kd = size(Ice%ice_mask,3)
call coupler_type_copy(ex_gas_fields_ice, Ice%ocean_fields, is, ie, js, je, kd, &
     'ice_flux', Ice%axes, Time, suffix = '_ice')

At run time, kd = 6 on the Ice/Atm processes (as it should be for num_part = 6 in the input.nml), but kd = 0 on the Ocean processes, each of which then throws the error. This block of code is evaluated on all processes; however, the call to subroutine ice_model_init in coupler_init, which allocates Ice%ice_mask, appears to happen only on the Ice processes. On the Ocean processes the array is never allocated, so the size query on Ice%ice_mask in the block above simply comes back as 0.
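
For illustration only, here is a standalone sketch (made-up names, not MOM5 source) of the suspected mechanism: the third extent of the ice mask is only meaningful on PEs where the ice component allocated it, and querying it elsewhere is what produces the bogus kd = 0.

! Standalone sketch (hypothetical names) of the suspected mechanism. Querying size()
! on an array that was never allocated is non-conforming Fortran; guarding the query
! with allocated() reproduces the kd = 6 vs kd = 0 split seen at run time.
program kd_demo
  implicit none
  logical :: is_ice_pe
  real, allocatable :: ice_mask(:,:,:)
  integer :: kd

  is_ice_pe = .false.                            ! pretend this PE runs only the ocean component
  if (is_ice_pe) allocate(ice_mask(10, 10, 6))   ! ice_model_init would do this on ice PEs

  if (allocated(ice_mask)) then
     kd = size(ice_mask, 3)                      ! 6 on ice PEs, matching num_part
  else
     kd = 0                                      ! what the ocean PEs effectively end up with
  end if

  print *, 'kd = ', kd
end program kd_demo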

russfiedler commented 1 year ago

@wienkers Ah, yes, I vaguely remember that being a possibility and that this code should only be evaluated on the Ice processors. I can't remember whether it's sufficient to wrap the code in an if (Ice%pe) then ... endif block, but it should be.
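
For reference, a minimal sketch of the guard suggested above (untested; it assumes Ice%pe is the logical flag on the ice data type marking PEs that run the ice component, and the variable names follow the snippet quoted earlier in the thread):

! Untested sketch of the suggested guard: evaluate the block only on ice PEs, so that
! size(Ice%ice_mask,3) is never queried where ice_model_init did not allocate the array.
if (Ice%pe) then
   call mpp_get_compute_domain( Ice%domain, is, ie, js, je )
   kd = size(Ice%ice_mask,3)
   call coupler_type_copy(ex_gas_fields_ice, Ice%ocean_fields, is, ie, js, je, kd, &
        'ice_flux', Ice%axes, Time, suffix = '_ice')
endif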