mom-ocean / MOM6

Modular Ocean Model

Baltic crashes and OM4_05 hangs in prod mode (-O3) #836

Closed nikizadehgfdl closed 6 years ago

nikizadehgfdl commented 6 years ago

The OM4_05 example hangs when the executable is built in prod mode (-O3). It runs fine in repro (-O2) and debug (-O0) modes. This is with the top of the dev/gfdl branch (MOM6 commit 773902a7, SIS2 commit fbf4ab59946, warsaw_201803 for the rest).

Also, the Baltic test case crashes in prod mode with the following traceback, but runs fine in repro mode.

FATAL from PE     0: The linear system is singular !

  A=                     NaN                     NaN                     NaN

forrtl: error (76): Abort trap signal
Image              PC                Routine            Line        Source             
fms_MOM6_SIS2_com  0000000001741956  mpp_mod_mp_mpp_er          69  mpp_util_mpi.inc
fms_MOM6_SIS2_com  00000000014B0CC3  regrid_solvers_mp          58  regrid_solvers.F90
fms_MOM6_SIS2_com  00000000013CDF8C  regrid_edge_value         555  regrid_edge_values.F90
fms_MOM6_SIS2_com  0000000000AA1202  mom_remapping_mp_         402  MOM_remapping.F90
fms_MOM6_SIS2_com  0000000000A9F5AE  mom_remapping_mp_         207  MOM_remapping.F90
fms_MOM6_SIS2_com  0000000001263E21  mom_diag_remap_mp         339  MOM_diag_remap.F90
fms_MOM6_SIS2_com  0000000000A9307C  mom_diag_mediator         991  MOM_diag_mediator.F90
fms_MOM6_SIS2_com  00000000010FD35E  mom_diagnostics_m         280  MOM_diagnostics.F90
fms_MOM6_SIS2_com  00000000007A7556  mom_mp_step_mom_          767  MOM.F90
fms_MOM6_SIS2_com  0000000000751A58  ocean_model_mod_m         556  ocean_model_MOM.F90
fms_MOM6_SIS2_com  000000000040B5EB  MAIN__                   1021  coupler_main.F90

Note that prod mode is the current setting for all production experiments, and this was not an issue with the dev/gfdl/2018.04.11 tag or earlier.

nikizadehgfdl commented 6 years ago

Both of the above anomalies kick in with the waves_update commit to dev/gfdl. Both models work fine before that update.

In these experiments USE_WAVES = False, but there are updates in src/parameterizations/vertical/MOM_vert_friction.F90 that might cause these, consistent with what @raymenzel has observed using a debugger.

@breichl, could you check what in these updates might trip up the compiler's optimization passes (loop reordering, ...)? We'll dig in too.

nikizadehgfdl commented 6 years ago

The apparent hang of OM4_05 is due to so many u and v truncation errors being written to file that the cores bog down. The model actually crashes because of too many extreme values:

WARNING from PE   373: Extreme surface sfc_state detected: i=  35 j=  16 x=  -9.355 y=  73.607 D= 2.5419E+03 SSH= 6.1646E+02 SST=-5.8620E-01 SSS= 3.4571E+01 U-=        NaN U+=        NaN V-=        NaN V+=        NaN
WARNING from PE   371: Extreme surface sfc_state detected: i=   5 j=  21 x= -63.757 y=  73.822 D= 1.0036E+03 SSH= 3.7280E+03 SST=-1.3445E+00 SSS= 3.3268E+01 U-= 0.0000E+00 U+=-4.2807E+00 V-= 5.3146E+00 V+=-5.2890E+00

FATAL from PE   299: There were a total of     72995 locations detected with extreme surface values!
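
To see which fields have gone non-finite in warnings like the two above, a small log filter can help. This is a minimal Python sketch, not part of MOM6: the function name `nan_fields` and the regular expression are assumptions inferred only from the line layout quoted in this comment.

```python
import math
import re

# Matches "key= value" pairs such as "SSH= 6.1646E+02" or "U-=        NaN".
# The pattern is an assumption based on the warning lines quoted above.
FIELD = re.compile(r"([A-Za-z]+[+-]?)=\s*(\S+)")

def nan_fields(line):
    """Return the names of the fields in one warning line whose value is NaN."""
    bad = []
    for key, raw in FIELD.findall(line):
        try:
            value = float(raw)
        except ValueError:
            continue  # not a number, so not a field we care about
        if math.isnan(value):
            bad.append(key)
    return bad

line = ("WARNING from PE   373: Extreme surface sfc_state detected: "
        "i=  35 j=  16 x=  -9.355 y=  73.607 D= 2.5419E+03 SSH= 6.1646E+02 "
        "SST=-5.8620E-01 SSS= 3.4571E+01 U-=        NaN U+=        NaN "
        "V-=        NaN V+=        NaN")
print(nan_fields(line))
```

For the PE 373 line this lists the four velocity fields, while the PE 371 line (large but finite values) yields an empty list.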

adcroft commented 6 years ago

So the optimization is causing the model to blow up, not hang. That is not an easy problem to debug. The only way I know is to selectively optimize parts of the code and find the culprit by elimination. Which version of the Intel compiler is this?
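
The elimination can be organized as a bisection over source files, so that about log2(N) rebuilds are needed instead of N. The following is only a sketch under stated assumptions: `runs_ok` is a hypothetical callback standing in for "rebuild and run the test case", the file names are placeholders, and the search assumes exactly one file is responsible and that the failure is reproducible.

```python
def find_culprit(files, runs_ok):
    """Bisect for the one file that breaks the run when built at -O3.

    runs_ok(o3_files) is a hypothetical callback: it should rebuild with only
    o3_files at -O3 (everything else at -O2), run the test case, and return
    True if the run is clean. Assumes exactly one file is responsible.
    """
    candidates = list(files)
    while len(candidates) > 1:
        mid = len(candidates) // 2
        first_half = candidates[:mid]
        if runs_ok(first_half):
            candidates = candidates[mid:]  # culprit must be in the second half
        else:
            candidates = first_half        # culprit is in the first half
    return candidates[0]

# Stand-in for a real build-and-run step, for illustration only.
sources = ["file_a.F90", "file_b.F90", "file_c.F90", "file_d.F90"]

def fake_runs_ok(o3_files):
    return "file_c.F90" not in o3_files

print(find_culprit(sources, fake_runs_ok))
```

If more than one file is involved, or the failure only appears when two optimized files interact, this simple bisection no longer applies and a slower one-at-a-time sweep is the safer fallback.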

nikizadehgfdl commented 6 years ago

I compiled MOM_vert_friction.F90 with -O2 and the rest of the files with -O3, and both problems went away. So the issue is with the -O3 optimization of MOM_vert_friction.F90 as of the waves_update commit to dev/gfdl (https://github.com/NOAA-GFDL/MOM6/commit/de8ed887ef6715c1ff0e715c94a7859dac60a4f1). The issue was probably there even before that commit and is just triggered by it.
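
The per-file override described above can be expressed as a small command generator. This is a hedged sketch, not the actual GFDL build setup (which goes through make and compiler templates): the function name `compile_commands` is hypothetical, and the bare `ifort -c` invocation is an assumption that omits the include paths and other flags a real build needs.

```python
def compile_commands(sources, overrides, default_opt="-O3", compiler="ifort"):
    """Return one compile command per source file.

    overrides maps a file name to the optimization flag to use for that file;
    every other file gets default_opt. The command layout is an assumption.
    """
    return [[compiler, "-c", overrides.get(src, default_opt), src]
            for src in sources]

cmds = compile_commands(
    ["MOM.F90", "MOM_vert_friction.F90"],
    overrides={"MOM_vert_friction.F90": "-O2"},
)
for cmd in cmds:
    print(" ".join(cmd))
```

Keeping the override in one dictionary makes it easy to record which files were demoted to -O2 and to undo the workaround once the underlying problem is fixed.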

Zhi-Liang commented 6 years ago

Hi Niki,

You could try splitting MOM_vert_friction.F90 into multiple files to figure out which routine has the issue with -O3.

Greetings,

Zhi

raymenzel commented 6 years ago

Running the GFDL ESM4 model using GFDL's "production" compiler settings and the top of the dev/gfdl branch for MOM6 results in the same issue Niki described. Basically most ranks detect that the velocity (u) has exceeded the maximum value allowed by the model and start writing to the velocity truncation files, which slows down the model considerably and eventually leads to a crash. When I turn on DEBUG = True in the MOM_input file and compare the stdout to a run done using the MOM6-examples dev/gfdl.2018.04.11 tag, the first difference in answers is reported during the 3rd call to the btcalc routine. Running the code through the debugger, the model starts to slow down during the first call to the vertvisc routine (which is part of the MOM_vert_friction module and contains the call that checks if the velocities are in the expected range).

I am using the Intel compiler, version 16.0.3.210, and running on Gaea's c3 and c4 partitions.
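
Finding the first point where the stdout of two DEBUG runs diverges, as described above, can be automated. The snippet below is a minimal sketch: the function name `first_difference` is hypothetical, and the sample log lines are made up for illustration and do not reproduce the real MOM6 DEBUG output format.

```python
from itertools import zip_longest

def first_difference(lines_a, lines_b):
    """Return (line_number, a, b) for the first differing line, or None.

    Line numbers are 1-based. If one log is shorter, the missing side is None.
    """
    for n, (a, b) in enumerate(zip_longest(lines_a, lines_b), start=1):
        if a != b:
            return n, a, b
    return None

# Hypothetical log excerpts, for illustration only.
reference = ["step 1 field_x 1.0", "step 2 field_x 2.0", "step 3 field_x 3.0"]
candidate = ["step 1 field_x 1.0", "step 2 field_x 2.5", "step 3 field_x 9.9"]
print(first_difference(reference, candidate))
```

In practice the two lists would come from reading each run's captured stdout with `splitlines()`, after filtering out lines that legitimately differ between runs, such as timestamps.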

Hallberg-NOAA commented 6 years ago

This issue was successfully addressed by PR https://github.com/NOAA-GFDL/MOM6/pull/838, and has therefore been closed.