Closed by nikizadehgfdl.
Both of the above anomalies kick in with the waves_update commit to dev/gfdl. Both models work fine before that update.
In these experiments USE_WAVES = False, but there are updates in src/parameterizations/vertical/MOM_vert_friction.F90 that might cause these, consistent with what @raymenzel has observed using a debugger.
@breichl, could you check which of these updates the compiler's optimization passes (loop reordering, ...) might be mishandling? We'll dig in too.
The apparent hang of OM4_05 is caused by so many u and v truncation errors being written to file that the cores get bogged down. The model actually aborts because of too many extreme values:
WARNING from PE 373: Extreme surface sfc_state detected: i= 35 j= 16 x= -9.355 y= 73.607 D= 2.5419E+03 SSH= 6.1646E+02 SST=-5.8620E-01 SSS= 3.4571E+01 U-= NaN U+= NaN V-= NaN V+= NaN
WARNING from PE 371: Extreme surface sfc_state detected: i= 5 j= 21 x= -63.757 y= 73.822 D= 1.0036E+03 SSH= 3.7280E+03 SST=-1.3445E+00 SSS= 3.3268E+01 U-= 0.0000E+00 U+=-4.2807E+00 V-= 5.3146E+00 V+=-5.2890E+00
FATAL from PE 299: There were a total of 72995 locations detected with extreme surface values!
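For sifting through many warnings like the ones above, a small parser can help separate NaN velocities from merely large values. This is a minimal sketch that assumes the line layout is exactly as in the messages quoted here; `parse_warning` and `has_nan` are hypothetical helper names, not part of MOM6.

```python
import math
import re

# Matches "NAME= value" pairs such as "SSH= 6.1646E+02" or "U-= NaN".
# Assumption: every field follows the layout of the warnings quoted above.
FIELD_RE = re.compile(
    r"([A-Za-z][A-Za-z+\-]*)=\s*(NaN|[-+]?\d+(?:\.\d*)?(?:[Ee][-+]?\d+)?)"
)
HEAD_RE = re.compile(
    r"WARNING from PE\s+(\d+): Extreme surface sfc_state detected:"
)


def parse_warning(line):
    """Return (pe, fields) for an extreme-surface warning, or None otherwise."""
    head = HEAD_RE.match(line)
    if head is None:
        return None
    rest = line[head.end():]
    fields = {name: float(value) for name, value in FIELD_RE.findall(rest)}
    return int(head.group(1)), fields


def has_nan(fields):
    """True if any parsed field is NaN."""
    return any(math.isnan(value) for value in fields.values())


if __name__ == "__main__":
    sample = (
        "WARNING from PE 373: Extreme surface sfc_state detected: i= 35 j= 16 "
        "x= -9.355 y= 73.607 D= 2.5419E+03 SSH= 6.1646E+02 SST=-5.8620E-01 "
        "SSS= 3.4571E+01 U-= NaN U+= NaN V-= NaN V+= NaN"
    )
    pe, fields = parse_warning(sample)
    print(pe, has_nan(fields), fields["SSH"])
```

Counting how many warnings contain NaN versus finite but large velocities could hint at whether the fields are corrupted outright or are growing until they blow up.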
So the optimization is causing the model to blow up, not hang. That is not an easy problem to debug. The only way I know is to selectively optimize parts of the code and find the culprit by elimination. Which version of Intel is this?
-- Alistair Adcroft
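The elimination approach can be organized as a bisection over source files. The sketch below assumes that a single file is responsible and that the failure is deterministic. `run_ok` is a hypothetical callable that builds the given files at -O3 and everything else at -O2, runs the model, and reports whether the run was clean; the `file_*.F90` names are placeholders.

```python
def find_culprit(files, run_ok):
    """Bisect `files` to find the one that breaks the run when built at -O3.

    Assumes exactly one culprit and a deterministic failure.
    `run_ok(o3_files)` must return True when the model runs cleanly with
    only `o3_files` built at -O3 (hypothetical user-supplied callable).
    """
    candidates = list(files)
    while len(candidates) > 1:
        half = candidates[: len(candidates) // 2]
        if run_ok(half):
            # This half is clean at -O3, so the culprit is in the other half.
            candidates = candidates[len(half):]
        else:
            candidates = half
    return candidates[0]


if __name__ == "__main__":
    # Stand-in for a real build-and-run step: pretend one file is at fault.
    def fake_run_ok(o3_files):
        return "MOM_vert_friction.F90" not in o3_files

    sources = [
        "file_a.F90",
        "file_b.F90",
        "MOM_vert_friction.F90",
        "file_c.F90",
        "file_d.F90",
    ]
    print(find_culprit(sources, fake_run_ok))
```

Each step costs one rebuild and one run, so the number of trials grows with the logarithm of the file count rather than linearly.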
I compiled MOM_vert_friction.F90 with -O2 and the rest of the files with -O3, and both problems went away. So the issue is with the -O3 optimization of MOM_vert_friction.F90 in the waves_update commit to dev/gfdl (https://github.com/NOAA-GFDL/MOM6/commit/de8ed887ef6715c1ff0e715c94a7859dac60a4f1). The issue was probably there even before that commit and is just tickled by it.
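The per-file override described here can be expressed as a small lookup. This is only an illustration: the real GFDL build passes many more flags through its own templates, and `opt_flag` and `compile_command` are hypothetical helpers, with `ifort -c` standing in for the full compile line.

```python
import os.path


def opt_flag(source, default="-O3", overrides=None):
    """Pick the optimization flag for one source file.

    `overrides` maps a file's base name to the flag it should get instead
    of `default`.
    """
    overrides = overrides or {}
    return overrides.get(os.path.basename(source), default)


def compile_command(source, compiler="ifort", default="-O3", overrides=None):
    """Build a simplified compile command (illustrative, not the real build line)."""
    return [compiler, "-c", opt_flag(source, default, overrides), source]


if __name__ == "__main__":
    overrides = {"MOM_vert_friction.F90": "-O2"}
    for src in (
        "src/parameterizations/vertical/MOM_vert_friction.F90",
        "src/some_dir/file_a.F90",  # placeholder path
    ):
        print(" ".join(compile_command(src, overrides=overrides)))
```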
Hi Niki,
You could try splitting MOM_vert_friction.F90 into multiple files to figure out which routine has the issue when compiled with -O3.
Greetings,
Zhi
Running the GFDL ESM4 model using GFDL's "production" compiler settings and the top of the dev/gfdl branch for MOM6 reproduces the issue Niki described. Most ranks detect that the velocity (u) has exceeded the maximum value allowed by the model and start writing to the velocity truncation files, which slows the model down considerably and eventually leads to a crash. When I set DEBUG = True in the MOM_input file and compare the stdout to that of a run using the MOM6-examples dev/gfdl.2018.04.11 tag, the first difference in answers is reported during the 3rd call to the btcalc routine. Running the code through the debugger, the model starts to slow down during the first call to the vertvisc routine (which is part of the MOM_vert_friction module and contains the call that checks whether the velocities are in the expected range).
I am using the Intel compiler, version 16.0.3.210, and running on Gaea's c3 and c4 partitions.
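Locating the first point where two DEBUG runs diverge can be automated by comparing their stdout line by line. The sketch below assumes both runs print the same sequence of diagnostic lines up to the divergence; `first_difference` is a hypothetical helper and the sample lines are invented placeholders, not real MOM6 output.

```python
from itertools import zip_longest


def first_difference(lines_a, lines_b):
    """Return (line_number, a, b) for the first differing line, or None.

    Line numbers are 1-based. If one input is shorter, the missing side is
    reported as None.
    """
    pairs = zip_longest(lines_a, lines_b)
    for number, (a, b) in enumerate(pairs, start=1):
        if a != b:
            return number, a, b
    return None


if __name__ == "__main__":
    # Invented placeholder lines standing in for two runs' stdout.
    reference = ["step 1 value=10", "step 2 value=20", "step 3 value=30"]
    candidate = ["step 1 value=10", "step 2 value=20", "step 3 value=31"]
    print(first_difference(reference, candidate))
```

For real logs, the two inputs could be open file objects, since iterating over a file yields its lines one at a time.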
This issue was successfully addressed by PR https://github.com/NOAA-GFDL/MOM6/pull/838, and has therefore been closed.
The OM4_05 example hangs when the executable is built in prod mode (-O3). It runs OK in repro (-O2) and debug (-O0) modes. This is with the top of the dev/gfdl branch (MOM6 commit 773902a7, SIS2 commit fbf4ab59946, warsaw_201803 for the rest).
Also, the Baltic test case crashes in prod mode with the following traceback, but runs OK in repro mode.
Note that prod mode is the current setting for all production experiments, and this was not an issue with the dev/gfdl/2018.04.11 tag or before.