mom-ocean / MOM5

The Modular Ocean Model
https://mom-ocean.github.io/
GNU Lesser General Public License v3.0

Floating point exception in ocean_bihgen_friction_init #254

Open aekiss opened 6 years ago

aekiss commented 6 years ago

Just noting here a floating point exception I have encountered 3 times in access-om2-01 with fms_ACCESS-OM_afe80bfd.x since September. These are non-reproducible crashes that occur in the first few minutes before any output diagnostics are output. It generally works the second time when I resubmit the same job.

[r324:19323:0] Caught signal 8 (Floating point exception)
==== backtrace ====
 2 0x000000000005a64c mxm_handle_error()  /var/tmp/OFED_topdir/BUILD/mxm-3.6.3104/src/mxm/util/debug/debug.c:641
 3 0x000000000005a7bc mxm_error_signal_handler()  /var/tmp/OFED_topdir/BUILD/mxm-3.6.3104/src/mxm/util/debug/debug.c:616
 4 0x0000000000032510 killpg()  ??:0
 5 0x0000000000ca3e8c ocean_bihgen_friction_mod_mp_ocean_bihgen_friction_init_()  /short/x77/nah599/access-om2/src/mom/src/mom5/ocean_param/lateral/ocean_bihgen_friction.F90:786
 6 0x0000000000c656b7 ocean_bih_friction_mod_mp_ocean_bih_friction_init_()  /short/x77/nah599/access-om2/src/mom/src/mom5/ocean_param/lateral/ocean_bih_friction.F90:219
 7 0x0000000000448332 ocean_model_mod_mp_ocean_model_init_()  /short/x77/nah599/access-om2/src/mom/src/mom5/ocean_core/ocean_model.F90:1317
 8 0x00000000004118ee MAIN__()  /short/x77/nah599/access-om2/src/mom/src/accessom_coupler/ocean_solo.F90:348
 9 0x000000000040ddde main()  ??:0
10 0x000000000001ed1d __libc_start_main()  ??:0
11 0x000000000040dce9 _start()  ??:0
===================

The offending line, 786, is:

782   ! ensure that background viscosities are not too large
783   do k=1,nk
784      do j=jsc,jec
785         do i=isc,iec
786            if(aiso_back(i,j,k)   > visc_crit(i,j))  aiso_back(i,j,k)   = visc_crit(i,j)
787            if(aaniso_back(i,j,k) > visc_crit(i,j))  aaniso_back(i,j,k) = visc_crit(i,j)
788         enddo
789      enddo
790   enddo
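
For reference, the loop above only caps each background viscosity at the local critical value. The following is a minimal Python sketch of that clamp, not the MOM5 code. The function name `clamp_viscosity` and the finiteness check are hypothetical additions for illustration; whether non-finite values play any role in this crash is not established in this thread.

```python
import math


def clamp_viscosity(visc_back, visc_crit):
    """Cap visc_back[k][j][i] at visc_crit[j][i], in place.

    Returns the (k, j, i) indices where either input is NaN or infinite.
    Those points are reported and left unchanged rather than compared.
    The finiteness check is an illustrative assumption and is not part
    of the original Fortran loop.
    """
    non_finite = []
    for k, level in enumerate(visc_back):
        for j, row in enumerate(level):
            for i, value in enumerate(row):
                crit = visc_crit[j][i]
                if not (math.isfinite(value) and math.isfinite(crit)):
                    non_finite.append((k, j, i))
                    continue
                if value > crit:
                    row[i] = crit  # same effect as the Fortran assignment
    return non_finite
```

Calling it on nested lists shaped `[nk][nj][ni]` for the viscosity and `[nj][ni]` for the critical value returns any suspicious indices, which could then be compared against the processor domain that crashed.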

StephenGriffies commented 6 years ago

As the offending code is within the model initialization, I wonder what you mean by

"occur in the first few minutes before any output diagnostics are output."

These lines of code should only be accessed during model initialization.

aekiss commented 6 years ago

I meant the first few minutes of walltime, i.e. during initialization.

StephenGriffies commented 6 years ago

Ok, so that is odd. It does appear to be a system issue, no? As this line of code looks innocent, I wonder where the actual problem might be. Do you have a strategy for uncovering the issue?

shweta121sharma commented 4 years ago

Hi Andrew, I am also getting a floating-point exception during model initialization. Could you please let me know how you resolved the issue?

https://gist.github.com/shweta121sharma/b5c46fec57b68c801c1c9e285da36ae3

russfiedler commented 4 years ago

Hi @shweta121sharma, your error is caused by a negative salinity and is unrelated to the error reported here. Check that your initial restart file contains no negative salinities, including at points holding missing values. The initialisation can do some vertical interpolation in topog.F90, which lets negatives creep into the wet ocean. I was caught by this when interpolating WOA13 to a fine grid without filling quite enough points.
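
The check described above can be sketched as follows. This is a minimal pure-Python illustration, assuming the salinity field and a wet-point mask have already been read from the restart file into nested lists indexed `[k][j][i]`. The function name `find_bad_salinity`, the mask argument, and the default `missing_value` are assumptions for illustration and are not taken from MOM5 or from the restart file format.

```python
import math


def find_bad_salinity(salt, wet_mask, missing_value=-1.0e20):
    """Return (k, j, i, value) for every wet point with unusable salinity.

    A wet point is flagged if its salinity is negative, non-finite, or
    equal to the (assumed) missing value. Land points are skipped because
    the model does not use them.
    """
    bad = []
    for k, level in enumerate(salt):
        for j, row in enumerate(level):
            for i, value in enumerate(row):
                if not wet_mask[k][j][i]:
                    continue  # land: unfilled values here are harmless
                if (not math.isfinite(value)
                        or value < 0.0
                        or value == missing_value):
                    bad.append((k, j, i, value))
    return bad
```

An empty result means no wet point carries a negative or missing salinity. Any returned indices point to cells where the fill or interpolation did not reach far enough.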

Also, errors like yours are better reported first to the MOM mailing list (this is best), as others may have experienced them, or on one of the ARCCSS Slack channels. If there is an issue with the code, we can then bring it here to GitHub.