ufs-community / ufs-weather-model

UFS Weather Model

sporadic floating point errors in FV3/atmos_cubed_sphere/model/a2b_edge.F90 for nested configurations #2360

Open SamuelTrahanNOAA opened 1 month ago

SamuelTrahanNOAA commented 1 month ago

Description

Regional configurations sporadically abort on Hera with a floating-point exception in subroutine a2b_ord2 in FV3/atmos_cubed_sphere/model/a2b_edge.F90, at the line marked below:

    if (gridstruct%grid_type < 3) then

       if (gridstruct%bounded_domain) then

          do j=js-2,je+1+2   
             do i=is-2,ie+1+2
                qout(i,j) = 0.25*(qin(i-1,j-1)+qin(i,j-1)+qin(i-1,j)+qin(i,j)) ! <------- crashes here
             enddo
          enddo

       else

The statement contains only additions and one multiplication, so the exception most likely comes from a NaN in qin. That could be uninitialized memory, or boundary values that were never filled and so still hold the signalling NaN they are initialized with.
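One way to narrow this down is to scan qin for NaN over the full range the loop reads before the averaging runs. The sketch below is a standalone illustration, not code from the model: the tile bounds are hypothetical, and a quiet NaN stands in for the signalling NaN so the demo itself does not trap. Because the loop runs over i=is-2..ie+3 and j=js-2..je+3 and reads i-1 and j-1, the values it touches span is-3..ie+3 and js-3..je+3.

```fortran
program nan_stencil_check
  use, intrinsic :: ieee_arithmetic, only: ieee_is_nan, ieee_value, ieee_quiet_nan
  implicit none

  ! Hypothetical tile bounds, chosen only for illustration.
  integer, parameter :: is = 1, ie = 4, js = 1, je = 4

  ! Range of qin read by the bounded-domain loop in a2b_ord2:
  ! i-1 and j-1 for i=is-2..ie+3, j=js-2..je+3.
  real :: qin(is-3:ie+3, js-3:je+3)
  integer :: i, j, nbad

  qin = 1.0

  ! Simulate a single boundary value that was never filled.
  ! A quiet NaN is used here so this demo does not raise an exception.
  qin(ie+3, js) = ieee_value(1.0, ieee_quiet_nan)

  ! Report every NaN location inside the range the stencil reads.
  nbad = 0
  do j = js-3, je+3
     do i = is-3, ie+3
        if (ieee_is_nan(qin(i,j))) then
           nbad = nbad + 1
           print '(a,i0,a,i0)', 'NaN in qin at i=', i, ' j=', j
        end if
     end do
  end do

  print '(a,i0)', 'NaN count: ', nbad
end program nan_stencil_check
```

A similar scan placed just before the crashing loop, under a debug flag, would show whether the bad values sit in the boundary rows or in the interior, which separates the "unfilled boundary" explanation from the "uninitialized memory" one. Whether ieee_is_nan stays silent on a signalling NaN with trapping enabled depends on the compiler and flags, so treat that as something to confirm on Hera.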

The crashes seem to start with hash 8e7b61b1 in PR #2327, which adds a new omega calculation to the dynamical core. It's hard to be certain, since the crash doesn't happen every time.

Presently, the regression test system lacks any error checking, so it cannot distinguish a crash like this from a test whose results have changed.

To Reproduce:

  1. Enable error checking in the workflow, so it'll pause on error instead of reporting the test as changing results.
  2. Run the regression tests on Hera a few times.
  3. Check for floating point exceptions in failed tests.

Additional context

Only tested on Hera.

SamuelTrahanNOAA commented 1 month ago

My issue description had an error: all regional configurations are affected, whether they have a nest or not.

climbfuji commented 2 weeks ago

Was this closed by #2335?

SamuelTrahanNOAA commented 2 weeks ago

This PR fixed it: