ufs-community / ufs-weather-model

UFS Weather Model

gnv1_nested_intel intermittent inability to match against WM baselines on wcoss/hercules/orion #2127

Open zach1221 opened 7 months ago

zach1221 commented 7 months ago

Description

gnv1_nested_intel frequently fails to match its baselines on WCOSS2, Orion and Hercules. The test runs to completion but does not match, even when new baselines are created first to ensure any changes to the test are captured.

To Reproduce:

  1. Log in to WCOSS2.
  2. Clone the develop branch of ufs-weather-model.
  3. Edit rt.conf in ufs-weather-model/tests/ so that only gnv1_nested runs.
  4. Run the test: ./rt.sh -a nems -e -l rt.conf
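
Step 3 above can be scripted. The following is a minimal sketch, assuming rt.conf uses pipe-separated lines in which each `COMPILE` line is followed by the `RUN` lines that depend on it; that layout and the helper name `keep_only_test` are assumptions for illustration, not something confirmed in this thread.

```python
def keep_only_test(conf_lines, test_name):
    """Reduce rt.conf lines to one RUN entry plus the COMPILE line before it.

    Assumes (not confirmed here) that fields are separated by '|', the first
    field is COMPILE or RUN, and the second field of a RUN line is the test name.
    """
    kept = []
    pending_compile = None  # most recent COMPILE line not yet written out
    for line in conf_lines:
        fields = [field.strip() for field in line.split("|")]
        kind = fields[0]
        if kind == "COMPILE":
            pending_compile = line
        elif kind == "RUN" and len(fields) > 1 and fields[1] == test_name:
            if pending_compile is not None:
                kept.append(pending_compile)
                pending_compile = None
            kept.append(line)
    # Comments, blank lines, and unrelated entries are dropped.
    return kept
```

The returned lines would be written to a scratch copy of rt.conf and passed to rt.sh with `-l`, leaving the original file untouched.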

Additional context

The test has failed consistently on WCOSS. It passes occasionally on Orion and Hercules if run repeatedly. Example run directory: /work2/noaa/stmp/jongkim/stmp/jongkim/FV3_RT/rt_2634567/gnv1_nested_intel
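
One way to narrow down an intermittent mismatch is to run the test twice from the same code and see which output files differ between the two run directories. The sketch below is an illustration only: the function names and the `pattern` parameter are hypothetical, and it does a plain byte-for-byte comparison, which may or may not be how rt.sh itself compares against baselines.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large model output need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def file_digests(run_dir, pattern="*"):
    """Map each matching file (path relative to run_dir) to its SHA-256."""
    root = Path(run_dir)
    return {
        str(path.relative_to(root)): sha256_of(path)
        for path in root.rglob(pattern)
        if path.is_file()
    }


def differing_files(run_a, run_b, pattern="*"):
    """List files that are missing from one run or whose bytes differ."""
    a = file_digests(run_a, pattern)
    b = file_digests(run_b, pattern)
    return sorted(name for name in a.keys() | b.keys() if a.get(name) != b.get(name))
```

Logs and timing files will differ between any two runs, so in practice `pattern` would be restricted to the model output files that the baseline comparison checks. If the same subset of files differs each time, that points at the component that writes them.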

zach1221 commented 7 months ago

@BrianCurtis-NOAA would you be able to add your recent WCOSS experiment path to the additional context section?

zach1221 commented 7 months ago

Hi, @SamuelTrahanNOAA . When you have time, could you help us look into this issue?

SamuelTrahanNOAA commented 7 months ago

When did this start happening?

BrianCurtis-NOAA commented 7 months ago

For me on WCOSS2, a good few weeks, but it's been intermittent. The baselines are still there, but the test is disabled for WCOSS2 for now. You can re-enable it and compare against those baselines, if that helps.

SamuelTrahanNOAA commented 7 months ago

I need to know the specific PR that broke it.

zach1221 commented 7 months ago

> I need to know the specific PR that broke it.

I can dig through and find out for you.

zach1221 commented 7 months ago

> I need to know the specific PR that broke it.

WM PR #2098 is the earliest PR I can find where the test failed to match on the first attempt.

SamuelTrahanNOAA commented 7 months ago

PR 2098 changes some NSSL microphysics code. The regression test never uses that code. It is likely that either:

  1. A prior PR broke it, or
  2. This problem has always been there, but we didn't notice it until recently.

Debugging a problem like this is difficult when one cannot run in debug mode. UFS nesting does not work in debug mode. It can't even compile with the GNU compiler due to syntax errors. (For example, using . instead of % to access derived type members.)
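
If a prior PR did introduce the mismatch, a bisection over the merge history could locate it, provided the intermittency is accounted for. The sketch below is a generic illustration, not an existing tool: `first_bad_commit` and the `run_matches` callback (which would build a commit, run the test, and report whether it matched) are hypothetical names, and it assumes the oldest commit in the list is known to be good.

```python
def first_bad_commit(commits, run_matches, repeats=5):
    """Binary-search for the first commit at which the test stops matching.

    commits     -- ordered oldest to newest; commits[0] is assumed good and
                   commits[-1] is assumed bad
    run_matches -- hypothetical callback: run_matches(commit) -> True if the
                   test matched its baseline for that commit
    repeats     -- a commit counts as good only if every repeat matches,
                   because a bad commit may still match some of the time
    """
    if len(commits) < 2:
        raise ValueError("need at least one good and one bad commit")

    def is_good(commit):
        return all(run_matches(commit) for _ in range(repeats))

    low, high = 0, len(commits) - 1  # invariant: commits[low] good, commits[high] bad
    while high - low > 1:
        middle = (low + high) // 2
        if is_good(commits[middle]):
            low = middle
        else:
            high = middle
    return commits[high]
```

If a bad commit fails to match with probability p on each run, then `repeats` clean runs still mislabel it as good with probability (1 - p)^repeats, so a test that usually passes needs many repeats. If the problem has always been there, no known-good starting commit exists and the search result is meaningless.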

zach1221 commented 7 months ago

> PR 2098 changes some NSSL microphysics code. The regression test never uses that code. It is likely that either:
>
>   1. A prior PR broke it, or
>   2. This problem has always been there, but we didn't notice it until recently.
>
> Debugging a problem like this is difficult when one cannot run in debug mode. UFS nesting does not work in debug mode. It can't even compile with the GNU compiler due to syntax errors. (For example, using . instead of % to access derived type members.)

@SamuelTrahanNOAA Are you ok with me closing this issue? It would mean gnv1_nested remains disabled on wcoss, and hercules/orion.

SamuelTrahanNOAA commented 7 months ago

> Are you ok with me closing this issue? It would mean gnv1_nested remains disabled on wcoss, and hercules/orion.

No. That regression test must run on all platforms. We must find out why it is failing.

zach1221 commented 7 months ago

> > Are you ok with me closing this issue? It would mean gnv1_nested remains disabled on wcoss, and hercules/orion.
>
> No. That regression test must run on all platforms. We must find out why it is failing.

Ok, no problem. What do you think is the best way to investigate this further without being able to run in debug?

SamuelTrahanNOAA commented 7 months ago

The only way I can think of is to get debug mode working with UFS FV3 nesting.

SamuelTrahanNOAA commented 6 months ago

I found two bugs in the nesting:

I've got fixes for both of them which I'll PR soon. It's unlikely those will fix this issue since they're both "it crashes or it runs" sorts of bugs.