ufs-community / ufs-srweather-app

UFS Short-Range Weather Application

GST_release_public_v1 test fails on Hera in latest develop #874

Closed mkavulich closed 1 year ago

mkavulich commented 1 year ago

Expected behavior

WE2E test GST_release_public_v1 should run successfully on all platforms.

Current behavior

Currently the test fails at the run_fcst step with the line

FATAL from PE 7: compute_qs: saturation vapor pressure table overflow, nbad= 1

followed by a core dump. This typically indicates a CFL violation/model instability.

The full log file is attached below. This occurs in the current develop as well as at hash f9696e1 (July 10), and likely in earlier hashes too.
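For anyone triaging similar failures, a minimal sketch of pulling the FATAL lines out of a run_fcst log is given here. The regular expression is inferred only from the single message quoted above, and the function name `find_fatal_lines` is hypothetical, not part of the SRW App.

```python
import re

# Pattern inferred from the one message quoted in this issue (an assumption,
# not a documented log format): "FATAL from PE <rank>: <message>"
FATAL_RE = re.compile(r"FATAL from PE\s+(\d+):\s*(.*)")


def find_fatal_lines(text):
    """Return a list of (pe_rank, message) tuples for every FATAL line in text."""
    hits = []
    for line in text.splitlines():
        match = FATAL_RE.search(line)
        if match:
            hits.append((int(match.group(1)), match.group(2).strip()))
    return hits


# The message quoted in this issue, used as sample input.
sample = (
    "FATAL from PE 7: compute_qs: saturation vapor pressure "
    "table overflow, nbad= 1"
)
print(find_fatal_lines(sample))
```

To scan a real log, read the file contents (for example run_fcst_mem000_2019061500.log) and pass the text to the same function.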

Machines affected

Hera. I have not noticed this on other machines, but I cannot be sure whether it is Hera-specific.

Edit: note that this is for the Intel compiler in community mode (strangely, the GNU build seems to succeed). I have not tested in NCO mode.

Steps To Reproduce

  1. Run the WE2E test GST_release_public_v1.
  2. Observe the failure at the run_fcst step.

Output

run_fcst_mem000_2019061500.log

MichaelLueken commented 1 year ago

@mkavulich -

Very interesting. I take it that the test is failing on Hera using the Intel compiler? I ask because the Hera coverage tests are passing, and GST_release_public_v1 is part of the Hera GNU coverage suite. I wonder why the test is failing for Hera Intel, but not Hera GNU.

mkavulich commented 1 year ago

Yes, sorry for the missing detail: this is for Intel. Here is a link to my working directory for the latest develop: /scratch2/BMC/fv3lam/kavulich/UFS/workdir/test_develop/2023-07-26/expt_dirs/GST_release_public_v1

MichaelLueken commented 1 year ago

The GST_release_public_v1 test also fails on Orion, with the same error message:

FATAL from PE 7: compute_qs: saturation vapor pressure table overflow, nbad= 1

at the exact same location (~27 steps).

The link to my working directory for the latest develop on Orion is: /work/noaa/epic-ps/mlueken/expt_dirs/GST_release_public_v1

MichaelLueken commented 1 year ago

PR #799 (hash 294e18b) appears to be the point at which the GST_release_public_v1 test began failing on Intel systems. DT_ATMOS was already decreased to address issues with RRFS_CONUS_25km tests with FV3_GFS_v15p2 CCPP physics. I will try testing with different DT_ATMOS settings to see if the test can once again pass.

mkavulich commented 1 year ago

Thanks @MichaelLueken, that makes sense, since the failure seems to be model instability again. Since this was a test specifically for the v1 release, it might make sense to return to the DT_ATMOS=40 used in that release for that specific test. But a higher value would probably also work.
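As a rough illustration of why the time step matters for stability, the sketch below computes a simple horizontal advective Courant number, C = u * dt / dx, for a few DT_ATMOS values. The 25 km grid spacing and the 100 m/s wind speed are assumptions chosen for illustration, and the real model sub-steps its dynamics, so these numbers are qualitative only and are not a diagnosis of this failure.

```python
def courant_number(speed_m_s, dt_s, dx_m):
    """Advective Courant number C = u * dt / dx (dimensionless)."""
    return speed_m_s * dt_s / dx_m


# Illustrative assumptions, not values taken from the failing experiment:
DX_M = 25_000.0     # assumed 25 km horizontal grid spacing
SPEED_M_S = 100.0   # assumed strong jet-level wind speed

for dt_atmos in (40, 100, 200, 400):
    c = courant_number(SPEED_M_S, dt_atmos, DX_M)
    note = "within 1" if c <= 1.0 else "exceeds 1"
    print(f"DT_ATMOS={dt_atmos:4d} s  C={c:.2f}  ({note})")
```

Under these assumptions, smaller DT_ATMOS values give proportionally smaller Courant numbers, which is the usual motivation for reducing the time step when a run goes unstable.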

MichaelLueken commented 1 year ago

@mkavulich -

I tried various DT_ATMOS values (40 - 400) for the GST_release_public_v1 test on Hera Intel, and only setting this to 40 allowed the test to pass. Values higher than 400 led to segfaults in run_fcst. Unfortunately, running the GST_release_public_v1 test on Hera GNU with DT_ATMOS=40 caused the test to fail due to CFL violations:

FATAL from PE 2: compute_qs: saturation vapor pressure table overflow, nbad= 1

So it looks like, for a given DT_ATMOS, the test will pass with either the GNU compilers or the Intel compilers, but not both.

Are there other parameters that can be tweaked to try to correct these errors? Or will we need to add a GST_release_public_v1_intel and GST_release_public_v1_gnu, set DT_ATMOS=40 for GST_release_public_v1_intel, create comprehensive*gnu suites that use GST_release_public_v1_gnu, and change the current comprehensive suites to use GST_release_public_v1_intel?

mkavulich commented 1 year ago

I don't think a convoluted solution is necessary. This is an old test using now-unsupported data and a now-unsupported physics suite. And we don't actually know if it originally worked on Hera GNU, since that wasn't tested regularly until recently.

I am almost of the mind that the test should be removed (for the above reasons) if it can't be fixed for all platforms, but this is something that probably needs wider discussion.

MichaelLueken commented 1 year ago

At the August 3rd SRW App Code Management meeting, @gsketefian noted that the GST_release_public_v1 test was only meant for SRWv1 testing, so it can be removed now.