mkavulich closed this issue 1 year ago.
@mkavulich -
Very interesting. I take it that the test is failing on Hera with the Intel compiler? I ask because the Hera coverage tests are passing, and GST_release_public_v1 is part of the Hera GNU coverage suite. I wonder why the test fails on Hera Intel but not on Hera GNU.
Yes, sorry for leaving out that detail: this is with the Intel compiler. My working directory for the latest develop is: /scratch2/BMC/fv3lam/kavulich/UFS/workdir/test_develop/2023-07-26/expt_dirs/GST_release_public_v1
The GST_release_public_v1 test also fails on Orion, with the same error message:
FATAL from PE 7: compute_qs: saturation vapor pressure table overflow, nbad= 1
at exactly the same point (after ~27 time steps).
My working directory for the latest develop on Orion is:
/work/noaa/epic-ps/mlueken/expt_dirs/GST_release_public_v1
PR #799 (hash 294e18b) appears to be the point at which the GST_release_public_v1 test began failing on Intel systems. DT_ATMOS had already been decreased to address issues with RRFS_CONUS_25km tests using the FV3_GFS_v15p2 CCPP physics suite. I will try different DT_ATMOS settings to see whether the test can pass again.
Thanks @MichaelLueken, that makes sense, since the failure again looks like model instability. Because this test was written specifically for the v1 release, it might make sense to return to the DT_ATMOS=40 used in that release for this test, though a somewhat higher value would probably also work.
@mkavulich -
I tried various DT_ATMOS values (40 to 400) for the GST_release_public_v1 test on Hera Intel, and only setting it to 40 allowed the test to pass. Values higher than 400 led to segfaults in run_fcst. Unfortunately, running the GST_release_public_v1 test on Hera GNU with DT_ATMOS=40 caused the test to fail due to CFL violations:
FATAL from PE 2: compute_qs: saturation vapor pressure table overflow, nbad= 1
So it looks like, with the settings tried so far, the test can be made to pass with either the GNU compilers or the Intel compilers, but not both.
Are there other parameters that can be tweaked to correct these errors? Or will we need to add separate GST_release_public_v1_intel and GST_release_public_v1_gnu tests, set DT_ATMOS=40 for GST_release_public_v1_intel, create comprehensive*gnu suites that use GST_release_public_v1_gnu, and change the current comprehensive suites to use GST_release_public_v1_intel?
I don't think a convoluted solution is necessary. This is an old test using now-unsupported data and a now-unsupported physics suite, and we don't actually know whether it originally worked on Hera GNU, since that platform wasn't tested regularly until recently.
I'm inclined to think the test should be removed for the reasons above if it can't be fixed on all platforms, but this probably needs wider discussion.
At the August 3rd SRW App Code Management meeting, @gsketefian noted that the GST_release_public_v1 test was meant only for SRW v1 testing, so it can now be removed.
Expected behavior
WE2E test GST_release_public_v1 should run successfully on all platforms.
Current behavior
Currently, the test fails at the run_fcst step with the line
FATAL from PE 7: compute_qs: saturation vapor pressure table overflow, nbad= 1
followed by a core dump. This typically indicates a CFL violation or other model instability.
The full log file is attached below. This occurs in the current develop branch as well as at hash f9696e1 (July 10), and likely in earlier hashes as well.
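As a rough illustration of why shortening DT_ATMOS can cure this kind of instability: the advective Courant number C = U·Δt/Δx grows linearly with the time step, and explicit schemes typically become unstable once it exceeds a scheme-dependent limit. This is a simplified, hypothetical sketch: the 25 km grid spacing and 100 m/s wind speed are assumed values for illustration, not read from this test's configuration, and FV3's actual stability behavior also depends on its acoustic sub-stepping and vertical motion, so it does not predict which DT_ATMOS values will pass.

```python
def courant_number(wind_speed_ms: float, dt_s: float, dx_m: float) -> float:
    """Return the advective Courant number C = U * dt / dx (dimensionless)."""
    return wind_speed_ms * dt_s / dx_m


# Hypothetical values for illustration only (not taken from the test's config):
DX_M = 25_000.0  # assumed horizontal grid spacing, ~25 km
U_MS = 100.0     # assumed peak wind speed, m/s

# DT_ATMOS values spanning the range explored in this thread, in seconds.
for dt_atmos in (40, 100, 200, 400):
    c = courant_number(U_MS, dt_atmos, DX_M)
    print(f"DT_ATMOS={dt_atmos:>3d} s -> C = {c:.2f}")
```

Because C scales linearly with Δt, going from DT_ATMOS=400 to 40 reduces it tenfold for the same flow; the value at which the model actually stays stable still has to be found empirically, as was done in this thread.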
Machines affected
Hera. I have not noticed this on other machines, but I cannot be sure whether it is Hera-specific.
Edit: note that this is for the Intel compiler in community mode (strangely, the GNU compiler seems to succeed). I have not tested in NCO mode.
Steps To Reproduce
Output
run_fcst_mem000_2019061500.log