ufs-community / ufs-weather-model

UFS Weather Model

feature test issues for rrfs_smoke_conus13km_hrrr_warm #1222

Open junwang-noaa opened 2 years ago

junwang-noaa commented 2 years ago

Description

PR #1195 added a feature test rrfs_smoke_conus13km_hrrr_warm using suite file FV3_HRRR_smoke. The test owner needs to confirm that the feature test reproduces results with different thread counts, decompositions, and MPI task counts, and in restart mode. It should also be able to run in debug mode. Currently the test fails in the decomposition and debug variants.

To Reproduce:

Check out the branch in PR #1195 and run rrfs_smoke_conus13km_hrrr_warm with different threading, decomposition, and MPI task counts, in restart mode, and in debug mode.
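A minimal sketch of what "reproduce results" could mean in practice, assuming a byte-for-byte comparison of output files between a control run and a variant run. The function name compare_runs and its directory arguments are hypothetical; the regression test system performs its own comparison, and this only illustrates the idea.

```shell
#!/bin/sh
# Hypothetical helper: report whether every file in the control directory
# has a byte-identical counterpart in the variant directory.
compare_runs() {
  control_dir=$1
  variant_dir=$2
  status=0
  for control_file in "$control_dir"/*; do
    [ -f "$control_file" ] || continue
    name=$(basename "$control_file")
    # cmp -s is silent and only sets the exit status; a missing
    # counterpart also counts as a difference.
    if cmp -s "$control_file" "$variant_dir/$name" 2>/dev/null; then
      echo "OK       $name"
    else
      echo "DIFFERS  $name"
      status=1
    fi
  done
  return "$status"
}
```

Called as compare_runs CONTROL_DIR VARIANT_DIR, it returns nonzero as soon as any file differs, which is the condition the decomposition and restart variants would need to avoid.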

junwang-noaa commented 2 years ago

The issue was fixed in PR#1257. The issue will be closed.

SamuelTrahanNOAA commented 2 years ago

@junwang-noaa This was NOT fixed in #1257. Please re-open this issue so I don't have to make a new one.

junwang-noaa commented 2 years ago

Sorry, I see that PR #1257 fixed the reproducibility for hrrr_control, not rrfs_smoke_conus13km_hrrr_warm.

SamuelTrahanNOAA commented 2 years ago

Actually, the hrrr_control variants already worked, they just weren't enabled. The reproducibility fix in that PR was for the rap_decomp.

DeniseWorthen commented 2 years ago

Can this issue be closed @junwang-noaa @SamuelTrahanNOAA ?

SamuelTrahanNOAA commented 2 years ago

No. This problem is not resolved.

SamuelTrahanNOAA commented 2 years ago

I can fix the debug and 2threads variants in this PR: https://github.com/ufs-community/ufs-weather-model/pull/1437 Sadly, as yet, I have no fix for the restart or decomp variants.

However, I suspect this bug may be breaking decomp: https://github.com/ufs-community/ufs-weather-model/issues/1436 if it is using data from halo regions. I have no way to fix that bug, nor even confirm my suspicions, since that code goes well beyond my understanding of the boundary generation.

zach1221 commented 1 year ago

I decided to test rrfs_smoke_conus13km_hrrr_warm with the decomposition, restart, and MPI variants (I know debug and 2threads should now be passing since #1437 was merged), and it seems everything passed. @SamuelTrahanNOAA have you had the opportunity to test again recently?

SamuelTrahanNOAA commented 1 year ago

They fail for me. How did you test?

You need to use the tests/tests files, not just change environment variables. The RRFS tests ignore several environment variables, and they're always warm starts.

SamuelTrahanNOAA commented 1 year ago

The RRFS has hard-coded values for some variables. If you're using an automated tool that tweaks variables, it won't test anything.

These values are hard-coded:

export INPES=12
export JNPES=12
export WARM_START=.true.

All RRFS runs are warm starts.

To do a restart test, you need to set RRFS_RESTART=YES. For a decomposition test, you need a different tests/tests file with different values for INPES and JNPES.
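A small sketch of why tweaking environment variables has no effect, assuming the test file simply re-exports the values after the caller has set them. The function name rrfs_hard_coded_settings and the preset values 6, 24, and .false. are hypothetical; only the three exports inside the function come from the list above.

```shell
#!/bin/sh
# Values an automated tool might preset before the test file is read
# (hypothetical; chosen only to differ from the hard-coded ones).
export INPES=6
export JNPES=24
export WARM_START=.false.

# Stand-in for the hard-coded fragment quoted above. The function name is
# hypothetical; the three exports are the ones listed in the comment.
rrfs_hard_coded_settings() {
  export INPES=12
  export JNPES=12
  export WARM_START=.true.
}

rrfs_hard_coded_settings

# The preset values are gone: the run still uses a 12x12 layout
# and is still a warm start.
echo "INPES=$INPES JNPES=$JNPES WARM_START=$WARM_START"
echo "layout cells: $((INPES * JNPES))"
```

This is why a decomposition test needs its own tests/tests file: the different INPES and JNPES values have to replace the hard-coded ones rather than precede them.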

SamuelTrahanNOAA commented 1 year ago

I just retested hera.gnu and I can confirm the situation is unchanged. I'd like to know how @zach1221 ran the tests. This is not the first time someone has configured the RRFS tests incorrectly and falsely reported that the restart and decomp variants work. Is the tool "opnReqTest"? If so, I'll add an "if" statement to rrfs_warm_run.IN to abort the test if that tool is enabled.
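A hedged sketch of the kind of guard described above. OPNREQ_TEST is an assumed variable name for detecting the tool, and rrfs_guard is a hypothetical function; any actual change to rrfs_warm_run.IN may look quite different.

```shell
#!/bin/sh
# Hypothetical guard. OPNREQ_TEST is an ASSUMED name for a variable that
# is "true" when the automated tool is driving the test.
rrfs_guard() {
  if [ "${OPNREQ_TEST:-false}" = "true" ]; then
    echo "ERROR: RRFS tests hard-code INPES, JNPES and WARM_START;" >&2
    echo "       use a dedicated tests/tests file instead." >&2
    return 1
  fi
  return 0
}
```

Inside a test file the caller would typically follow this with something like rrfs_guard || exit 1, so that a misconfigured run stops instead of reporting a misleading pass.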

zach1221 commented 1 year ago

@SamuelTrahanNOAA I see. Well I guess I tested incorrectly. I was just running the tests sequentially out of rt.conf in tests/. Like, ./rt.sh -a nems -n rrfs_smoke_conus13km_hrrr_warm_debug_decomp intel or ./rt.sh -a nems rrfs_smoke_conus13km_hrrr_warm_restart, etc.

I'll try again with the steps you provided to reproduce. Thank you!

SamuelTrahanNOAA commented 1 year ago

I haven't tried that before.

SamuelTrahanNOAA commented 1 year ago

Use this:

COMPILE | 13 | intel | -DAPP=ATM -DCCPP_SUITES=FV3_RAP,FV3_RAP_sfcdiff,FV3_HRRR,FV3_HRRR_flake,FV3_RRFS_v1beta,FV3_RRFS_v1nssl -D32BIT=ON | | fv3 |

RUN | rrfs_smoke_conus13km_hrrr_warm                    |                            | baseline |
RUN | rrfs_smoke_conus13km_hrrr_warm_2threads           |                            |          |
RUN | rrfs_conus13km_hrrr_warm                          |                            | baseline |
RUN | rrfs_smoke_conus13km_radar_tten_warm              |                            | baseline |
RUN | rrfs_smoke_conus13km_hrrr_warm_decomp             |                            |          |
RUN | rrfs_smoke_conus13km_hrrr_warm_restart            |                            |          | rrfs_smoke_conus13km_hrrr_warm
RUN | rrfs_conus13km_hrrr_warm_restart_mismatch         |                            | baseline | rrfs_conus13km_hrrr_warm

zach1221 commented 1 year ago

@SamuelTrahanNOAA thanks, again. Let me try that now.

SamuelTrahanNOAA commented 1 year ago

My branch was not up-to-date with develop, so that test didn't check if the latest version works. It seems the regression test system has changed substantially. I'll have to check if it's even running those tests correctly.

SamuelTrahanNOAA commented 1 year ago

The 2threads test doesn't use 2 threads anymore, but the decomp test still changes the decomposition.

SamuelTrahanNOAA commented 1 year ago

The restart and decomp do not match the control, but they are executed correctly.

It looks like the 2threads is using ESMF to turn on threading, without providing the mandatory OMP_NUM_THREADS variable that sets the maximum number of threads available to ESMF. I will try correcting this and see if it still passes.

SamuelTrahanNOAA commented 1 year ago

The 2threads test still passes if I set OMP_NUM_THREADS (THRD) to 2.
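As a rough illustration of that setting, assuming THRD holds the per-task thread count and is copied into OMP_NUM_THREADS before the model starts. The default of 2 mirrors the value above; how the regression test templates actually wire THRD to OMP_NUM_THREADS is not shown in this thread and is an assumption here.

```shell
#!/bin/sh
# THRD is the thread count named above; falling back to 2 is an assumption
# made only so this fragment runs on its own.
THRD=${THRD:-2}

# Make the OpenMP thread limit explicit instead of leaving it unset.
export OMP_NUM_THREADS="$THRD"

echo "OMP_NUM_THREADS=$OMP_NUM_THREADS"
```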

SamuelTrahanNOAA commented 1 year ago

The debug_decomp test (rrfs_smoke_conus13km_hrrr_warm_debug_decomp_intel) also fails.