ufs-community / ufs-weather-model

UFS Weather Model
Other
129 stars 238 forks source link

Quartet of bug fixes for: c3 scheme, quilting restart with 32-bit physics, and string length mismatch in dycore (plus PR #1913, #1917, and #1926) #1893

Closed SamuelTrahanNOAA closed 8 months ago

SamuelTrahanNOAA commented 8 months ago

PR Author Checklist:

Description

There are three bug fixes, and associated changes to regression tests.

  1. The write component no longer calls recover_fields for restart files. (From @DusanJovic-NOAA.) Doing so causes sporadic crashes due to corrupted longitudes when 32-bit physics is used. This only happens for restart files, who don't need recover_fields anyway. Hence, this PR restricts that call to history files.
  2. Correct a string length mismatch in the dycore. This causes an abort when the gnu compiler is run with debug flags turned on. One of the checks enabled is to abort on standard violations of this type.
  3. Changes some subroutine array argument declarations so the c3 scheme can handle null pointers. This is required by the CCPP coding standards anyway, for that very reason. Without this change, the FV3_HRRR_c3 suite crashes with the gnu compiler.
  4. Initialize arrays in dycore. (From @DusanJovic-NOAA.) Two arrays in the dycore were not initialized. That caused crashes in the quilting restart when 32-bit physics is used.
  5. Removes an incorrect dependency of conus13km_2threads on conus13km_control in rt.conf.
  6. Reduces the forecast length of hrrr tests to 12 hours, which is the latest hour that is verified.
  7. Adds Rocoto support on Cheyenne and GAEA.
  8. Merged into this PR is #1917
  9. Also includes #1926
  10. And includes #1913, which is expected to be merged before this PR.

Linked Issues and Pull Requests

This issue still occurs on Cheyenne, but no other machines. Cheyenne is going to be retired soon, so I'll assume the issue is fixed until we see evidence to the contrary:

Subcomponent issues:

From #1917:

From #1926

Subcomponent Pull Requests

All are in fv3atm, or its subcomponents.

Blocking Dependencies

None.

Subcomponents involved:

Anticipated Changes

Input data

Regression Tests:

All FV3 tests.

Libraries

Code Managers Log - [ ] This PR is up-to-date with the top of all sub-component repositories except for those sub-components which are the subject of this PR. - [ ] Move new/updated input data on RDHPCS Hera and propagate input data changes to all supported systems. - [ ] N/A ### Testing Log: - RDHPCS - [x] Hera - [x] Orion - [x] Jet - [x] Gaea - [ ] Cheyenne - WCOSS2 - [ ] Dogwood/Cactus - [x] Acorn - CI - [ ] Completed - opnReqTest - [ ] N/A - [ ] Log attached to comment
SamuelTrahanNOAA commented 8 months ago

@jkbk2004 @BrianCurtis-NOAA - We discussed at today's meeting that I should merge this with #1917. Could you please confirm that you want me to merge #1917 into this PR?

grantfirl commented 8 months ago

@jkbk2004 @BrianCurtis-NOAA - We discussed at today's meeting that I should merge this with #1917. Could you please confirm that you want me to merge #1917 into this PR?

I know that you're not asking me, but I think that this is fine. I've approved https://github.com/ufs-community/ccpp-physics/pull/106 so both CCPP PRs are approved at this point in case you were waiting on that.

SamuelTrahanNOAA commented 8 months ago

I've updated to the head of develop and everything still works.

Now I'm testing on top of #1903, which the code managers are preparing to merge now.

jkbk2004 commented 8 months ago

@jkbk2004 @BrianCurtis-NOAA - We discussed at today's meeting that I should merge this with #1917. Could you please confirm that you want me to merge #1917 into this PR?

I know that you're not asking me, but I think that this is fine. I've approved ufs-community/ccpp-physics#106 so both CCPP PRs are approved at this point in case you were waiting on that.

I agree with @grantfirl as well.

SamuelTrahanNOAA commented 8 months ago

Once my test on top of 1903 has completed, I'll merge 1917 and test again.

SamuelTrahanNOAA commented 8 months ago

I've merged #1903 to this PR and everything still works. The latest hera test log is in the repo.

On my working copy, I've merged #1917 to this branch. Tests are about 60% complete.

SamuelTrahanNOAA commented 8 months ago

Now, finally, I've finished testing #1917 and #1903 in this PR. Both are merged into this one, and everything works as expected.

However, I noticed some unnecessary task dependencies in the conus13km regression tests. I've documented them at #1921. I'll fix those in my next PR unless someone else gets to them first.

SamuelTrahanNOAA commented 8 months ago

I've updated this branch and its submodules to the head of the authoritative branch. It is ready for testing.

jkbk2004 commented 8 months ago

@SamuelTrahanNOAA #1903 was merged. Can you sync up to update fv3 hash?

SamuelTrahanNOAA commented 8 months ago

I already synced up the hashes.

SamuelTrahanNOAA commented 8 months ago

I synced up the hashes of each repo immediately after its pull request was merged. The four branches are up-to-date with each PR's authoritative branch.

SamuelTrahanNOAA commented 8 months ago

@jkbk2004 - Please do not generate a new baseline for all tests. The existing baseline files should not change.

I'm just adding some files and removing some files.

  1. Remove tests hrrr_gf and hrrr_c3
  2. Add new tests hrrr_gf_debug and hrrr_c3_debug
  3. Add restart files to existing test conus13km_debug

To properly test this PR, we must confirm no other test changes results.

jkbk2004 commented 8 months ago

@jkbk2004 - Please do not generate a new baseline for all tests. The existing baseline files should not change.

I'm just adding some files and removing some files.

  1. Remove tests hrrr_gf and hrrr_c3
  2. Add new tests hrrr_gf_debug and hrrr_c3_debug
  3. Add restart files to existing test conus13km_debug

To properly test this PR, we must confirm no other test changes results.

@SamuelTrahanNOAA Thanks for the note!

FernandoAndrade-NOAA commented 8 months ago

some jenkins ORTs failed due to memory allocation exceeded errors, rerunning manually on hera

SamuelTrahanNOAA commented 8 months ago

some jenkins ORTs failed due to memory allocation exceeded errors, rerunning manually on hera

None of these changes should increase memory usage. I was able to run the full regression test suite on Hera. Your problem might be unrelated to these changes.

FernandoAndrade-NOAA commented 8 months ago

ORTs have passed with manual runs on Hera, however I do not have permission to push to this PR branch @SamuelTrahanNOAA. If you could add me or if you'd prefer using your own local clone on Hera to push: /scratch2/NAGAPE/epic/Fernando.Andrade-maldonado/regression-tests/ufswm/1893/cpld_p8_gnu/ufs-weather-model/tests/logs

OpnReqTests_control_p8_hera.log
OpnReqTests_cpld_control_nowave_noaero_p8_hera.log
OpnReqTests_regional_control_hera.log
SamuelTrahanNOAA commented 8 months ago

however I do not have permission to push to this PR branch

@FernandoAndrade-NOAA - I just invited you as a collaborator to my repository. If you accept, you should be able to push the logs.

jkbk2004 commented 8 months ago

Looks like reproducibility issue on hercules even for some cases that are not related to this PR: rap_control_gnu, rap_sfcdiff_restart_gnu, etc.

baseline dir = /work/noaa/epic/hercules/UFS-WM_RT/NEMSfv3gfs/develop-20230922/rap_control_gnu
working dir  = /work/noaa/epic-ps/jongkim/hercules-1893/stmp/jongkim/FV3_RT/rt_3669730/rap_control_gnu
Checking test 179 rap_control_gnu results ....
 Comparing sfcf000.nc .........OK
 Comparing sfcf021.nc .........OK
 Comparing sfcf024.nc .........OK
 Comparing atmf000.nc .........OK
 Comparing atmf021.nc .........OK
 Comparing atmf024.nc .........OK
 Comparing GFSFLX.GrbF00 .........OK
 Comparing GFSFLX.GrbF21 .........NOT OK
 Comparing GFSFLX.GrbF24 .........OK
 Comparing GFSPRS.GrbF00 .........OK
 Comparing GFSPRS.GrbF21 .........OK
 Comparing GFSPRS.GrbF24 .........NOT OK
SamuelTrahanNOAA commented 8 months ago

Looks like reproducibility issue on hercules even for some cases that are not related to this PR: rap_control_gnu, rap_sfcdiff_restart_gnu, etc.

It's interesting that the trouble is only in the GRIB files. Perhaps there is a bug in the connection to the post?

SamuelTrahanNOAA commented 8 months ago

@jkbk2004 - Does the model self-reproduce? By that, I mean:

  1. Generate new baselines for those tests.
  2. Validate against the baseline a few times.

Does step 2 succeed every time?

EDIT: You need to disable job resubmission for this test. Otherwise, the job will "succeed" because the second run passed, even though the first run failed.

SamuelTrahanNOAA commented 8 months ago

@jkbk2004 - Please continue testing on other platforms. If we see more platforms don't reproduce, then something may be wrong with the PR. If it's only Hercules, then we may have a system-specific issue.

Pinging @DusanJovic-NOAA for comment.

SamuelTrahanNOAA commented 8 months ago

I re-read all of the changes and I see a pair that might change the results for any test. There were two uninitialized variables in the dycore that are now initialized to zero.

That's the first two lines changed in here:

https://github.com/NOAA-GFDL/GFDL_atmos_cubed_sphere/pull/280/files#diff-3006d128a8b46ac1a65f4aee1f414a84c5603de8c735570d548955f410983cf6R135

Again, pinging @DusanJovic-NOAA for comment. He made that change.

DusanJovic-NOAA commented 8 months ago

I re-read all of the changes and I see a pair that might change the results for any test. There were two uninitialized variables in the dycore that are now initialized to zero.

That's the first two lines changed in here:

https://github.com/NOAA-GFDL/GFDL_atmos_cubed_sphere/pull/280/files#diff-3006d128a8b46ac1a65f4aee1f414a84c5603de8c735570d548955f410983cf6R135

Again, pinging @DusanJovic-NOAA for comment. He made that change.

The change in dycore is made in module that writes out restart files, I do not see how adding initialization of two local arrays in that module can change the inline post grib files.

BrianCurtis-NOAA commented 8 months ago

@FernandoAndrade-NOAA Do we need to update the bl_date? I was under the assumption that only new baselines are needed? No change anticipated to all other baselines.

FernandoAndrade-NOAA commented 8 months ago

@FernandoAndrade-NOAA Do we need to update the bl_date? I was under the assumption that only new baselines are needed? No change anticipated to all other baselines.

The initial run showed changed results for the tests listed in the PR description so I only recreated for those and copied over the 0922 BLs to 0926, the run after that was successful

SamuelTrahanNOAA commented 8 months ago

The change in dycore is made in module that writes out restart files, I do not see how adding initialization of two local arrays in that module can change the inline post grib files.

Then I'm out of ideas for how the hercules results could have changed.

jkbk2004 commented 8 months ago

@SamuelTrahanNOAA it will be good to double check when Hercules come back tomorrow. #1910 is no baseline change. It will be interesting to test further Hercules thru Phil's PR. So can we hold this PR a bit and move on to #1910 today to save time a bit?

SamuelTrahanNOAA commented 8 months ago

So can we hold this PR a bit and move on to https://github.com/ufs-community/ufs-weather-model/pull/1910 today to save time a bit?

I'll ask @hu5970 whether that's okay. The bug fixes are for the RRFS parallel. Your plan will probably delay this PR past the weekend. That means it may be delayed for weeks if there's another long government shutdown.

hu5970 commented 8 months ago

@Sam, We do want to test the latest model code this week if possible. It would be helpful if this PR can get in. But we are not using the quilting restart yet (will try when the code is ready). If the quilting restart for 32-bit CCPP is the only function from this PR that will impact RRFS parallel. This PR can be delayed a little. I hope this delay will not cause too much trouble in the future merge because of shutdown.

jkbk2004 commented 8 months ago

We can make another quick decision tomorrow when Hercules come back online. I mean to skip Hercules with creating issue or not.

jkbk2004 commented 8 months ago

regional_atmaq_intel was crashing with error around

file: FV3/io/module_write_netcdf.F90 line:          679 NetCDF: HDF error

@SamuelTrahanNOAA @DusanJovic-NOAA experiment path is at /work/noaa/epic-ps/jongkim/hercules-1893/stmp/jongkim/FV3_RT/rt_3669730/regional_atmaq_intel/err

SamuelTrahanNOAA commented 8 months ago

@jkbk2004 - I can't access this directory:

/work/noaa/epic-ps/jongkim/hercules-1893

The world access permissions are set to 0

SamuelTrahanNOAA commented 8 months ago

I can't access the baselines on Hercules either:

change_dir "/work/noaa/epic/hercules/UFS-WM_RT/NEMSfv3gfs/develop-20230922" failed: Permission denied (13)

jkbk2004 commented 8 months ago

@SamuelTrahanNOAA I reset the permission. Can you try?

SamuelTrahanNOAA commented 8 months ago

Still can't access it:

[strahan@hercules-login-1 tests]$ ls /work/noaa/epic/hercules/UFS-WM_RT/NEMSfv3gfs/develop-20230922
ls: cannot open directory '/work/noaa/epic/hercules/UFS-WM_RT/NEMSfv3gfs/develop-20230922': Permission denied
[strahan@hercules-login-1 tests]$ ls /work/noaa/epic-ps/jongkim/hercules-1893/stmp/jongkim/FV3_RT/rt_3669730/regional_atmaq_intel/err
ls: cannot access '/work/noaa/epic-ps/jongkim/hercules-1893/stmp/jongkim/FV3_RT/rt_3669730/regional_atmaq_intel/err': Permission denied
SamuelTrahanNOAA commented 8 months ago

There's definitely some filesystem issues on Orion right now:

+ mv /work/noaa/stmp/strahan/stmp/strahan/FV3_RT/rt_361337/compile_s2sw_debug_intel/build_fv3_s2sw_debug_intel/ufs_model /work/noaa/wrfruc/strahan/qr-c3-stringlen/tests/fv3_s2sw_debug_intel.exe
mv: cannot move '/work/noaa/stmp/strahan/stmp/strahan/FV3_RT/rt_361337/compile_s2sw_debug_intel/build_fv3_s2sw_debug_intel/ufs_model' to a subdirectory of itself, '/work/noaa/wrfruc/strahan/qr-c3-stringlen/tests/fv3_s2sw_debug_intel.exe'

The error message is nonsense. Executable ufs_model isn't a directory, and the destination is not its subdirectory.

SamuelTrahanNOAA commented 8 months ago

I'm generating my own baseline on Hercules off of the head of develop. I'll use that for testing.

On Orion, I'm unable to run the regression tests due to filesystem issues.

jkbk2004 commented 8 months ago

I'm generating my own baseline on Hercules off of the head of develop. I'll use that for testing.

On Orion, I'm unable to run the regression tests due to filesystem issues.

@SamuelTrahanNOAA I think tests run ok on orion. But only issue on hercules. Do you want me to set a hercules single test (regional_atmaq_intel) on somewhere stmp? That one crashes. I wonder it may lead to a clue.

SamuelTrahanNOAA commented 8 months ago

I think tests run ok on orion. But only issue on hercules. Do you want me to set a hercules single test (regional_atmaq_intel) on somewhere stmp? That one crashes. I wonder it may lead to a clue.

That test passed for me:

baseline dir = /work/noaa/wrfruc/strahan/RT-herc/NEMSfv3gfs/develop-20230926/regional_atmaq_intel
working dir  = /work2/noaa/stmp/strahan/stmp/strahan/FV3_RT/rt_537937/regional_atmaq_intel
Checking test 171 regional_atmaq_intel results ....
 Comparing sfcf000.nc .........OK
 Comparing sfcf003.nc .........OK
 Comparing sfcf006.nc .........OK
 Comparing atmf000.nc .........OK
 Comparing atmf003.nc .........OK
 Comparing atmf006.nc .........OK
 Comparing RESTART/20190801.180000.coupler.res .........OK
 Comparing RESTART/20190801.180000.fv_core.res.nc .........OK
 Comparing RESTART/20190801.180000.fv_core.res.tile1.nc .........OK
 Comparing RESTART/20190801.180000.fv_srf_wnd.res.tile1.nc .........OK
 Comparing RESTART/20190801.180000.fv_tracer.res.tile1.nc .........OK
 Comparing RESTART/20190801.180000.phy_data.nc .........OK
 Comparing RESTART/20190801.180000.sfc_data.nc .........OK

  0: The total amount of wall time                        = 541.305038
  0: The maximum resident set size (KB)                   = 5501312

Test 171 regional_atmaq_intel PASS

My baseline is from develop, hash f7a94ce3

SamuelTrahanNOAA commented 8 months ago

All tests passed on hera, jet, and hercules. I've pushed the logs. I'm testing orion now.

GAEA is refusing to start ecflow, so I'm adding Rocoto support.

SamuelTrahanNOAA commented 8 months ago

I won't be able to run regression tests on GAEA because my account cannot access the eslogin queue.

It looks like you stopped supporting Cheyenne about a month ago. Is that right? I can't run the regression tests off of the head of develop.

SamuelTrahanNOAA commented 8 months ago

Now that hercules, jet, orion, and hera passed, I'm proceeding to WCOSS. I'll start Acorn and Cactus tests momentarily.

I've opened a ticket on my inability to access GAEA's eslogin queue.

jkbk2004 commented 8 months ago

Yes, cheyenne account is in overspent situation. I was trying hercules again for regional_aqm and rap_control. It is running ok. Sounds like I was trapped before the hercules maintenance. @SamuelTrahanNOAA we are about to merge #1910. Next one is Grant's high priority PR #1913. Risk is the delay during the government shutdown. GFDL side, fedral code managers may not be available during the shutdown. RRFS experiment is not exercising quilt restart yet. So, chance to delay this PR in case the gov shutdown. Would that be ok? But note that EMC and EPIC code managers will likely to work during the shutdown.

SamuelTrahanNOAA commented 8 months ago

@jkbk2004 - Please proceed with your plan. PR #1913 is higher priority than this one.

I'm able to run on Cheyenne now. It failed before because my clone was incomplete.

By tomorrow, I should be able to confirm that every platform works except GAEA. I cannot run the "compile" jobs on GAEA because my account can't access the eslogin queue.

SamuelTrahanNOAA commented 8 months ago

The Cactus tests passed. My project doesn't have enough space to run an Acorn regression test.

On Cheyenne, the conus13km_debug_qr are getting an "hdf error" for both compilers. If they keep doing that, I'll disable the test on Cheyenne.

jkbk2004 commented 8 months ago

@SamuelTrahanNOAA Cheyenne is practically in decommission status due to account issue. I think we will start using Derecho soon. So, today's decision point is about combining the cubed sphere PRs (https://github.com/NOAA-GFDL/GFDL_atmos_cubed_sphere/pull/276, https://github.com/NOAA-GFDL/GFDL_atmos_cubed_sphere/pull/280, and https://github.com/NOAA-GFDL/GFDL_atmos_cubed_sphere/pull/285). I understand you will be ok to combine 280 and 285. What about 276? As @junwang-noaa confirms only attribute change with 276, all those three PRs can be combined and merged today on GFDL side first before possible government shutdown. In that case, we just need to regenerate whole baseline when we pick up this PR next week. Can we agree?

SamuelTrahanNOAA commented 8 months ago

I already know everything works on six machines, so I guess it's okay to add answer-changing PRs.

I'll update to develop, add the dycore PR 276, and merge #1913. Then I'll test everywhere again.

SamuelTrahanNOAA commented 8 months ago

@jkbk2004 - Would you be okay with adding #1926? The RRFS developers need it to reduce the size of their surface restart files.

jkbk2004 commented 8 months ago

I agree with combing #1926. I will send out email to GFDL side to get cubed sphere PRs 276/280/285 merged today.

SamuelTrahanNOAA commented 8 months ago

I have now:

  1. Updated to develop
  2. Merged #1926
  3. Merged https://github.com/NOAA-GFDL/GFDL_atmos_cubed_sphere/pull/276
  4. Merged https://github.com/NOAA-GFDL/GFDL_atmos_cubed_sphere/pull/285

Next, I'll:

  1. Run a few dozen tests of this batch of PRs on Cheyenne.
  2. Run #1913's tests that are expected to change on Cheyenne.
  3. Once those work, merge #1913 into this PR.
  4. Test the final combination on all machines I can access.