Closed: SamuelTrahanNOAA closed this pull request 8 months ago.
@jkbk2004 @BrianCurtis-NOAA - We discussed at today's meeting that I should merge this with #1917. Could you please confirm that you want me to merge #1917 into this PR?
I know that you're not asking me, but I think that this is fine. I've approved https://github.com/ufs-community/ccpp-physics/pull/106 so both CCPP PRs are approved at this point in case you were waiting on that.
I've updated to the head of develop and everything still works.
Now I'm testing on top of #1903, which the code managers are preparing to merge now.
I agree with @grantfirl as well.
Once my test on top of 1903 has completed, I'll merge 1917 and test again.
I've merged #1903 to this PR and everything still works. The latest hera test log is in the repo.
On my working copy, I've merged #1917 to this branch. Tests are about 60% complete.
Now, finally, I've finished testing #1917 and #1903 in this PR. Both are merged into this one, and everything works as expected.
However, I noticed some unnecessary task dependencies in the conus13km regression tests. I've documented them at #1921. I'll fix those in my next PR unless someone else gets to them first.
I've updated this branch and its submodules to the head of the authoritative branch. It is ready for testing.
@SamuelTrahanNOAA #1903 was merged. Can you sync up to update fv3 hash?
I already synced up the hashes.
I synced up the hashes of each repo immediately after its pull request was merged. The four branches are up-to-date with each PR's authoritative branch.
@jkbk2004 - Please do not generate a new baseline for all tests. The existing baseline files should not change.
I'm just adding some files and removing some files.
- Remove tests hrrr_gf and hrrr_c3
- Add new tests hrrr_gf_debug and hrrr_c3_debug
- Add restart files to existing test conus13km_debug
To properly test this PR, we must confirm that no other tests' results change.
@SamuelTrahanNOAA Thanks for the note!
Some Jenkins ORTs failed due to memory-allocation-exceeded errors; rerunning manually on Hera.
None of these changes should increase memory usage. I was able to run the full regression test suite on Hera. Your problem might be unrelated to these changes.
ORTs have passed with manual runs on Hera; however, I do not have permission to push to this PR branch, @SamuelTrahanNOAA.
If you could add me, or if you'd prefer, you can push the logs from your own local clone on Hera:
/scratch2/NAGAPE/epic/Fernando.Andrade-maldonado/regression-tests/ufswm/1893/cpld_p8_gnu/ufs-weather-model/tests/logs
OpnReqTests_control_p8_hera.log
OpnReqTests_cpld_control_nowave_noaero_p8_hera.log
OpnReqTests_regional_control_hera.log
@FernandoAndrade-NOAA - I just invited you as a collaborator to my repository. If you accept, you should be able to push the logs.
Looks like a reproducibility issue on Hercules, even for some cases that are not related to this PR: rap_control_gnu, rap_sfcdiff_restart_gnu, etc.
baseline dir = /work/noaa/epic/hercules/UFS-WM_RT/NEMSfv3gfs/develop-20230922/rap_control_gnu
working dir = /work/noaa/epic-ps/jongkim/hercules-1893/stmp/jongkim/FV3_RT/rt_3669730/rap_control_gnu
Checking test 179 rap_control_gnu results ....
Comparing sfcf000.nc .........OK
Comparing sfcf021.nc .........OK
Comparing sfcf024.nc .........OK
Comparing atmf000.nc .........OK
Comparing atmf021.nc .........OK
Comparing atmf024.nc .........OK
Comparing GFSFLX.GrbF00 .........OK
Comparing GFSFLX.GrbF21 .........NOT OK
Comparing GFSFLX.GrbF24 .........OK
Comparing GFSPRS.GrbF00 .........OK
Comparing GFSPRS.GrbF21 .........OK
Comparing GFSPRS.GrbF24 .........NOT OK
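The OK / NOT OK lines above come from a bitwise file comparison between the baseline and working directories. A minimal shell sketch of that kind of check, using placeholder files rather than real model output (the `cmp` loop is an illustration, not the actual RT harness code):

```shell
# Hypothetical sketch of a bitwise baseline-vs-working-dir comparison.
# File contents are placeholders, not real model output.
d=$(mktemp -d)
mkdir "$d/baseline" "$d/work"
printf 'abc' > "$d/baseline/sfcf000.nc";    printf 'abc' > "$d/work/sfcf000.nc"
printf 'abc' > "$d/baseline/GFSFLX.GrbF21"; printf 'abd' > "$d/work/GFSFLX.GrbF21"
for f in sfcf000.nc GFSFLX.GrbF21; do
  if cmp -s "$d/baseline/$f" "$d/work/$f"; then
    echo "Comparing $f .........OK"
  else
    echo "Comparing $f .........NOT OK"
  fi
done
rm -rf "$d"
```

In this toy setup only the GRIB file differs, matching the pattern in the log above, where the NetCDF output is bitwise identical but some GRIB output is not.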
It's interesting that the trouble is only in the GRIB files. Perhaps there is a bug in the connection to the post?
@jkbk2004 - Does the model self-reproduce? By that, I mean:
Does step 2 succeed every time?
EDIT: You need to disable job resubmission for this test. Otherwise, the job will "succeed" because the second run passed, even though the first run failed.
@jkbk2004 - Please continue testing on other platforms. If we see more platforms don't reproduce, then something may be wrong with the PR. If it's only Hercules, then we may have a system-specific issue.
Pinging @DusanJovic-NOAA for comment.
I re-read all of the changes, and I see only one pair that might change the results of any test: two uninitialized variables in the dycore are now initialized to zero.
That's the first two lines changed in here:
Again, pinging @DusanJovic-NOAA for comment. He made that change.
The dycore change is in a module that writes out restart files; I do not see how adding initialization of two local arrays in that module could change the inline post GRIB files.
@FernandoAndrade-NOAA Do we need to update the bl_date? I was under the assumption that only new baselines are needed? No change anticipated to all other baselines.
The initial run showed changed results only for the tests listed in the PR description, so I recreated baselines for those alone and copied the 0922 baselines over to 0926. The run after that was successful.
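For illustration, a shell sketch of that baseline-reuse approach, with placeholder directory and test names (the real baseline layout on each machine may differ):

```shell
# Hypothetical sketch: copy the unchanged 0922 baselines into a new 0926
# directory, then regenerate only the tests whose results changed.
root=$(mktemp -d)
mkdir -p "$root/develop-20230922/rap_control_gnu"
echo "old baseline data" > "$root/develop-20230922/rap_control_gnu/sfcf000.nc"
# 1. Copy everything that did not change:
cp -r "$root/develop-20230922" "$root/develop-20230926"
# 2. Regenerate only the changed/new tests (placeholder test name):
mkdir -p "$root/develop-20230926/hrrr_gf_debug"
echo "new baseline data" > "$root/develop-20230926/hrrr_gf_debug/sfcf000.nc"
ls "$root/develop-20230926"
rm -rf "$root"
```

The unchanged tests keep their old baseline files bit-for-bit, so they still verify against the new date directory.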
Then I'm out of ideas for how the hercules results could have changed.
@SamuelTrahanNOAA it will be good to double-check when Hercules comes back tomorrow. #1910 has no baseline changes, and it will be interesting to test Hercules further through Phil's PR. So can we hold this PR a bit and move on to #1910 today to save some time?
I'll ask @hu5970 whether that's okay. The bug fixes are for the RRFS parallel. Your plan will probably delay this PR past the weekend. That means it may be delayed for weeks if there's another long government shutdown.
@Sam, we do want to test the latest model code this week if possible, so it would be helpful if this PR can get in. But we are not using the quilting restart yet (we will try it when the code is ready). If the quilting restart for 32-bit CCPP is the only function from this PR that will impact the RRFS parallel, this PR can be delayed a little. I hope this delay will not cause too much trouble in the future merge because of the shutdown.
We can make another quick decision tomorrow when Hercules comes back online. I mean whether to skip Hercules (and create an issue) or not.
regional_atmaq_intel was crashing with an error around:
file: FV3/io/module_write_netcdf.F90 line: 679 NetCDF: HDF error
@SamuelTrahanNOAA @DusanJovic-NOAA experiment path is at /work/noaa/epic-ps/jongkim/hercules-1893/stmp/jongkim/FV3_RT/rt_3669730/regional_atmaq_intel/err
@jkbk2004 - I can't access this directory:
/work/noaa/epic-ps/jongkim/hercules-1893
The world access permissions are set to 0
I can't access the baselines on Hercules either:
change_dir "/work/noaa/epic/hercules/UFS-WM_RT/NEMSfv3gfs/develop-20230922" failed: Permission denied (13)
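For reference, a shell sketch of checking and opening up the world-access bits on a directory, using a temporary directory rather than the actual /work paths:

```shell
# Demonstrate the "world permissions set to 0" condition and the fix.
dir=$(mktemp -d)
chmod 700 "$dir"          # owner-only: other users get "Permission denied"
stat -c '%a' "$dir"       # prints 700
chmod o+rx "$dir"         # grant others read + traverse access
stat -c '%a' "$dir"       # prints 705
rmdir "$dir"
```

Note that for a deep path like the one above, every directory along the path needs the execute (traverse) bit for other users, not just the leaf directory.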
@SamuelTrahanNOAA I reset the permission. Can you try?
Still can't access it:
[strahan@hercules-login-1 tests]$ ls /work/noaa/epic/hercules/UFS-WM_RT/NEMSfv3gfs/develop-20230922
ls: cannot open directory '/work/noaa/epic/hercules/UFS-WM_RT/NEMSfv3gfs/develop-20230922': Permission denied
[strahan@hercules-login-1 tests]$ ls /work/noaa/epic-ps/jongkim/hercules-1893/stmp/jongkim/FV3_RT/rt_3669730/regional_atmaq_intel/err
ls: cannot access '/work/noaa/epic-ps/jongkim/hercules-1893/stmp/jongkim/FV3_RT/rt_3669730/regional_atmaq_intel/err': Permission denied
There are definitely some filesystem issues on Orion right now:
+ mv /work/noaa/stmp/strahan/stmp/strahan/FV3_RT/rt_361337/compile_s2sw_debug_intel/build_fv3_s2sw_debug_intel/ufs_model /work/noaa/wrfruc/strahan/qr-c3-stringlen/tests/fv3_s2sw_debug_intel.exe
mv: cannot move '/work/noaa/stmp/strahan/stmp/strahan/FV3_RT/rt_361337/compile_s2sw_debug_intel/build_fv3_s2sw_debug_intel/ufs_model' to a subdirectory of itself, '/work/noaa/wrfruc/strahan/qr-c3-stringlen/tests/fv3_s2sw_debug_intel.exe'
The error message is nonsense. The executable ufs_model isn't a directory, and the destination is not a subdirectory of it.
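Assuming GNU coreutils `mv` (as on these clusters), that diagnostic is only legitimate when a directory really is moved into its own subtree; a regular file can never trigger it. A sketch with throwaway paths:

```shell
d=$(mktemp -d)
# Legitimate case: moving a directory into its own subtree fails,
# and GNU mv reports "cannot move ... to a subdirectory of itself".
mkdir -p "$d/parent/child"
mv "$d/parent" "$d/parent/child" 2>&1 | grep -c 'subdirectory of itself'
# A regular file can never hit that case; this mv succeeds.
touch "$d/ufs_model"
mv "$d/ufs_model" "$d/ufs_model.exe" && echo "regular-file mv: OK"
rm -rf "$d"
```

Seeing that message for a plain-file move points at the filesystem misbehaving, not at the command.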
I'm generating my own baseline on Hercules off of the head of develop. I'll use that for testing.
On Orion, I'm unable to run the regression tests due to filesystem issues.
@SamuelTrahanNOAA I think the tests run OK on Orion; the only issue is on Hercules. Do you want me to set up a single Hercules test (regional_atmaq_intel) somewhere on stmp? That one crashes. I wonder if it may lead to a clue.
That test passed for me:
baseline dir = /work/noaa/wrfruc/strahan/RT-herc/NEMSfv3gfs/develop-20230926/regional_atmaq_intel
working dir = /work2/noaa/stmp/strahan/stmp/strahan/FV3_RT/rt_537937/regional_atmaq_intel
Checking test 171 regional_atmaq_intel results ....
Comparing sfcf000.nc .........OK
Comparing sfcf003.nc .........OK
Comparing sfcf006.nc .........OK
Comparing atmf000.nc .........OK
Comparing atmf003.nc .........OK
Comparing atmf006.nc .........OK
Comparing RESTART/20190801.180000.coupler.res .........OK
Comparing RESTART/20190801.180000.fv_core.res.nc .........OK
Comparing RESTART/20190801.180000.fv_core.res.tile1.nc .........OK
Comparing RESTART/20190801.180000.fv_srf_wnd.res.tile1.nc .........OK
Comparing RESTART/20190801.180000.fv_tracer.res.tile1.nc .........OK
Comparing RESTART/20190801.180000.phy_data.nc .........OK
Comparing RESTART/20190801.180000.sfc_data.nc .........OK
0: The total amount of wall time = 541.305038
0: The maximum resident set size (KB) = 5501312
Test 171 regional_atmaq_intel PASS
My baseline is from develop, hash f7a94ce3
All tests passed on hera, jet, and hercules. I've pushed the logs. I'm testing orion now.
GAEA is refusing to start ecflow, so I'm adding Rocoto support.
I won't be able to run regression tests on GAEA because my account cannot access the eslogin queue.
It looks like you stopped supporting Cheyenne about a month ago. Is that right? I can't run the regression tests off of the head of develop.
Now that hercules, jet, orion, and hera passed, I'm proceeding to WCOSS. I'll start Acorn and Cactus tests momentarily.
I've opened a ticket on my inability to access GAEA's eslogin queue.
Yes, the Cheyenne account is in an overspent situation. I was trying Hercules again for regional_atmaq and rap_control; it is running OK. It sounds like I was caught before the Hercules maintenance. @SamuelTrahanNOAA we are about to merge #1910. The next one is Grant's high-priority PR #1913. The risk is delay during the government shutdown: on the GFDL side, federal code managers may not be available during the shutdown. The RRFS experiment is not exercising the quilt restart yet, so there is a chance we delay this PR in case of a government shutdown. Would that be ok? Note that EMC and EPIC code managers will likely work during the shutdown.
@jkbk2004 - Please proceed with your plan. PR #1913 is higher priority than this one.
I'm able to run on Cheyenne now. It failed before because my clone was incomplete.
By tomorrow, I should be able to confirm that every platform works except GAEA. I cannot run the "compile" jobs on GAEA because my account can't access the eslogin queue.
The Cactus tests passed. My project doesn't have enough space to run an Acorn regression test.
On Cheyenne, the conus13km_debug_qr tests are getting an "hdf error" with both compilers. If they keep doing that, I'll disable the test on Cheyenne.
@SamuelTrahanNOAA Cheyenne is practically in decommissioned status due to the account issue. I think we will start using Derecho soon. So, today's decision point is about combining the cubed-sphere PRs (https://github.com/NOAA-GFDL/GFDL_atmos_cubed_sphere/pull/276, https://github.com/NOAA-GFDL/GFDL_atmos_cubed_sphere/pull/280, and https://github.com/NOAA-GFDL/GFDL_atmos_cubed_sphere/pull/285). I understand you are OK with combining 280 and 285. What about 276? As @junwang-noaa confirms only an attribute change with 276, all three PRs can be combined and merged today on the GFDL side first, before a possible government shutdown. In that case, we just need to regenerate the whole baseline when we pick up this PR next week. Can we agree?
I already know everything works on six machines, so I guess it's okay to add answer-changing PRs.
I'll update to develop, add the dycore PR 276, and merge #1913. Then I'll test everywhere again.
@jkbk2004 - Would you be okay with adding #1926? The RRFS developers need it to reduce the size of their surface restart files.
I agree with combining #1926. I will send out an email to the GFDL side to get cubed-sphere PRs 276/280/285 merged today.
I have now:
Next, I'll:
PR Author Checklist:
Description
There are three bug fixes and associated changes to the regression tests.
Linked Issues and Pull Requests
This issue still occurs on Cheyenne, but on no other machines. Cheyenne will be retired soon, so I'll assume the issue is fixed until we see evidence to the contrary:
Subcomponent issues:
From #1917:
From #1926
Subcomponent Pull Requests
All are in fv3atm, or its subcomponents.
Blocking Dependencies
None.
Subcomponents involved:
Anticipated Changes
Input data
Regression Tests:
All FV3 tests.
Libraries
Code Managers Log
- [ ] This PR is up-to-date with the top of all sub-component repositories except for those sub-components which are the subject of this PR.
- [ ] Move new/updated input data on RDHPCS Hera and propagate input data changes to all supported systems.
- [ ] N/A

### Testing Log:

- RDHPCS
  - [x] Hera
  - [x] Orion
  - [x] Jet
  - [x] Gaea
  - [ ] Cheyenne
- WCOSS2
  - [ ] Dogwood/Cactus
  - [x] Acorn
- CI
  - [ ] Completed
- opnReqTest
  - [ ] N/A
  - [ ] Log attached to comment