Closed WenMeng-NOAA closed 3 weeks ago
@WenMeng-NOAA @FernandoAndrade-NOAA can we schedule time to work on this PR this week?
@jkbk2004 That would be great! Please let me know any actions from my end.
It looks like test_changes.list was overwritten during the sync. I'm recommitting it, as those changes were confirmed to be expected from this update.
@zach1221 @BrianCurtis-NOAA FYI getting started on testing this PR
@FernandoAndrade-NOAA Yes, during my syncing process this morning, I was not sure which version of test_changes.list was needed.
No worries! We should be good to go now, thanks.
The rap_clm_lake_debug_intel test failed during creation on Hera, Gaea, and Jet due to timeouts. The err file is unusually large on all machines; please note the line numbers in the excerpt below. @jkbk2004 FYI:
Hera: /scratch1/NCEPDEV/stmp2/Fernando.Andrade-maldonado/FV3_RT/rt_1277667/rap_clm_lake_debug_intel/err
37280183 0: slurmstepd: error: *** STEP 58007894.0 ON h10c53 CANCELLED AT 2024-04-05T17:38:55 DUE TO TIME LIMIT ***
37280184 144: fv3.exe 000000000094783A Unknown Unknown Unknown
37280185 144: fv3.exe 00000000011E4A59 Unknown Unknown Unknown
37280186 144: fv3.exe 0000000000A98E8A Unknown Unknown Unknown
37280187 144: fv3.exe 0000000000967070 Unknown Unknown Unknown
37280188 144: fv3.exe 0000000000C96201 Unknown Unknown Unknown
37280189 144: fv3.exe 000000000042E9EB MAIN__ 406 UFS.F90
37280190 144: fv3.exe 000000000042AEE2 Unknown Unknown Unknown
37280191 144: libc-2.28.so 0000153E58C0AD85 __libc_start_main Unknown Unknown
37280192 144: fv3.exe 000000000042ADEE Unknown Unknown Unknown
37280193 149: forrtl: warning (406): fort: (1): In call to RSEARCH1, an array temporary was created for argument #4
Jet: /lfs4/HFIP/h-nems/Fernando.Andrade-maldonado/RT_RUNDIRS/Fernando.Andrade-maldonado/FV3_RT/rt_715069/rap_clm_lake_debug_intel/err
Gaea: /gpfs/f5/epic/scratch/Fernando.Andrade-maldonado/RT_RUNDIRS/Fernando.Andrade-maldonado/FV3_RT/rt_148379/rap_clm_lake_debug_intel/err
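For an err file of this size, it is usually easier to summarize the file than to open it. The following is a minimal sketch, not part of rt.sh: the helper names `err_line_count` and `top_messages` are hypothetical, and it assumes each line of the file begins with an MPI rank prefix such as `144: `, as in the excerpt above.

```shell
#!/usr/bin/env bash
# Hypothetical helpers for inspecting a very large err file.

# Print the number of lines in the file.
err_line_count() {
  awk 'END { print NR }' "$1"
}

# Print the N most frequent messages (default 5), ignoring the
# leading "<rank>: " prefix that is assumed to start each line.
top_messages() {
  local file="$1" n="${2:-5}"
  sed -E 's/^ *[0-9]+: *//' "$file" | sort | uniq -c | sort -rn | head -n "$n"
}
```

Running `err_line_count err` followed by `top_messages err 10` in a test run directory would show whether a single repeated message accounts for most of the file. Sorting tens of millions of lines can take a while, so this is best run on a compute or service node.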
@WenMeng-NOAA If a quick fix is not ready, we can reschedule this pr. We will move to #2145. @FernandoAndrade-NOAA @zach1221 @BrianCurtis-NOAA FYI
@jkbk2004 It seems to me the errors are not from the inline post code. It would be difficult for me to debug this issue.
@jkbk2004 You may move to the next PR process. Meanwhile I will investigate more.
I see the test is a debug test, but it seems to be lacking debug information (a lot of Unknown labels where we should see line numbers and subroutine names). @jkbk2004 can your team find out which subroutine is causing the issue?
@jkbk2004 The fix recommended by @DusanJovic-NOAA has been implemented on the UPP side. Both @FernandoAndrade-NOAA and I have run tests on WCOSS2 and Hera. This PR is ready for testing.
Hi, @WenMeng-NOAA. Can you sync up your branch here, so we can begin testing?
@zach1221 Done.
@SamuelTrahanNOAA I believe you would be responsible for IFI test issues?
EDIT: This is on Acorn TDS
First question: Would the changes in this PR cause baseline changes for IFI? I'm not sure Hera uses IFI, so it may not have been tested there.
If not, can you look at /lfs/h1/emc/nems/noscrub/brian.curtis/git/WenMeng-NOAA/ufs-weather-model
baseline dir = /lfs/h2/emc/nems/noscrub/emc.nems/RT/NEMSfv3gfs/develop-20240417/regional_ifi_control_intel
working dir = /lfs/h2/emc/ptmp/brian.curtis/FV3_RT/rt_13070/regional_ifi_control_intel
Checking test regional_ifi_control_intel results ....
Comparing dynf000.nc .....USING NCCMP......OK
Comparing dynf006.nc .....USING NCCMP......OK
Comparing phyf000.nc .....USING NCCMP......OK
Comparing phyf006.nc .....USING NCCMP......OK
Comparing PRSLEV.GrbF00 .....USING CMP......NOT IDENTICAL
Comparing PRSLEV.GrbF06 .....USING CMP......NOT IDENTICAL
Comparing NATLEV.GrbF00 .....USING CMP......NOT IDENTICAL
Comparing NATLEV.GrbF06 .....USING CMP......NOT IDENTICAL
Test regional_ifi_control_intel FAIL Tries: 2
These failed as well:
* TEST regional_ifi_decomp_intel: FAILED: UNABLE TO RUN COMPARISON
-- LOG: /lfs/h1/emc/nems/noscrub/brian.curtis/git/WenMeng-NOAA/ufs-weather-model/tests/logs/log_acorn/run_regional_ifi_decomp_intel.log
* TEST regional_ifi_2threads_intel: FAILED: UNABLE TO RUN COMPARISON
-- LOG: /lfs/h1/emc/nems/noscrub/brian.curtis/git/WenMeng-NOAA/ufs-weather-model/tests/logs/log_acorn/run_regional_ifi_2threads_intel.log
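The OK and NOT IDENTICAL results above come from comparing each output file in the working directory against the baseline directory. A minimal sketch of that kind of check follows. It is not the actual rt.sh code: the function name `compare_file` is hypothetical, and it covers only the byte-for-byte `cmp` case used for the GRIB files, not the `nccmp` case used for the netCDF files.

```shell
#!/usr/bin/env bash
# Hypothetical sketch of a baseline comparison for one file.
# Arguments: baseline directory, working directory, file name.
compare_file() {
  local base="$1" work="$2" name="$3"
  printf 'Comparing %s .....USING CMP......' "$name"
  # cmp -s is silent and returns 0 only when the files are byte-identical.
  if cmp -s "$base/$name" "$work/$name"; then
    echo "OK"
  else
    echo "NOT IDENTICAL"
    return 1
  fi
}
```

Because the comparison is byte-for-byte, any change in the UPP calculations that alters a field in PRSLEV or NATLEV will be reported as NOT IDENTICAL, even if the difference is scientifically expected.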
The only system that runs the IFI tests in the ufs-weather-model is Acorn. That is why it is important to run Acorn regression tests. UPP has its own regression test suite which tests IFI in the standalone UPP.
I will look at your output and get back to you soon.
It would be good to have an Acorn run of the IFI tests done before a PR reaches the commit queue, provided Acorn is stable, of course.
Gaea is running into issues during compile because GLIBCXX_3.4.30 is not found. I'm not sure whether this is related to the issues they have reported since the recent maintenance.
/gpfs/f5/epic/scratch/Fernando.Andrade-maldonado/RT_RUNDIRS/Fernando.Andrade-maldonado/FV3_RT/rt_125662/compile_rrfs_intel/err:
cmake: /opt/cray/pe/gcc/10.3.0/snos/lib64/libstdc++.so.6: version `GLIBCXX_3.4.30' not found (required by cmake)
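One way to confirm this kind of error is to check whether a given libstdc++.so.6 actually advertises the required version tag. The sketch below is an assumption-laden illustration, not something taken from the Gaea admins: the function name `has_glibcxx` is hypothetical, and it scans the NUL-separated strings in the library file, which is roughly what `strings <lib> | grep GLIBCXX` does.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeed if the library file contains the exact
# version tag (for example GLIBCXX_3.4.30) as a NUL-terminated string.
has_glibcxx() {
  local lib="$1" version="$2"
  # Turn NUL separators into newlines, then look for an exact,
  # fixed-string, whole-line match (-a treats binary data as text).
  tr '\000' '\n' < "$lib" | grep -aqxF -- "$version"
}
```

For example, `has_glibcxx /opt/cray/pe/gcc/10.3.0/snos/lib64/libstdc++.so.6 GLIBCXX_3.4.30 || echo "tag not provided"` would report whether the library named in the error message provides the tag that cmake requires.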
I examined the IFI fields and they look okay to me.
Note that the IFI fields are only in the NATLEV file, not the PRSLEV file. If you're seeing differences in the PRSLEV file, then it isn't caused by IFI.
@BrianCurtis-NOAA @SamuelTrahanNOAA The calculation updates in UPP change both the NATLEV and PRSLEV datasets.
@WenMeng-NOAA and @SamuelTrahanNOAA it seems this has answered my question, at least: the baseline changes I saw are expected. I'll create new baselines for the IFI tests and re-run comparisons for them as well. Thanks!
@jkbk2004 FYI, same error on develop during compile on Gaea. The sample run was with the regional_spp_sppt_shum_skeb_intel case.
Were you able to try a different version of CMake?
Unfortunately, I'm only seeing 3.23.1 available.
Per the Gaea admins' suggestion, a quick sample run with export LD_PRELOAD=/opt/cray/pe/gcc/12.2.0/snos/lib64/libstdc++.so.6
set beforehand resolved the error. @WenMeng-NOAA could you add this line to your PR while I run baseline creation and full RTs?
@FernandoAndrade-NOAA Could you specify which file should be updated?
@WenMeng-NOAA you can put 'export LD_PRELOAD=/opt/cray/pe/gcc/12.2.0/snos/lib64/libstdc++.so.6' somewhere around https://github.com/WenMeng-NOAA/ufs-weather-model/blob/upp_HR4/tests/rt.sh#L750-L751. @FernandoAndrade-NOAA please confirm.
I would say around line 726, just to be sure. @WenMeng-NOAA FYI
export LD_PRELOAD=/opt/cray/pe/gcc/12.2.0/snos/lib64/libstdc++.so.6
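If the workaround needs to be a little more defensive, the export could be guarded so that it only takes effect when the library is actually readable and so that any existing LD_PRELOAD value is preserved. This is a minimal sketch under those assumptions, not the line that was added to rt.sh; the function name `prepend_preload` is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the LD_PRELOAD value to use.
# Arguments: library to preload, current LD_PRELOAD value (may be empty).
prepend_preload() {
  local lib="$1" current="${2:-}"
  if [ ! -r "$lib" ]; then
    echo "warning: $lib not readable; leaving LD_PRELOAD unchanged" >&2
    printf '%s\n' "$current"
    return 1
  fi
  # Entries in LD_PRELOAD may be separated by colons or spaces.
  printf '%s\n' "${lib}${current:+:${current}}"
}

# Example (commented out so that sourcing this sketch changes nothing):
# export LD_PRELOAD="$(prepend_preload /opt/cray/pe/gcc/12.2.0/snos/lib64/libstdc++.so.6 "${LD_PRELOAD:-}")"
```

Printing the value rather than exporting it inside the function keeps the helper free of side effects, so the caller decides where in the script the environment is changed.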
@FernandoAndrade-NOAA @jkbk2004 Added. Thanks!
Leaving a note that baseline creation was successful; however, Gaea is unresponsive to the point that I can't create and copy the new baseline directory. I'll try again tomorrow morning. Apologies for the delays with Gaea.
Gaea and Derecho were extremely slow yesterday and our tests weren't running. Finishing up those two machines currently.
I think we should probably skip Gaea and proceed with the merging process. We can sync up the Gaea baselines later, when we're able to do so.
Fernando is reaching out to their admins again to ensure they're aware of the issue.
@WenMeng-NOAA the fv3atm sub-PR is merged. Can you please revert the change to the .gitmodules URL and update the submodule hash? Hash: https://github.com/NOAA-EMC/fv3atm/commit/da95cc428d8b626e99250fd57a4279b4980044f8
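The revert-and-update step can be sketched as a short sequence of git commands. The helper below only prints the commands so they can be reviewed before being run. It is an illustration under stated assumptions: the function name `print_submodule_update` is hypothetical, the submodule path `FV3` in the usage note is an assumption about the repository layout, and `git submodule set-url` requires Git 2.25 or newer.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the git commands that point a submodule
# back at its upstream URL and pin it to a given commit hash.
# Arguments: submodule path, upstream URL, commit hash.
print_submodule_update() {
  local path="$1" url="$2" hash="$3"
  cat <<EOF
git submodule set-url -- ${path} ${url}
git submodule sync -- ${path}
git -C ${path} fetch origin
git -C ${path} checkout ${hash}
git add .gitmodules ${path}
EOF
}
```

For this PR, a call such as `print_submodule_update FV3 https://github.com/NOAA-EMC/fv3atm da95cc428d8b626e99250fd57a4279b4980044f8` would list the steps, after which the staged changes can be committed and pushed to the PR branch.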
@zach1221 Done.
Commit Queue Requirements:
[x] Commit 'test_changes.list' from previous step
Description:
This PR updates the revision of the UPP submodule, which sits under the FV3 subcomponent. The main changes include the Rocky 8 transition and other UPP updates to post-processing for UFS-based global and regional applications.
Commit Message:
Priority:
High: Support global-workflow Rocky8 transition on Hera
Git Tracking
UFSWM:
None
Sub component Pull Requests:
UFSWM Blocking Dependencies:
None
Changes
Regression Test Changes (Please commit test_changes.list):
Input data Changes:
Library Changes/Upgrades:
Testing Log: