ufs-community / ufs-weather-model

UFS Weather Model

Update upp submodule #2213

Closed WenMeng-NOAA closed 3 weeks ago

WenMeng-NOAA commented 1 month ago

Commit Queue Requirements:

Commit Message:

* UFSWM - Update inline post
  * FV3 - Update upp submodule for inline post

Priority:

Sub component Pull Requests:

UFSWM Blocking Dependencies:

Input data Changes:

Library Changes/Upgrades:


Testing Log:

jkbk2004 commented 1 month ago

@WenMeng-NOAA @FernandoAndrade-NOAA can we schedule to work on this pr anytime this week?

WenMeng-NOAA commented 1 month ago

@jkbk2004 That would be great! Please let me know any actions from my end.

FernandoAndrade-NOAA commented 1 month ago

It looks like test_changes.list was overwritten during the sync. I'm recommitting it, as those changes were confirmed to be expected from this update.

FernandoAndrade-NOAA commented 1 month ago

@zach1221 @BrianCurtis-NOAA FYI getting started on testing this PR

WenMeng-NOAA commented 1 month ago

@FernandoAndrade-NOAA Yes, during my syncing process this morning, I was not sure which version of test_changes.list was needed.

FernandoAndrade-NOAA commented 1 month ago

No worries! We should be good to go now, thanks.

FernandoAndrade-NOAA commented 1 month ago

Baseline creation for the rap_clm_lake_debug_intel test failed on Hera, Gaea, and Jet due to timeouts. The err file is unusually massive on all machines; please note the line count. @jkbk2004 FYI.

Hera: /scratch1/NCEPDEV/stmp2/Fernando.Andrade-maldonado/FV3_RT/rt_1277667/rap_clm_lake_debug_intel/err

37280183   0: slurmstepd: error: *** STEP 58007894.0 ON h10c53 CANCELLED AT 2024-04-05T17:38:55 DUE TO TIME LIMIT ***
37280184 144: fv3.exe            000000000094783A  Unknown               Unknown  Unknown
37280185 144: fv3.exe            00000000011E4A59  Unknown               Unknown  Unknown
37280186 144: fv3.exe            0000000000A98E8A  Unknown               Unknown  Unknown
37280187 144: fv3.exe            0000000000967070  Unknown               Unknown  Unknown
37280188 144: fv3.exe            0000000000C96201  Unknown               Unknown  Unknown
37280189 144: fv3.exe            000000000042E9EB  MAIN__                    406  UFS.F90
37280190 144: fv3.exe            000000000042AEE2  Unknown               Unknown  Unknown
37280191 144: libc-2.28.so       0000153E58C0AD85  __libc_start_main     Unknown  Unknown
37280192 144: fv3.exe            000000000042ADEE  Unknown               Unknown  Unknown
37280193 149: forrtl: warning (406): fort: (1): In call to RSEARCH1, an array temporary was created for argument #4

Jet: /lfs4/HFIP/h-nems/Fernando.Andrade-maldonado/RT_RUNDIRS/Fernando.Andrade-maldonado/FV3_RT/rt_715069/rap_clm_lake_debug_intel/err

Gaea: /gpfs/f5/epic/scratch/Fernando.Andrade-maldonado/RT_RUNDIRS/Fernando.Andrade-maldonado/FV3_RT/rt_148379/rap_clm_lake_debug_intel/err
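
An err file with tens of millions of lines is easier to triage as a frequency table than by paging through it. The following is a minimal sketch, not part of the regression-test tooling: the function name `summarize_err` and its default of five messages are hypothetical, and it assumes the leading `144: ` style prefix is a per-task label that can be stripped so identical messages are counted together.

```shell
#!/usr/bin/env bash
# Summarize a large err file: print the total line count, then the most
# frequent messages with the leading "<number>: " label removed.
summarize_err() {
  local err_file="$1"
  local top_n="${2:-5}"   # number of distinct messages to show (hypothetical default)

  printf 'total lines: %s\n' "$(wc -l < "$err_file" | tr -d ' ')"
  # Strip the label, then count identical lines and list the most common first.
  sed -E 's/^[[:space:]]*[0-9]+:[[:space:]]*//' "$err_file" \
    | sort | uniq -c | sort -rn | head -n "$top_n"
}
```

Called as `summarize_err /path/to/err 10`, it shows whether a single repeated message, such as the `forrtl: warning (406)` line above, accounts for most of the file.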

jkbk2004 commented 1 month ago

@WenMeng-NOAA If a quick fix is not ready, we can reschedule this pr. We will move to #2145. @FernandoAndrade-NOAA @zach1221 @BrianCurtis-NOAA FYI

WenMeng-NOAA commented 1 month ago

@jkbk2004 It seems to me the errors are not from the inline post code. It would be difficult for me to debug this issue.

WenMeng-NOAA commented 1 month ago

@jkbk2004 You may move to the next PR process. Meanwhile I will investigate more.

BrianCurtis-NOAA commented 1 month ago

I see the test is a debug test, but it seems to be lacking debug information (a lot of "Unknown" labels where we should see lines and subroutines). @jkbk2004 can your team find out which subroutine is causing the issue?

WenMeng-NOAA commented 1 month ago

@jkbk2004 The fix recommended by @DusanJovic-NOAA has been implemented on the UPP side. Both @FernandoAndrade-NOAA and I have conducted tests on WCOSS2 and Hera. This PR is ready for testing.

zach1221 commented 4 weeks ago

Hi, @WenMeng-NOAA. Can you sync up your branch here, so we can begin testing?

WenMeng-NOAA commented 4 weeks ago

@zach1221 Done.

BrianCurtis-NOAA commented 4 weeks ago

@SamuelTrahanNOAA I believe you would be responsible for IFI test issues?

EDIT: This is on Acorn TDS

First question: Would the changes in this PR cause baseline changes for IFI? I'm not sure Hera uses IFI, so it wouldn't have been tested there?

If not, can you look at /lfs/h1/emc/nems/noscrub/brian.curtis/git/WenMeng-NOAA/ufs-weather-model?

baseline dir = /lfs/h2/emc/nems/noscrub/emc.nems/RT/NEMSfv3gfs/develop-20240417/regional_ifi_control_intel
working dir  = /lfs/h2/emc/ptmp/brian.curtis/FV3_RT/rt_13070/regional_ifi_control_intel
Checking test regional_ifi_control_intel results ....
 Comparing dynf000.nc .....USING NCCMP......OK
 Comparing dynf006.nc .....USING NCCMP......OK
 Comparing phyf000.nc .....USING NCCMP......OK
 Comparing phyf006.nc .....USING NCCMP......OK
 Comparing PRSLEV.GrbF00 .....USING CMP......NOT IDENTICAL
 Comparing PRSLEV.GrbF06 .....USING CMP......NOT IDENTICAL
 Comparing NATLEV.GrbF00 .....USING CMP......NOT IDENTICAL
 Comparing NATLEV.GrbF06 .....USING CMP......NOT IDENTICAL
Test regional_ifi_control_intel FAIL Tries: 2
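
To see at a glance which outputs changed, the comparison lines in a log like the one above can be filtered down to the file names alone. This is a minimal sketch that assumes the `Comparing <file> ... NOT IDENTICAL` line layout shown here; the function name `list_mismatches` is hypothetical and is not part of rt.sh.

```shell
#!/usr/bin/env bash
# Print the name of every output file that a regression-test log reports as
# "NOT IDENTICAL", one per line. Lines ending in "OK" are ignored.
list_mismatches() {
  local rt_log="$1"
  sed -n -E 's/^[[:space:]]*Comparing ([^[:space:]]+) .*NOT IDENTICAL[[:space:]]*$/\1/p' "$rt_log"
}
```

For the log above this would list the four PRSLEV and NATLEV GRIB files and none of the NetCDF files, which is the pattern discussed in the replies that follow.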

These failed as well:

* TEST regional_ifi_decomp_intel: FAILED: UNABLE TO RUN COMPARISON
-- LOG: /lfs/h1/emc/nems/noscrub/brian.curtis/git/WenMeng-NOAA/ufs-weather-model/tests/logs/log_acorn/run_regional_ifi_decomp_intel.log
* TEST regional_ifi_2threads_intel: FAILED: UNABLE TO RUN COMPARISON
-- LOG: /lfs/h1/emc/nems/noscrub/brian.curtis/git/WenMeng-NOAA/ufs-weather-model/tests/logs/log_acorn/run_regional_ifi_2threads_intel.log

SamuelTrahanNOAA commented 4 weeks ago

The only system that runs the IFI tests in the ufs-weather-model is Acorn. That is why it is important to run Acorn regression tests. UPP has its own regression test suite which tests IFI in the standalone UPP.

I will look at your output and get back to you soon.

BrianCurtis-NOAA commented 4 weeks ago

Would be good to have an Acorn run of the IFI tests done prior to a PR making it to commit queue, depending on if Acorn is stable, of course.

FernandoAndrade-NOAA commented 4 weeks ago

Gaea is running into issues during compile because GLIBCXX_3.4.30 is not found. I'm not sure whether this is related to the recent issues they've announced after the maintenance.

/gpfs/f5/epic/scratch/Fernando.Andrade-maldonado/RT_RUNDIRS/Fernando.Andrade-maldonado/FV3_RT/rt_125662/compile_rrfs_intel/err:

cmake: /opt/cray/pe/gcc/10.3.0/snos/lib64/libstdc++.so.6: version `GLIBCXX_3.4.30' not found (required by cmake)
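
One way to confirm this kind of mismatch is to list the `GLIBCXX_*` version tags that a given libstdc++.so.6 actually contains and check for the one cmake asks for. The sketch below is an illustration only: both function names are hypothetical, and it relies on the version tags being stored as plain strings inside the library, which `grep -a -o` can extract without extra tools.

```shell
#!/usr/bin/env bash
# List the distinct GLIBCXX_* version tags embedded in a shared library.
# grep -a treats the binary as text; -o prints only the matching part.
list_glibcxx_versions() {
  local lib="$1"
  grep -a -o -E 'GLIBCXX_[0-9]+(\.[0-9]+)*' "$lib" | sort -u
}

# Succeed only if the library provides exactly the requested version tag.
has_glibcxx_version() {
  local lib="$1" wanted="$2"
  list_glibcxx_versions "$lib" | grep -q -x -F "$wanted"
}
```

For example, `has_glibcxx_version /opt/cray/pe/gcc/10.3.0/snos/lib64/libstdc++.so.6 GLIBCXX_3.4.30 || echo missing` would be expected to print `missing` on a system showing the error above.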

SamuelTrahanNOAA commented 4 weeks ago

I examined the IFI fields and they look okay to me.

Note that the IFI fields are only in the NATLEV file, not the PRSLEV file. If you're seeing differences in the PRSLEV file, then it isn't caused by IFI.

WenMeng-NOAA commented 4 weeks ago

@BrianCurtis-NOAA @SamuelTrahanNOAA The calculation updates in UPP change both the NATLEV and PRSLEV datasets.

BrianCurtis-NOAA commented 4 weeks ago

@WenMeng-NOAA and @SamuelTrahanNOAA it seems this has answered my question at least. The baseline changes I saw are what I should expect. I'll create new baselines for the IFI tests and re-run comparisons for them as well. Thanks!

FernandoAndrade-NOAA commented 4 weeks ago

@jkbk2004 FYI, same error on develop during compile on Gaea. The sample run was with the regional_spp_sppt_shum_skeb_intel case

BrianCurtis-NOAA commented 4 weeks ago

Were you able to try a different version of CMake?

FernandoAndrade-NOAA commented 4 weeks ago

I'm only seeing 3.23.1 available unfortunately

FernandoAndrade-NOAA commented 4 weeks ago

Per the Gaea admins' suggestion, a quick sample run with export LD_PRELOAD=/opt/cray/pe/gcc/12.2.0/snos/lib64/libstdc++.so.6 set beforehand resolved the error. @WenMeng-NOAA, could you add this line to your PR while I run baseline creation and full RTs?

WenMeng-NOAA commented 4 weeks ago

@FernandoAndrade-NOAA Could you specify which file should be updated?

jkbk2004 commented 4 weeks ago

@WenMeng-NOAA you can put 'export LD_PRELOAD=/opt/cray/pe/gcc/12.2.0/snos/lib64/libstdc++.so.6' somewhere around https://github.com/WenMeng-NOAA/ufs-weather-model/blob/upp_HR4/tests/rt.sh#L750-L751. @FernandoAndrade-NOAA, please confirm.

FernandoAndrade-NOAA commented 4 weeks ago

I would say around line 726, just to be sure. @WenMeng-NOAA FYI

WenMeng-NOAA commented 4 weeks ago

export LD_PRELOAD=/opt/cray/pe/gcc/12.2.0/snos/lib64/libstdc++.so.6

@FernandoAndrade-NOAA @jkbk2004 Added. Thanks!
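
A slightly more defensive form of this workaround checks that the library exists and keeps any LD_PRELOAD value that is already set. The following is only a sketch under assumptions, not the line that was committed: the helper name `preload_value` is hypothetical, and the `MACHINE_ID` guard in the usage comment is assumed rather than taken from this PR.

```shell
#!/usr/bin/env bash
# Print the LD_PRELOAD value to use: the given library prepended to the
# current value, or the current value unchanged if the library is not readable.
preload_value() {
  local lib="$1" current="${2:-}"
  if [[ -r "$lib" ]]; then
    # Entries are colon-separated; add the separator only when needed.
    printf '%s\n' "${lib}${current:+:${current}}"
  else
    printf '%s\n' "$current"
  fi
}

# Hypothetical use inside a machine-specific branch of a test driver:
#   if [[ "${MACHINE_ID}" == gaea ]]; then
#     export LD_PRELOAD="$(preload_value /opt/cray/pe/gcc/12.2.0/snos/lib64/libstdc++.so.6 "${LD_PRELOAD:-}")"
#   fi
```

Keeping the path computation separate from the `export` means the guard can be exercised on any machine without actually preloading a library.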

FernandoAndrade-NOAA commented 4 weeks ago

Leaving a note that baseline creation was successful; however, Gaea is unresponsive to the point that I can't create and copy the new baseline directory. Trying again tomorrow morning. Apologies for the delays with Gaea.

zach1221 commented 3 weeks ago

Gaea and Derecho were extremely slow yesterday and our tests weren't running. Finishing up those two machines currently.

zach1221 commented 3 weeks ago

I think we should probably skip Gaea and proceed with the merging process. We can sync up the Gaea baselines later, when we're able to do so.

Fernando is reaching out to their admins again to ensure they're aware of the issue.

zach1221 commented 3 weeks ago

@WenMeng-NOAA The fv3atm sub-PR is merged. Can you please revert the URL change in .gitmodules and update the submodule hash? Hash: https://github.com/NOAA-EMC/fv3atm/commit/da95cc428d8b626e99250fd57a4279b4980044f8

WenMeng-NOAA commented 3 weeks ago

@zach1221 Done.