ufs-community / ufs-weather-model

UFS Weather Model

p8b (with aerosols) #1071

Closed JessicaMeixner-NOAA closed 2 years ago

JessicaMeixner-NOAA commented 2 years ago

PR Checklist

Instructions: All subsequent sections of text should be filled in as appropriate.

The information provided below allows the code managers to understand the changes relevant to this PR, whether those changes are in the ufs-weather-model repository or in a subcomponent repository. Ufs-weather-model code managers will use the information provided to add any applicable labels, assign reviewers and place it in the Commit Queue. Once the PR is in the Commit Queue, it is the PR owner's responsibility to keep the PR up-to-date with the develop branch of ufs-weather-model.

Description

This PR updates all p8 tests to the p8b settings, which includes GOCART. New baselines are required due to these settings updates. New input is also required for the BM ICs and additional GOCART inputs for p8.

Co-author: @rmontuoro

Physics settings were given by the physics group and can be confirmed by @yangfanglin @RuiyuSun @JongilHan66 and others

Issue(s) addressed

Link the issues to be closed with this PR, whether in this repository, or in another repository. (Remember, issues must always be created before starting work on a PR branch!)

Testing

These changes were tested on hera and orion

How were these changes tested? What compilers / HPCs was it tested with? Are the changes covered by regression tests? (If not, why? Do new tests need to be added?) Have regression tests and unit tests (utests) been run? On which platforms and with which compilers? (Note that unit tests can only be run on tier-1 platforms)

Dependencies

If testing this branch requires non-default branches in other repositories, list them. Those branches should have matching names (ideally).

Do PRs in upstream repositories need to be merged first? If so add the "waiting for other repos" label and list the upstream PRs

junwang-noaa commented 2 years ago

@JessicaMeixner-NOAA Since the input data are available on other platforms, would you please run the RT test on those platforms? Thanks

JessicaMeixner-NOAA commented 2 years ago

> @JessicaMeixner-NOAA Since the input data are available on other platforms, would you please run the RT test on those platforms? Thanks

I can run on orion, wcoss-dell, wcoss-cray and gaea. Someone else will have to run on jet and Cheyenne.

DeniseWorthen commented 2 years ago

I tried to compile on Cheyenne---I believe you still need to change the "p8b" to "p8" in the suite files themselves

<suite name="FV3_GFS_v17_coupled_p8b" version="1">

JessicaMeixner-NOAA commented 2 years ago

@DeniseWorthen thanks for finding this. I pushed the fix.

JessicaMeixner-NOAA commented 2 years ago

I have run into the following issue on wcoss-dell, see /gpfs/dell2/ptmp/Jessica.Meixner/FV3_RT/rt_46221/compile_001/err:

CMake Error at GOCART/CMakeLists.txt:68 (find_package):
  By not providing "FindMAPL.cmake" in CMAKE_MODULE_PATH this project has
  asked CMake to find a package configuration file provided by "MAPL", but
  CMake did not find one.

  Could not find a package configuration file provided by "MAPL" with any of
  the following names:

    MAPLConfig.cmake
    mapl-config.cmake

  Add the installation prefix of "MAPL" to CMAKE_PREFIX_PATH or set
  "MAPL_DIR" to a directory containing one of the above files.  If "MAPL"
  provides a separate development package or SDK, be sure it has been
  installed.

I assumed MAPL was installed on all platforms since it was included in the ufs_common module file?
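
For reference, the lookup that CMake reports as failing can be approximated by hand. The sketch below is not CMake's actual search procedure, which probes a fixed set of sub-directories under each prefix; it simply walks every directory listed in `CMAKE_PREFIX_PATH` and reports any `MAPLConfig.cmake` or `mapl-config.cmake` it finds. The function name `find_mapl_config` is hypothetical and used only for illustration.

```python
import os

# File names taken from the CMake error message above.
CONFIG_NAMES = ("MAPLConfig.cmake", "mapl-config.cmake")


def find_mapl_config(prefixes):
    """Return sorted paths of MAPL package config files found under any prefix.

    Simplified stand-in for CMake's search: each prefix is walked recursively.
    Prefixes that do not exist are silently skipped.
    """
    hits = []
    for prefix in prefixes:
        for dirpath, _dirnames, filenames in os.walk(prefix):
            for name in filenames:
                if name in CONFIG_NAMES:
                    hits.append(os.path.join(dirpath, name))
    return sorted(hits)


if __name__ == "__main__":
    # As an environment variable, CMAKE_PREFIX_PATH uses the platform's
    # PATH separator (':' on Linux).
    raw = os.environ.get("CMAKE_PREFIX_PATH", "")
    prefixes = [p for p in raw.split(os.pathsep) if p]
    for path in find_mapl_config(prefixes):
        print(path)
```

If this prints nothing after the modules are loaded, the loaded module set probably does not put a MAPL installation on the prefix path, which would be consistent with the error above.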

junwang-noaa commented 2 years ago

@kgerheiser @Hang-Lei-NOAA Do we have MAPL/yafyaml/gftl-shared installed on dell/jet/gaea/orion?

kgerheiser commented 2 years ago

Yes, we've been keeping the MAPL installations up-to-date as new versions of ESMF come out.

Hang-Lei-NOAA commented 2 years ago

Yes.

On Tue, Mar 1, 2022 at 3:24 PM Kyle Gerheiser @.***> wrote:

Yes, we've been keeping the MAPL installations up-to-date as new versions of ESMF come out.

JessicaMeixner-NOAA commented 2 years ago

I've done a module spider on mapl on mars and Venus, and neither has mapl/2.11.0-esmf-8.2.1b04 (I also get an issue trying to load jasper/2.0.22). Am I doing something wrong? I'm basically just loading the modules on the login node after a module purge: https://github.com/JessicaMeixner-NOAA/ufs-weather-model/blob/feature/p8b_aero/modulefiles/ufs_wcoss_dell_p3#L9-L19 and then https://github.com/JessicaMeixner-NOAA/ufs-weather-model/blob/feature/p8b_aero/modulefiles/ufs_common#L3-L23 ?

JessicaMeixner-NOAA commented 2 years ago

I do have the correct modules on hera, orion and gaea; it's just WCOSS-Dell where I'm having the MAPL issues.

Hang-Lei-NOAA commented 2 years ago

I will check this. Thanks

On Tue, Mar 1, 2022 at 3:42 PM Jessica Meixner @.***> wrote:

I do have the correct modules on hera, orion and gaea it's just WCOSS-dell that I'm having the MAPL issues.

DeniseWorthen commented 2 years ago

@JessicaMeixner-NOAA On cheyenne.intel, all p8 jobs compiled.

The cpld_control_c384_p8 fails at startup with MPT: shepherd terminated: r14i1n25.ib0.cheyenne.ucar.edu - job aborting. This can really be almost anything in my experience but it is usually memory related.

The cpld_bmark_p8 also failed, though I can't find any error message.

The cpld_bmark_p7 did not reproduce the baseline. Shouldn't this test still reproduce? Has it reproduced in your tests?

The cpld_control_p7_rrtmgp did reproduce.

JessicaMeixner-NOAA commented 2 years ago

@DeniseWorthen thanks for reporting these issues. I forgot about cpld_bmark_p7. It reproduced, except that there is currently no way to point the test at either the old or the new ICs. So if you undo the change to the line FV3_IC=@[INPUTDATA_ROOT_BMIC]/${SYEAR}${SMONTH}${SDAY}${SHOUR}/gfs_p8a/@[ATMRES]_L@[NPZ]/INPUT in tests/fv3_conf/cpld_control_run.IN and restore the old line, it should still reproduce. That said, it has been a while since I ran that, so I can double-check it on orion.

Likely both the cpld_control_c384_p8 and cpld_bmark_p8 are memory issues. I'll check the Cheyenne settings and see if I can push a change to the resources that might help.

While comparing with a new baseline on gaea, wcoss-* and orion, I realized I didn't update control_p8 with a recent settings change, so I'll push an update for that soon as well.
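
To make the FV3_IC line above easier to read, here is a minimal sketch of how its `@[NAME]` placeholders could be expanded. It assumes that the regression-test scripts fill in `@[NAME]` before the file is used and leave `${...}` for the shell to resolve at run time. This is an illustration, not the repository's own template code, and the values in the example are hypothetical.

```python
import re

# The FV3_IC template line quoted in the comment above.
TEMPLATE = (
    "FV3_IC=@[INPUTDATA_ROOT_BMIC]/${SYEAR}${SMONTH}${SDAY}${SHOUR}"
    "/gfs_p8a/@[ATMRES]_L@[NPZ]/INPUT"
)


def expand_at_vars(text, values):
    """Replace each @[NAME] with values[NAME]; ${...} shell variables are untouched."""

    def _sub(match):
        name = match.group(1)
        if name not in values:
            raise KeyError(f"no value for @[{name}]")
        return str(values[name])

    return re.sub(r"@\[([A-Za-z_][A-Za-z0-9_]*)\]", _sub, text)


if __name__ == "__main__":
    # Hypothetical values, for illustration only.
    example = {"INPUTDATA_ROOT_BMIC": "/path/to/BM_IC", "ATMRES": "C384", "NPZ": "127"}
    print(expand_at_vars(TEMPLATE, example))
    # FV3_IC=/path/to/BM_IC/${SYEAR}${SMONTH}${SDAY}${SHOUR}/gfs_p8a/C384_L127/INPUT
```

Because `gfs_p8a` is hard-coded in the template, a test that needs the older ICs has no placeholder to override, which is the limitation described above.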

JessicaMeixner-NOAA commented 2 years ago

For the FV3_IC and the p7 test reproducing, I can make that a variable, but I had decided that wasn't necessary. If others think it is, I can add a variable.

DeniseWorthen commented 2 years ago

I suspect the issue w/ the p7 test is also from #export DZ_MIN=2 in default_vars. It is not re-set to 2 in the p7 test.

JessicaMeixner-NOAA commented 2 years ago

Thanks for the heads up. It seems it has been longer than I remembered since I last checked this test. I'll add confirming this to my to-do list.

JessicaMeixner-NOAA commented 2 years ago

@DeniseWorthen I pushed a recommended change for memory for Cheyenne - let me know if that doesn't work and we'll try something else.

DeniseWorthen commented 2 years ago

I think setting dz_min back to 2 for cpld_bmark_p7 does allow the baseline to reproduce. I tested by just changing the input.nml and then comparing the mediator restart file against the baseline.
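
A quick check of this kind can be sketched as a byte-for-byte comparison of the run output against the baseline copy. This assumes that "reproduces" means the two files are identical on disk; the function name `matches_baseline` is hypothetical, and a netCDF-aware comparison may be more appropriate for some files.

```python
import filecmp


def matches_baseline(run_file, baseline_file):
    """Return True if the two files are byte-for-byte identical."""
    # shallow=False forces a content comparison instead of trusting
    # file size and modification time alone.
    return filecmp.cmp(run_file, baseline_file, shallow=False)
```

It would be called with the path of the mediator restart file from the run directory and the path of the corresponding file in the baseline directory.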

JessicaMeixner-NOAA commented 2 years ago

@Hang-Lei-NOAA @kgerheiser -- I was able to load the MAPL module on WCOSS-Dell today and am assuming it's ready to use. Thank you!

@DeniseWorthen @junwang-noaa to get cpld_bmark_p7 to reproduce, I need to add DZ_MIN=2, as Denise mentioned above. I also needed to add WRITE_NSFLIP=.false. (this is the default value, if I'm not mistaken) and change the input directory to use the p7 atm input files. Please let me know which of these changes you'd like me to keep, other than of course adding DZ_MIN=2. The log file and changes can be seen on orion here: /work/noaa/marine/jmeixner/p8b/ufs-p7test

DeniseWorthen commented 2 years ago

@JessicaMeixner-NOAA We do want to maintain the current bmark_p7 test reproducing the current baseline. Once the P8 prototype is finalized we will drop the test.

I think the simplest solution would be to add a variable such as P7IC = .false. as a default, use that variable in fv3_conf/cpld_control_run.IN to control the source directory, and then set it to true for the cpld_bmark_p7 test.

For Cheyenne.intel, the c384 test was able to run with the changed resources but not the bmark_p8. From experience I'm not sure that the c384 working this time means it will work a second time. I will test again after you push the dz_min and the IC fix.
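
The default-plus-override idea suggested above can be sketched as follows. This is only an illustration of the pattern, not the change that was pushed: `P7IC` is the variable name proposed in the comment, while `resolve_settings`, `ic_source` and the directory placeholders are hypothetical.

```python
# Defaults shared by every test; P7IC is the switch proposed above.
DEFAULTS = {"P7IC": False}


def resolve_settings(overrides=None):
    """Merge a test's own settings over the shared defaults."""
    settings = dict(DEFAULTS)
    settings.update(overrides or {})
    return settings


def ic_source(settings, p8_dir, p7_dir):
    """Pick the initial-condition directory according to the P7IC switch."""
    return p7_dir if settings["P7IC"] else p8_dir


if __name__ == "__main__":
    # Placeholder directory names, not real paths from the repository.
    p8_dir, p7_dir = "<p8 IC directory>", "<p7 IC directory>"
    print(ic_source(resolve_settings(), p8_dir, p7_dir))                # default tests
    print(ic_source(resolve_settings({"P7IC": True}), p8_dir, p7_dir))  # cpld_bmark_p7
```

Only the one test that needs the older inputs sets the switch, so every other test keeps the new default without any change to its own settings.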

JessicaMeixner-NOAA commented 2 years ago

I pushed changes for the p7 test and an update to the p8 Cheyenne settings

JessicaMeixner-NOAA commented 2 years ago

I'm creating the baseline on wcoss-dell and running the full set after the updates on orion against the existing baselines so we'll have all that information.

DeniseWorthen commented 2 years ago

@JessicaMeixner-NOAA I've tested with the latest changes. The p7 bmark test now reproduces, so thanks for that fix.

The bmark_p8 test runs but it takes almost 26 minutes, which is probably cutting it too fine for our 30min test limit. The c384_p8 test also runs, but clocks in at about 28 minutes.

However, now the c192 p8 test failed w/ the MPT shepherd error. I went back and confirmed that one had previously run. I think this shows a case that when we're close to some limit, it sometimes works and sometimes doesn't.

JessicaMeixner-NOAA commented 2 years ago

@DeniseWorthen would increasing the time limits for those two tests be acceptable? Alternatively, I can increase the threads again so that additional memory is available, and I could attempt to further balance the test if there is a machine I have access to that has similar performance to Cheyenne. Does Cheyenne provide any memory usage information when running that you could share?

DeniseWorthen commented 2 years ago

We'll need to discuss at the morning tag-up. We may need to turn the tests off on Cheyenne.

I'm not sure what sort of memory usage report you might be referring to. Is there something I should look for---information that can be retrieved on other platforms?

JessicaMeixner-NOAA commented 2 years ago

I know, for example, that on WCOSS-Dell the out file includes things like:

Resource usage summary:

    CPU time :                                   24514.00 sec.
    Max Memory :                                 89629 MB
    Average Memory :                             60212.12 MB

but not all machines have this type of information.
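
Where a job output file does contain a summary in this form, the numbers can be pulled out with a short script. The sketch below assumes the exact `label : value unit` layout shown above; other schedulers will format this differently, and the function name `parse_usage` is hypothetical.

```python
import re

# The summary block quoted above.
SAMPLE = """Resource usage summary:

    CPU time :                                   24514.00 sec.
    Max Memory :                                 89629 MB
    Average Memory :                             60212.12 MB
"""

FIELD = re.compile(
    r"^\s*(CPU time|Max Memory|Average Memory)\s*:\s*([0-9.]+)\s*(\S+)",
    re.MULTILINE,
)


def parse_usage(text):
    """Return {label: (value, unit)} for the fields found in a usage summary."""
    return {
        m.group(1): (float(m.group(2)), m.group(3).rstrip("."))
        for m in FIELD.finditer(text)
    }


if __name__ == "__main__":
    for label, (value, unit) in parse_usage(SAMPLE).items():
        print(f"{label}: {value} {unit}")
```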

JessicaMeixner-NOAA commented 2 years ago

@DeniseWorthen I have worked on other platforms to see if I can figure out other combinations to make the model a little faster. I can bump up to 8x8 (from 6x8) for the atm model and take a few cores away from the other components but you don't see a huge speed up (although it does in fact speed up a little).

On hera for example:

Current 6x8 settings (28 nodes):

  0: The total amount of wall time                        = 999.551893
  0: The total amount of time in user mode                = 1450.440496

8x8 reducing other components (28 nodes):

  TASKS_cpl_bmrk_aero=560; TPN_cpl_bmrk_aero=20; INPES_cpl_bmrk_aero=8; JNPES_cpl_bmrk_aero=8
  THRD_cpl_bmrk_aero=2; WPG_cpl_bmrk_aero=24; MPB_cpl_bmrk_aero="0 383"; APB_cpl_bmrk_aero="0 407"
  CHM_cpl_bmrk_aero="0 383"; OPB_cpl_bmrk_aero="384 483"; IPB_cpl_bmrk_aero="484 507"; WPB_cpl_bmrk_aero="508 559"
  NPROC_ICE_cpl_bmrk_aero=24

Timing:

  0: The total amount of wall time                        = 952.125671
  0: The total amount of time in user mode                = 1287.446964

8x12, reducing other components (39 nodes):

  TASKS_cpl_bmrk_aero=780; TPN_cpl_bmrk_aero=20; INPES_cpl_bmrk_aero=12; JNPES_cpl_bmrk_aero=8
  THRD_cpl_bmrk_aero=2; WPG_cpl_bmrk_aero=24; MPB_cpl_bmrk_aero="0 575"; APB_cpl_bmrk_aero="0 599"
  CHM_cpl_bmrk_aero="0 575"; OPB_cpl_bmrk_aero="600 699"; IPB_cpl_bmrk_aero="700 719"; WPB_cpl_bmrk_aero="720 779"
  NPROC_ICE_cpl_bmrk_aero=20

Timing:

  0: The total amount of wall time                        = 754.271854
  0: The total amount of time in user mode                = 983.948570

On Cheyenne, since you need to use 3-4 threads for memory, the node count will get very high very quickly, but I thought this might be helpful as decisions are made on the best path forward.
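
The trade-off in the hera numbers above can be made explicit with a little arithmetic. The sketch below assumes that the node count is TASKS divided by TPN (tasks per node), which matches the 28 and 39 nodes stated above; the wall-clock times are copied from the timing output, and "node-seconds" is just nodes multiplied by wall time.

```python
# Wall-clock times (seconds) and layouts quoted above for hera.
RUNS = {
    "6x8 (current)": {"wall": 999.551893, "nodes": 28},
    "8x8": {"wall": 952.125671, "tasks": 560, "tpn": 20},
    "8x12": {"wall": 754.271854, "tasks": 780, "tpn": 20},
}


def node_count(run):
    """Use the stated node count if given, else assume ceil(TASKS / TPN)."""
    if "nodes" in run:
        return run["nodes"]
    return -(-run["tasks"] // run["tpn"])  # ceiling division


def summarize(runs, reference):
    """Return nodes, speedup over the reference run, and node-seconds per run."""
    ref_wall = runs[reference]["wall"]
    summary = {}
    for name, run in runs.items():
        nodes = node_count(run)
        summary[name] = {
            "nodes": nodes,
            "speedup": ref_wall / run["wall"],
            "node_seconds": nodes * run["wall"],
        }
    return summary


if __name__ == "__main__":
    for name, row in summarize(RUNS, "6x8 (current)").items():
        print(
            f"{name:14s} nodes={row['nodes']:3d} "
            f"speedup={row['speedup']:.3f} node-seconds={row['node_seconds']:.0f}"
        )
```

With these inputs the 8x8 layout is about 1.05 times faster on the same 28 nodes, while the 8x12 layout is about 1.33 times faster but uses 39 nodes, so it costs slightly more node-seconds than the current settings.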

The Dell logs were added. Note that I had issues with Dell over the past several days, not just for this PR: a different test or compile would fail each time, although I had no trouble creating the baseline. All the tests do pass, however.

Please let me know if there are other outstanding issues I can help address.

DeniseWorthen commented 2 years ago

@JessicaMeixner-NOAA Thanks for those tests. We have been having various compile issues w/ Dell which we believe we now have a solution for. That may be the cause of your variable test results.

For Cheyenne, @jkbk2004 was going to try to do some testing on Cheyenne for your PR.

jkbk2004 commented 2 years ago

@JessicaMeixner-NOAA @DeniseWorthen Yeah, submitted RT-intel runs on Cheyenne. will keep you posted.

jkbk2004 commented 2 years ago

@JessicaMeixner-NOAA @DeniseWorthen @junwang-noaa on cheyenne/intel: some check failures. most expensive ones are around 20 minutes. rt-1071-fails rt-1071-timings

JessicaMeixner-NOAA commented 2 years ago

@jkbk2004 these tests are expected to have different answers than the existing baselines. Is this compared against the existing baseline or a new one?

jkbk2004 commented 2 years ago

@JessicaMeixner-NOAA do I need to test with develop-20220228? I tested with develop-20220224.

JessicaMeixner-NOAA commented 2 years ago

This PR will change baselines for the tests that failed, so if you're testing against the existing baseline, your results are as expected.

junwang-noaa commented 2 years ago

Please merge to the top of develop branches, and change the BL_DATE to 20220304 in rt.sh. Thanks
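
For anyone scripting this step, a minimal sketch of the date bump is shown below. It assumes rt.sh contains a line of the form `BL_DATE=YYYYMMDD`, optionally prefixed with `export`; that layout is an assumption, and the function name `set_bl_date` is hypothetical.

```python
import re


def set_bl_date(script_text, new_date):
    """Return script_text with every BL_DATE=YYYYMMDD assignment set to new_date."""
    if not re.fullmatch(r"\d{8}", new_date):
        raise ValueError("new_date must be eight digits, e.g. 20220304")
    updated, count = re.subn(
        r"^(\s*(?:export\s+)?BL_DATE=)\d{8}\b",
        rf"\g<1>{new_date}",
        script_text,
        flags=re.MULTILINE,
    )
    if count == 0:
        raise ValueError("no BL_DATE=YYYYMMDD line found")
    return updated
```

Raising an error when no matching line is found avoids silently leaving the old baseline date in place.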

JessicaMeixner-NOAA commented 2 years ago

20220304

Done!

junwang-noaa commented 2 years ago

@JessicaMeixner-NOAA Please create new baseline and run RT on wcoss. Please let me know if you have issue copying new baseline. Thanks.

JessicaMeixner-NOAA commented 2 years ago

@junwang-noaa are we anticipating more updates or should I start creating baselines/running things on wcoss?

BrianCurtis-NOAA commented 2 years ago

Automated RT Failure Notification
Machine: cheyenne
Compiler: intel
Job: BL
[BL] Repo location: /glade/scratch/dtcufsrt/autort/tests/auto/pr/867250832/20220304133011/ufs-weather-model
[BL] Error: Test cpld_control_c192_p8 006 failed in run_test failed
Please make changes and add the following label back: cheyenne-intel-BL

junwang-noaa commented 2 years ago

> @junwang-noaa are we anticipating more updates or should I start creating baselines/running things on wcoss?

No. We have all the code changes. You can start creating baselines on wcoss.

junwang-noaa commented 2 years ago

@jkbk2004 Jong, would you please take a look at the cpld_control_c192_p8 RT test on Cheyenne? Thanks

jkbk2004 commented 2 years ago

@junwang-noaa I am checking.

BrianCurtis-NOAA commented 2 years ago

Automated RT Failure Notification
Machine: jet
Compiler: intel
Job: BL
[BL] Repo location: /lfs4/HFIP/h-nems/emc.nemspara/autort/pr/867250832/20220304201513/ufs-weather-model
[BL] Error: Test compile_001 failed in run_compile failed
[BL] Error: Test compile_002 failed in run_compile failed
Please make changes and add the following label back: jet-intel-BL

junwang-noaa commented 2 years ago

@kgerheiser Do we have MAPL installed on jet? I saw error:

Force 32-bit build for GOCART
CMake Error at GOCART/CMakeLists.txt:69 (include):
  include could not find requested file:

    mapl_acg

CMake Error at GOCART/ESMF/Aerosol_GridComp/CMakeLists.txt:9 (mapl_acg):
  Unknown CMake command "mapl_acg".

kgerheiser commented 2 years ago

Yep, it's there.

module load esmf/8.2.1b04
module load mapl/2.11.0-esmf-8.2.1b04

junwang-noaa commented 2 years ago

@rmontuoro Do you know if the mapl_acg is included in mapl/2.11.0?

BrianCurtis-NOAA commented 2 years ago

Automated RT Failure Notification
Machine: orion
Compiler: intel
Job: BL
[BL] Repo location: /work/noaa/nems/emc.nemspara/autort/pr/867250832/20220304141534/ufs-weather-model
[BL] Baseline creation and move successful
[RT] Repo location: /work/noaa/nems/emc.nemspara/autort/pr/867250832/20220304164615/ufs-weather-model
[RT] Error: Test hafs_regional_telescopic_2nests_atm 104 failed in run_test failed
Please make changes and add the following label back: orion-intel-BL

junwang-noaa commented 2 years ago

@rmontuoro @JessicaMeixner-NOAA would you please confirm that mapl_acg is in mapl/2.11.0? According to @kgerheiser, MAPL/2.11.0 is installed on jet.

jkbk2004 commented 2 years ago

@JessicaMeixner-NOAA @junwang-noaa cpld_control_c192_p8 crashes on cheyenne. MPT: shepherd terminated: r1i5n7.ib0.cheyenne.ucar.edu - job aborting

JessicaMeixner-NOAA commented 2 years ago

Can someone help me move baselines on WCOSS to the official area?

On WCOSS-cray the baseline is here: /gpfs/hps3/stmp/Jessica.Meixner/FV3_RT/REGRESSION_TEST
On WCOSS-Dell the baseline is here: /gpfs/dell2/stmp/Jessica.Meixner/FV3_RT/REGRESSION_TEST

I've emailed the WCOSS Helpdesk to reset my password, but I do not know how long that will take. Note, it took me about 3+ tries over the weekend to finally get the WCOSS-DELL baseline to generate w/out issues. (It was a different issue each time and unrelated to the updates made here from what I could tell).

junwang-noaa commented 2 years ago

I am copying the data on venus. On surge, I got:

SURGE-slogin1 > ls /gpfs/hps3/stmp/Jessica.Meixner/FV3_RT/REGRESSION_TEST
ls: cannot access /gpfs/hps3/stmp/Jessica.Meixner/FV3_RT/REGRESSION_TEST: No such file or directory