ufs-community / ufs-mrweather-app

UFS Medium-Range Weather Application

Update externals to address chgres_cube.exe problems on cheyenne and orion (and possibly gaea) #200

Closed climbfuji closed 4 years ago

climbfuji commented 4 years ago

Three changes together fix the segmentation faults of chgres_cube.exe that we got on Cheyenne: downgrading the Intel compiler from 19.x.y to 18.m.n, removing various Cheyenne settings in CIME that are unnecessary or incorrect for the UFS, and adjusting the number of nodes and tasks per node for case.chgres.

This was tested by running the UFS MRWeather App regression test suite five times in a row over the weekend on Cheyenne, and in manual tests (running the workflow end to end) on Orion. The Orion test was run only once, for C768, but it worked.

The changes for the ufs-weather-model have been merged and the repo has been retagged as ufs-v1.1.0, see https://github.com/ufs-community/ufs-weather-model/pull/204:

commit 5bea16b6d41d810dc2e45cba0fa3841f45ea7c7a (HEAD -> release/public-v1, tag: ufs-v1.1.0, origin/release/public-v1)
Author: Dom Heinzeller <dom.heinzeller@icloud.com>
Date:   Mon Sep 21 14:59:48 2020 -0600

    release/public-v1: update ccpp-physics (fix unicode errors), downgrade Intel compiler (#204)

    * Update .gitmodules and submodule pointer for fv3atm
    * Revert Intel compiler back from 19.x.y to 18.m.n on Cheyenne and Gaea
    * Revert change to .gitmodules and update submodule pointer for fv3atm

A corresponding PR to update the documentation for installing the libraries has been merged into NCEPLIBS-external and the repo was retagged: https://github.com/NOAA-EMC/NCEPLIBS-external/pull/70.

A corresponding PR for CIME has been issued here, https://github.com/ESMCI/cime/pull/3713/files, with some questions right at the top about the correct target branch. This must be merged now that the ufs-weather-model has been updated.

A corresponding PR for FV3GFS_Interface has been issued here, https://github.com/ESCOMP/FV3GFS_interface/pull/12, again with questions about the correct target branch. This has been tested to work on Cheyenne for the full suite of regression tests, and on Gaea and Orion for C96 and C768 in end-to-end workflow tests.

climbfuji commented 4 years ago

@ligiabernardet @uturuncoglu @panll @jedwards4b please help me to push the changes that fix the chgres_cube.exe problem across the finish line.

uturuncoglu commented 4 years ago

@climbfuji please see my comment on https://github.com/ESCOMP/FV3GFS_interface/pull/12

ligiabernardet commented 4 years ago

@climbfuji During your tests performed after these updates, did you experience any compilation problems such as the one reported in https://github.com/ufs-community/ufs-mrweather-app/issues/169?

climbfuji commented 4 years ago

> @climbfuji During your tests performed after these updates, did you experience any compilation problems such as the one reported in #169?

No, those were gone after @uturuncoglu reverted to the MRW 1.0 buildlib.

uturuncoglu commented 4 years ago

@climbfuji @ligiabernardet I merged the PR. Do you want me to make a PR for the app?

climbfuji commented 4 years ago

> @climbfuji @ligiabernardet I merged the PR. Do you want me to make a PR for the app?

We need to update CIME as well. Maybe create tags for all submodules (externals) that don't have them yet (CIME, fv3gfs_interface, NEMS_interface), and then update Externals.cfg?

climbfuji commented 4 years ago

Update 20200922/1: the FV3GFS_Interface PR was merged. The new hash to use (in the form of a tag?) in the MRW App is:

commit 523f22e62c4fdd3713ac2ab3755af642f62c9278 (HEAD, origin/ufs-release-v1.1)
Merge: 0bd4233 0330e9c
Author: Ufuk Turunçoğlu <turuncu@ucar.edu>
Date:   Tue Sep 22 10:42:58 2020 -0600

    Merge pull request #12 from climbfuji/adjust_chgres_nodes_task_per_node

    release/public-v1: adjust tasks per node and number of nodes for chgres pre-processing step; return to old version of buildlib used in UFS MR-Weather App release version 1.0

uturuncoglu commented 4 years ago

@climbfuji are we ready to tag? Let's discuss it in the call today, and based on what we decide I can make the required changes.

climbfuji commented 4 years ago

> @climbfuji are we ready to tag? Let's discuss it in the call today, and based on what we decide I can make the required changes.

Sounds good to me. We can also simply forward the hashes in the app (after merging CIME/creating the branch) and ask others (@panll @llpcarson) to test it before tagging.

climbfuji commented 4 years ago

Update 20200922/2: the CIME PR was merged into branch ufs_release_v1.1 (forced update). This is the new hash to use:

commit cac6a3e17d80993abcd0811be093023c3b607f0f (HEAD, origin/ufs_release_v1.1)
Merge: e0dcc5902 370072fb0
Author: Bill Sacks <sacks@ucar.edu>
Date:   Tue Sep 22 14:32:22 2020 -0600

    Merge pull request #3713 from climbfuji/cheyenne_intel_18

    ufs_release_v1.1: remove unnecessary/incorrect configuration options for Cheyenne for the UFS; downgrade Intel 19.x.y to 18.m.n on Cheyenne, Gaea, Orion

    ...

uturuncoglu commented 4 years ago

@climbfuji @ligiabernardet I'll update the app and let you know.
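
For readers following along, here is a minimal sketch of what "forwarding the hashes" in Externals.cfg could look like. It assumes the file uses the INI-style layout read by manage_externals, with `hash`, `protocol`, `repo_url`, `local_path` and `required` keys. The section names, the `local_path` values, and the placeholder old hashes are hypothetical and are not copied from the app's actual file. The two new hashes are the ones quoted in the updates above.

```python
import configparser
import io

# Hypothetical stand-in for the app's Externals.cfg. Section names,
# local_path values and the old "0000000" hashes are assumptions.
SAMPLE_EXTERNALS = """\
[cime]
hash = 0000000
protocol = git
repo_url = https://github.com/ESMCI/cime
local_path = cime
required = True

[fv3gfs_interface]
hash = 0000000
protocol = git
repo_url = https://github.com/ESCOMP/FV3GFS_interface
local_path = src/model/FV3/cime
required = True
"""

# Commit hashes quoted in this thread (updates 20200922/1 and 20200922/2).
NEW_HASHES = {
    "cime": "cac6a3e17d80993abcd0811be093023c3b607f0f",
    "fv3gfs_interface": "523f22e62c4fdd3713ac2ab3755af642f62c9278",
}


def forward_hashes(cfg_text, new_hashes):
    """Return cfg_text with each listed section pinned to the given hash."""
    parser = configparser.ConfigParser()
    parser.optionxform = str  # keep option names exactly as written
    parser.read_string(cfg_text)
    for section, commit in new_hashes.items():
        if not parser.has_section(section):
            raise KeyError(f"no section [{section}] in externals file")
        # Drop any tag/branch entry so the section names a single reference.
        for key in ("tag", "branch"):
            parser.remove_option(section, key)
        parser.set(section, "hash", commit)
    out = io.StringIO()
    parser.write(out)
    return out.getvalue()


if __name__ == "__main__":
    print(forward_hashes(SAMPLE_EXTERNALS, NEW_HASHES))
```

Once tags exist for these externals, the same function could set a `tag` key instead of `hash`. Which of the two the release should use is the open question discussed above.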