ufs-community / ufs-weather-model

UFS Weather Model

WCOSS2: pio install does not seem to support pnetcdf #2232

Open BrianCurtis-NOAA opened 1 month ago

BrianCurtis-NOAA commented 1 month ago

Description

PR #2145 brought in a change where CICE switched to use pnetcdf in PIO instead of hdf5. This worked on all machines except WCOSS2.

This leads us to believe that the PIO install on WCOSS2 was not built with proper pnetcdf support.

Efforts are ongoing to determine the specifics of any build differences between spack-stack and hpc-stack on WCOSS2.

To Reproduce:

Run the cpld_control_gfsv17 intel RT with the develop branch of UFSWM (from PR #2145) on the WCOSS2 dev machine.

Needed alongside the fix for this issue

  1. Remove the temporary workaround in default_vars.sh for WCOSS2

junwang-noaa commented 1 month ago

@HangLei-NOAA would you please check the library on wcoss2 and install a test version of netcdf with pio on acorn for us to test? Thanks

BrianCurtis-NOAA commented 1 month ago

Apologies for not posting this here sooner, but Hang has made an install at: /lfs/h2/emc/eib/save/Hang.Lei/forgdit/nco_wcoss2/install

DeniseWorthen commented 1 month ago

@BrianCurtis-NOAA Are you testing Hang's install?

BrianCurtis-NOAA commented 1 month ago

Yes. I will do more today once I get today's PR started.

BrianCurtis-NOAA commented 4 weeks ago

using the updated libraries: /lfs/h2/emc/ptmp/brian.curtis/FV3_RT/rt_63569/

MPICH ERROR [Rank 170] [job id 0882c1cf-0967-46e2-88a1-63f55d8cd95f] [Wed Apr 17 17:48:46 2024] [nid001019] - Abort(128) (rank 170 in comm 0): application called MPI_Abort(MPI_COMM_WORLD, 128) - process 170
 (abort_ice)ABORTED:
 (abort_ice) called from ice_pio.F90
 (abort_ice) line number          223
 (abort_ice) error =
 (ice_pio_check)Unknown Error, (ice_pio_init) ERROR: Failed to create file ./history/iceh_ic.2021-03-22-21600.nc

Hang-Lei-NOAA commented 4 weeks ago

@BrianCurtis-NOAA Please let me know if a specific version of UFS is being used for the testing. I just finished the GSI library task and will start the UFS test. If I still get the failure, I will install a new pnetcdf library. Currently, we are using the system-installed pnetcdf library.

BrianCurtis-NOAA commented 4 weeks ago

@Hang-Lei-NOAA the develop branch of ufs-weather-model has the issue, use:

./rt.sh -a <ACCNR> -n "cpld_control_gfsv17 intel"

but first remove (or comment out): https://github.com/ufs-community/ufs-weather-model/blob/47c00995706380f9827a7e98bd242ffc6e4a7f01/tests/default_vars.sh#L781-L784

Hang-Lei-NOAA commented 3 weeks ago

@BrianCurtis-NOAA I have checked, and fixed this by rebuilding netcdf/4.9.2 with pnetcdf. Log: /lfs/h2/emc/eib/noscrub/Hang.Lei/works/brianufs/tests/logs/log_wcoss2/run_cpld_control_gfsv17_intel.log

Please use: /lfs/h2/emc/eib/noscrub/Hang.Lei/works/brianufs/modulefiles/ufs_wcoss2.intel.lua

Please copy it soon; I will run more sensitivity tests on the UFS using the system-installed libs this afternoon after 3 pm. Thanks.

BrianCurtis-NOAA commented 3 weeks ago

@Hang-Lei-NOAA I can confirm that your lua file works for that test. Please proceed with getting these adjustments made on WCOSS2 Dev.

Hang-Lei-NOAA commented 3 weeks ago

@junwang-noaa @BrianCurtis-NOAA Bongi has set up an installation on the Acorn system. Please load the intel environment and you will see the following. For now, here is what is deployed to production on Acorn:

$ module -t avail 2>&1 | grep -- "-C/."
esmf-C/8.6.0
fms-C/2023.04
hdf5-C/1.14.0
mapl-C/2.40.3
netcdf-C/4.9.2
pio-C/2.5.10
pnetcdf-C/1.12.2

Please test them and let me know if you find any issues. Thanks

BrianCurtis-NOAA commented 3 weeks ago

I've been able to load those modules and build/compile a test case OK. I am running the full suite now on Acorn using WCOSS2 setup. I will pass along the results as soon as I can.

BrianCurtis-NOAA commented 3 weeks ago

@Hang-Lei-NOAA Bongi needed Acorn for other things today, so I was only able to run a subset of tests with the -C libraries. They covered cpld, control, regional, 2threads, mpi, restarts, p8, gfsv17, and decomp, plus the problem case from before, and all passed (PASS). I'm comfortable saying the -C libraries are OK to use on WCOSS2.

brian.curtis@alogin03:/lfs/h1/emc/nems/noscrub/brian.curtis/git/ufs-community/ufs-weather-model/tests/logs/log_acorn> grep -ril PASS ./rt*.log
./rt_control_2threads_p8_intel.log
./rt_control_c192_intel.log
./rt_control_c384gdas_intel.log
./rt_control_c384_intel.log
./rt_control_c48_intel.log
./rt_control_c48.v2.sfc_intel.log
./rt_control_CubedSphereGrid_intel.log
./rt_control_CubedSphereGrid_parallel_intel.log
./rt_control_decomp_p8_intel.log
./rt_control_flake_intel.log
./rt_control_iovr4_intel.log
./rt_control_iovr5_intel.log
./rt_control_latlon_intel.log
./rt_control_lndp_intel.log
./rt_control_noqr_p8_intel.log
./rt_control_p8_lndp_intel.log
./rt_control_p8_mynn_intel.log
./rt_control_p8_rrtmgp_intel.log
./rt_control_p8_ugwpv1_intel.log
./rt_control_p8.v2.sfc_intel.log
./rt_control_stochy_intel.log
./rt_control_stochy_restart_intel.log
./rt_control_wrtGauss_netcdf_parallel_intel.log
./rt_cpld_2threads_p8_intel.log
./rt_cpld_control_c48_intel.log
./rt_cpld_control_ciceC_p8_intel.log
./rt_cpld_control_gfsv17_intel.log
./rt_cpld_control_noaero_p8_agrid_intel.log
./rt_cpld_control_noaero_p8_intel.log
./rt_cpld_control_nowave_noaero_p8_intel.log
./rt_cpld_control_p8_faster_intel.log
./rt_cpld_control_p8_intel.log
./rt_cpld_control_p8_mixedmode_intel.log
./rt_cpld_control_p8.v2.sfc_intel.log
./rt_cpld_control_pdlib_p8_intel.log
./rt_cpld_control_qr_p8_intel.log
./rt_cpld_debug_gfsv17_intel.log
./rt_cpld_debug_pdlib_p8_intel.log
./rt_cpld_decomp_p8_intel.log
./rt_cpld_mpi_gfsv17_intel.log
./rt_cpld_mpi_p8_intel.log
./rt_cpld_mpi_pdlib_p8_intel.log
./rt_cpld_restart_gfsv17_intel.log
./rt_cpld_restart_p8_intel.log
./rt_cpld_restart_pdlib_p8_intel.log
./rt_cpld_restart_qr_p8_intel.log
./rt_cpld_s2sa_p8_intel.log
./rt_merra2_thompson_intel.log
./rt_regional_2threads_intel.log
./rt_regional_control_intel.log
./rt_regional_decomp_intel.log
./rt_regional_spp_sppt_shum_skeb_intel.log

Hang-Lei-NOAA commented 3 weeks ago

Okay, let's push forward. I tested the special case and one aerosol case last night. They are fine. We will do full testing once it is temporarily set up on WCOSS2.

JessicaMeixner-NOAA commented 2 weeks ago

@Hang-Lei-NOAA Do you have an estimate of when this might be resolved? I'm asking in the context of efforts to update the global-workflow: https://github.com/NOAA-EMC/global-workflow/pull/2505, and I'm trying to figure out the fastest path to updating the model in the global-workflow. Currently the workflow cannot update because HDF5 usage with CICE means you cannot use linked files. I confirmed that the behavior is the same with hdf5 on Hera. While there are plans to move away from linked files in the global-workflow, that will take some time. So I'm curious whether this will be available relatively soon.

Hang-Lei-NOAA commented 2 weeks ago

@JessicaMeixner-NOAA After modifying netcdf, pio, and ESMF (rebuilt with the new netcdf), we delivered the builds and have been working closely with GDIT. When I last checked, they said it will be ready on WCOSS2 Cactus for testing this Thursday. It has moved very fast. These updates are already available on Acorn, so you can start testing there.

JessicaMeixner-NOAA commented 2 weeks ago

Thanks for the information @Hang-Lei-NOAA

Hang-Lei-NOAA commented 2 weeks ago

@BrianCurtis-NOAA @junwang-noaa The lib-C series is available on Cactus for testing. Please fully test it as soon as possible.

JessicaMeixner-NOAA commented 2 weeks ago

@Hang-Lei-NOAA apologies if I missed this information elsewhere, but can you share where exactly this new module file is on Cactus for testing?

BrianCurtis-NOAA commented 2 weeks ago

I have a modulefile I'm testing. I'll pass it along if all goes well.

DeniseWorthen commented 2 weeks ago

@BrianCurtis-NOAA I think it would be worthwhile to be able to confirm that the G-W, using linked files, is functional. I presume that is the testing that @JessicaMeixner-NOAA could do in parallel with yours.

Hang-Lei-NOAA commented 2 weeks ago

@JessicaMeixner-NOAA It is on prod. It is best to follow Brian's test, since he is testing the whole UFS. Just log in to the system and you will see:

@.:~> module load PrgEnv-intel
@.:~> module load craype
@.:~> module load intel
@.:~> module load cray-mpich
@.:~> module ava


WCOSS2 Intel Compiled MPI Libraries and Tools

adcirc/v55.10      esmf/7.1.0r     fms-C/2023.04    hdf5-C/1.14.0             ncdiag-A/1.1.2    nemsio/2.5.2      netcdf/4.7.4 (D)  pio/2.5.10        upp/8.3.0
adcirc/v55.12 (D)  esmf/8.0.1      fms/2022.03      hdf5/1.10.6 (D)           ncdiag/1.0.0      nemsio/2.5.4 (D)  netcdf/4.9.0      pnetcdf-C/1.12.2  upp/10.0.8 (D)
cdo/1.9.8 (D)      esmf/8.1.0      fms/2022.04 (D)  hdf5/1.12.2               ncdiag/1.1.1 (D)  nemsiogfs/2.5.3   pio-A/2.5.10      pnetcdf/1.12.2    w3emc/2.7.3
esmf-A/8.4.2       esmf/8.1.1 (D)  fms/2023.02.01   mapl-A/2.35.2-esmf-8.4.2  ncio-A/1.1.2      netcdf-A/4.9.2    pio-B/2.5.10      schism/5.11.0     wgrib2/2.0.8_mpi
esmf-B/8.5.0       esmf/8.4.1      hdf5-A/1.14.0    mapl-B/2.40.3             ncio/1.0.0        netcdf-B/4.9.2    pio-C/2.5.10      scotch/7.0.4      wrf_io/1.1.1
esmf-C/8.6.0       fms-A/2023.01   hdf5-B/1.14.0    mapl-C/2.40.3             ncio/1.1.2 (D)    netcdf-C/4.9.2    pio/2.5.3 (D)     upp/8.2.0         wrf_io/1.2.0 (D)

BrianCurtis-NOAA commented 2 weeks ago

Here's what I am using for testing.

/lfs/h2/emc/nems/noscrub/brian.curtis/git/BrianCurtis-NOAA/ufs-weather-model/c-libs/modulefiles/ufs_wcoss2.intel.lua

JessicaMeixner-NOAA commented 2 weeks ago

Thanks @BrianCurtis-NOAA @DeniseWorthen and @Hang-Lei-NOAA.

I will test in the g-w this afternoon using the modules from @BrianCurtis-NOAA and will report back how this goes.

BrianCurtis-NOAA commented 2 weeks ago

This is what I've got from my UFSWM testing. This also includes FMS 2023.04 and ESMF 8.6.0, with MAPL built against that ESMF.

in compile_atml_debug_intel:

/lfs/h2/emc/nems/noscrub/brian.curtis/git/BrianCurtis-NOAA/ufs-weather-model/c-libs/FV3/ccpp/physics/physics/Interstitials/UFS_SCM_NEPTUNE/gcycle.F90(236): error #8284: If the actual argument is scalar, the dummy argument shall be scalar unless the actual argument is of type character or is an element of an array that is not assumed shape, pointer, or polymorphic.   [SIG1T]
      CALL SFCCYCLE (9998, npts, max(lsoil,lsoil_lsm), sig1t, fhcyc, &
-----------^

in cpld_control_gfsv17_iau_intel:

Comparing history/iceh_06h.2021-03-23-43200.nc .....USING NCCMP......NOT IDENTICAL

cpld_restart_pdlib_p8 intel (finished but interrupted?)
control_p8_atmlnd_sbs intel ((wallclock) failed to complete run)
control_p8_atmlnd intel ((wallclock) failed to complete run)
control_restart_p8_atmlnd intel (compare test failed, not run)
control_p8_atmlnd_debug intel (compile failure, not run)

@DeniseWorthen I believe the iceh file not reproducing is correct because it switched to pnetcdf this time, correct?

I've seen the (finished but interrupted) issue before, but it's intermittent and not easily reproduced. Rerunning is usually successful.

@junwang-noaa should the p8 atmlnd (& sbs) tests be running out of wallclock? It almost seems like they hung somewhere and hit wallclock, rather than just not being able to complete in time. I recall a hang issue we've seen before, but I'm unsure whether it is even remotely related.

uturuncoglu commented 2 weeks ago

@BrianCurtis-NOAA Just let me know if you need anything from my side. We did not have wall clock issues with the land tests in the past, right? So I am not sure why they have an issue now. Is this on a particular platform? I could also play with those tests and reduce the simulation length or I/O if needed.

DeniseWorthen commented 2 weeks ago

@BrianCurtis-NOAA I would not be surprised if the history file was different. Does the nccmp log give any information about the difference? If you want to place the baseline file and the new run file on hera so I can use cprnc, I can do that.

DeniseWorthen commented 2 weeks ago

I can see in the global attributes that the file was created w/ io_pio2 pnetcdf2 vs the previous io_pio2 hdf5, so it is now able to use pnetcdf2. But cprnc and nccmp -d -S -q -f -g -B --Attribute=checksum --warn=format show no differences in the two files you provided, so I'm not sure why the comparison failed in the RT.

DeniseWorthen commented 2 weeks ago

@BrianCurtis-NOAA These are the stats for the last commit and the lnd tests. Do you see indications we're near wallclock on WCOSS2?

https://github.com/ufs-community/ufs-weather-model/blob/26cb9e60479958c4109a826c834333dcdd728a92/tests/logs/RegressionTests_wcoss2.log#L247-L253

BrianCurtis-NOAA commented 2 weeks ago

It seemed more like a hang than an actual test issue, but I didn't look into it further. I don't know why a test took 55:38 without failing; we should look into that later.

Hang-Lei-NOAA commented 2 weeks ago

@BrianCurtis-NOAA could we first check the results of those sensitive tests for this set?

JessicaMeixner-NOAA commented 2 weeks ago

The global-workflow test worked for a C48 S2SW forecast only case on Cactus. I forgot to update the model to the top of develop so it's April 17 + the module file @BrianCurtis-NOAA pointed to.

In case anyone is curious:
g-w clone: /lfs/h2/emc/couple/noscrub/jessica.meixner/testgw/global-workflow
COM: /lfs/h2/emc/couple/noscrub/jessica.meixner/testgw/test02/COMROOT

I can update the model to the top of develop (or a branch if someone points me to it) again tomorrow morning if that would be informative.

aerorahul commented 2 weeks ago

Thanks for the confirmation @JessicaMeixner-NOAA

DeniseWorthen commented 1 week ago

@BrianCurtis-NOAA Were you able to complete a full RT using the libraries? If yes, then we need a PR to use those libraries and remove the WCOSS2 specification in default_vars.

BrianCurtis-NOAA commented 1 week ago

I am not able to get the full RT suite to pass. I'm still looking into it.

DeniseWorthen commented 1 week ago

Your module file has an update for esmf/8.6.0, correct? It may be there is a feature in the lnd component model which is impacted by that? @uturuncoglu , what do you think?

BrianCurtis-NOAA commented 1 week ago

Yes the way that NCO did the C-libs includes ESMF 8.6.0 and FMS 2023.04

DeniseWorthen commented 1 week ago

OK, I'm running the atml tests on hera now using SS1.6 which also uses those two. If they also fail on Hera, I think we can turn off the tests to be able to move forward on WCOSS2 w/ the fix for the G-W.

DeniseWorthen commented 1 week ago

The atml tests run fine on Hera w/ 8.6 and 2023.04

DeniseWorthen commented 1 week ago

There is an ERROR in the ESMF log file

20240503 142643.018 WARNING          PET150 ESMCI_PIO_Handler.C:1404 ESMCI::PIO_Handler::openOneTileF  Unable to open existing file: INPUT/oro_data.tile1.nc, (PIO/PNetCDF error = NetCDF: Attempt to use feature that was not turned on when netCDF was built.)

DeniseWorthen commented 1 week ago

I don't understand the input files. They should be "v2", which means that they have a global attribute

// global attributes:
        :file_version = "V2" ;

But I don't see that attribute either in my test hera run or in your run on WCOSS2.

DeniseWorthen commented 1 week ago

It looks like there may be an extraneous copy of sfc_data into the INPUT run directory. I see the initial copy (from the v2 directory) and then possibly a later cp from the non-v2 directory.

I'm not sure this is at all related to the hang that is occurring on WCOSS2, however.

UPDATE: I manually copied the V2 files into a run directory and the test still fails w/ the PET log message above.
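To check quickly which input files trigger this message across many PET logs, a minimal Python sketch along these lines might help. The regular expression is an assumption derived only from the log lines quoted in this thread, and `unwrap`, `failed_opens`, and `SAMPLE` are hypothetical names for illustration, not part of ESMF or the UFS test tooling.

```python
import re

# Pattern inferred from the PET log lines quoted in this thread (assumption);
# other ESMF versions may word or lay out the message differently.
OPEN_FAIL = re.compile(
    r"(?P<level>WARNING|ERROR)\s+(?P<pet>PET\d+)\s+.*?"
    r"Unable to open existing file: (?P<path>[^,]+), "
    r"\((?P<reason>.*)\)\s*$"
)


def unwrap(text):
    """Rejoin lines that were wrapped with a trailing backslash."""
    return re.sub(r"\\\n", "", text)


def failed_opens(text):
    """Return (pet, path, reason) for each 'Unable to open existing file' line."""
    hits = []
    for line in unwrap(text).splitlines():
        match = OPEN_FAIL.search(line)
        if match:
            hits.append((match["pet"], match["path"], match["reason"]))
    return hits


# Two lines copied from the log excerpt in this thread, including the wrap.
SAMPLE = (
    "20240503 162628.220 WARNING          PET150 ESMCI_PIO_Handler.C:1404 "
    "ESMCI::PIO_Handler::openOneTileF  Unable to open existing file: "
    "INPUT/oro_data.tile1.nc, (PIO/PNetCDF error =\\\n"
    " NetCDF: Attempt to use feature that was not turned on when netCDF was built.)\n"
    "20240503 162628.268 ERROR            PET150 ESMCI_PIO_Handler.C:617 "
    "ESMCI::PIO_Handler::arrayReadOne Unable to read from file  - file not open\n"
)

if __name__ == "__main__":
    for pet, path, reason in failed_opens(SAMPLE):
        print(pet, path, reason)
```

Only the "Unable to open existing file" lines are collected; the follow-on "file not open" ERROR is deliberately ignored because it carries no file name.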

DeniseWorthen commented 1 week ago

@uturuncoglu @barlage

The land component tests are failing on WCOSS2 after updating the lib to enable pnetcdf and also ESMF 8.6.0. The log message is:

20240503 162628.213 INFO             PET150 (lnd_comp_io): (read_tiled_file) adding land_frac to FB
20240503 162628.220 WARNING          PET150 ESMCI_PIO_Handler.C:1404 ESMCI::PIO_Handler::openOneTileF  Unable to open existing file: INPUT/oro_data.tile1.nc, (PIO/PNetCDF error = NetCDF: Attempt to use feature that was not turned on when netCDF was built.)
20240503 162628.224 WARNING          PET150 ESMCI_PIO_Handler.C:1404 ESMCI::PIO_Handler::openOneTileF  Unable to open existing file: INPUT/oro_data.tile2.nc, (PIO/PNetCDF error = NetCDF: Attempt to use feature that was not turned on when netCDF was built.)
20240503 162628.251 WARNING          PET150 ESMCI_PIO_Handler.C:1404 ESMCI::PIO_Handler::openOneTileF  Unable to open existing file: INPUT/oro_data.tile3.nc, (PIO/PNetCDF error = NetCDF: Attempt to use feature that was not turned on when netCDF was built.)
20240503 162628.259 WARNING          PET150 ESMCI_PIO_Handler.C:1404 ESMCI::PIO_Handler::openOneTileF  Unable to open existing file: INPUT/oro_data.tile4.nc, (PIO/PNetCDF error = NetCDF: Attempt to use feature that was not turned on when netCDF was built.)
20240503 162628.266 WARNING          PET150 ESMCI_PIO_Handler.C:1404 ESMCI::PIO_Handler::openOneTileF  Unable to open existing file: INPUT/oro_data.tile5.nc, (PIO/PNetCDF error = NetCDF: Attempt to use feature that was not turned on when netCDF was built.)
20240503 162628.266 WARNING          PET150 ESMCI_PIO_Handler.C:1404 ESMCI::PIO_Handler::openOneTileF  Unable to open existing file: INPUT/oro_data.tile6.nc, (PIO/PNetCDF error = NetCDF: Attempt to use feature that was not turned on when netCDF was built.)
20240503 162628.268 ERROR            PET150 ESMCI_PIO_Handler.C:617 ESMCI::PIO_Handler::arrayReadOne Unable to read from file  - file not open

uturuncoglu commented 1 week ago

@DeniseWorthen Which test is this? I am testing the land component through CI. It is coupled with a data atmosphere configuration and uses the esmf develop branch, so I am not expecting any issue on the ESMF side. Maybe there is something in the ESMF installation. It seems that the library is complaining with "(PIO/PNetCDF error = NetCDF: Attempt to use feature that was not turned on when netCDF was built.)". Is there anything different on this platform in terms of the ESMF installation?

uturuncoglu commented 1 week ago

@billsacks might have some idea.

BrianCurtis-NOAA commented 1 week ago

"NetCDF: Attempt to use feature that was not turned on when netCDF was built."

What feature wasn't built into netCDF? Once we know, we can talk to NCO and get that adjusted.
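One way to see which optional features a given netCDF-C build reports is its `nc-config` utility, which supports `--all` and per-feature queries such as `--has-pnetcdf`. The Python sketch below parses the `--has-*` lines from `nc-config --all`. The layout it expects (`--has-name -> yes|no`) is an assumption about that output, the sample text is illustrative and not taken from WCOSS2, and `parse_features` and `netcdf_features` are hypothetical helper names.

```python
import re
import subprocess

# Assumed layout of the feature lines printed by `nc-config --all`:
#   --has-pnetcdf   -> no
FEATURE = re.compile(r"^[ \t]*--has-([\w-]+)[ \t]*->[ \t]*(yes|no)[ \t]*$", re.MULTILINE)


def parse_features(text):
    """Map each reported '--has-<name>' feature to True or False."""
    return {name: value == "yes" for name, value in FEATURE.findall(text)}


def netcdf_features(nc_config="nc-config"):
    """Run nc-config from the currently loaded netcdf module and parse it."""
    result = subprocess.run(
        [nc_config, "--all"], capture_output=True, text=True, check=True
    )
    return parse_features(result.stdout)


# Illustrative text only; NOT the output of any WCOSS2 or Acorn build.
SAMPLE_ALL = """
This netCDF 4.9.2 has been built with the following features:

  --cc            -> cc
  --has-hdf5      -> yes
  --has-pnetcdf   -> no
  --has-parallel4 -> yes
"""

if __name__ == "__main__":
    for name, enabled in sorted(parse_features(SAMPLE_ALL).items()):
        print(f"{name}: {'yes' if enabled else 'no'}")
```

Running `netcdf_features()` once with the default modules loaded and once with the -C modules loaded, then comparing the two dictionaries, would show which switches differ between the builds. It does not by itself say how PIO or ESMF were configured.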

DeniseWorthen commented 1 week ago

This is WCOSS2, and we're testing w/ 8.6.0. The lib team had to rebuild pio because we were not able to access pnetcdf through it. But now these land tests are failing (the only ones). I was able to build and run these same tests using 8.6 on Hera, so I don't think that is the issue. I suspect the issue is in the new library build, but I don't know what exactly. I've seen this exact message from PIO before, but that was when I was trying to read a netcdf-classic format file w/ netcdf4.
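Since the earlier occurrence of this message involved a mismatch between file format and I/O type, it may be worth confirming the on-disk format of the oro_data and sfc_data inputs. PnetCDF handles only the classic family (CDF-1, CDF-2, CDF-5), while netCDF-4 files are HDF5 containers. The sketch below identifies the format from the leading bytes of a file; `classify` and `netcdf_kind` are hypothetical helper names, and it assumes the HDF5 signature sits at offset 0, which is the usual case.

```python
# Standard file signatures: the classic formats start with "CDF" plus a
# version byte; netCDF-4 files start with the 8-byte HDF5 signature.
HDF5_SIGNATURE = b"\x89HDF\r\n\x1a\n"
CLASSIC_VERSIONS = {
    1: "classic (CDF-1)",
    2: "64-bit offset (CDF-2)",
    5: "64-bit data (CDF-5)",
}


def classify(head):
    """Classify a file from its first bytes."""
    if head.startswith(HDF5_SIGNATURE):
        return "netCDF-4/HDF5"
    if len(head) >= 4 and head[:3] == b"CDF":
        return CLASSIC_VERSIONS.get(head[3], "unknown")
    return "unknown"


def netcdf_kind(path):
    """Read the first 8 bytes of the file at `path` and classify them."""
    with open(path, "rb") as handle:
        return classify(handle.read(8))


if __name__ == "__main__":
    import sys

    # Example: python check_format.py INPUT/oro_data.tile1.nc
    for name in sys.argv[1:]:
        print(f"{name}: {netcdf_kind(name)}")
```

If the six oro_data tiles turn out to be netCDF-4/HDF5 while the reader is asking PIO for the pnetcdf I/O type, that would be consistent with the message above, though the thread has not established that this is the cause here.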

DeniseWorthen commented 1 week ago

@BrianCurtis-NOAA See this comment https://github.com/JCSDA/spack-stack/issues/991#issuecomment-1932988544

Hang-Lei-NOAA commented 1 week ago

@BrianCurtis-NOAA Have you ever seen this error when using the libraries that were built for you on WCOSS2, with the specific test that Denise ran?

BrianCurtis-NOAA commented 1 week ago

Yes, I re-tested with those and had the same error. There was some work going on with Acorn so it might be why this was missed.

Hang-Lei-NOAA commented 1 week ago

@BrianCurtis-NOAA This is Bongi's build. I remember that you confirmed the set that I installed on Dogwood. Did your testing include this specific test? Please also let us know which specific test it is and where your testbed is. I will repeat it with my builds on Cactus.