BrianCurtis-NOAA opened 1 month ago
@HangLei-NOAA would you please check the library on wcoss2 and install a test version of netcdf with pio on acorn for us to test? Thanks
Apologies for not posting here sooner, but there is an install Hang has made at: /lfs/h2/emc/eib/save/Hang.Lei/forgdit/nco_wcoss2/install
@BrianCurtis-NOAA Are you testing Hang's install?
Yes. I will do more today once I get today's PR started.
using the updated libraries: /lfs/h2/emc/ptmp/brian.curtis/FV3_RT/rt_63569/
MPICH ERROR [Rank 170] [job id 0882c1cf-0967-46e2-88a1-63f55d8cd95f] [Wed Apr 17 17:48:46 2024] [nid001019] - Abort(128) (rank 170 in comm 0): application called MPI_Abort(MPI_COMM_WORLD, 128) - process 170
(abort_ice)ABORTED:
(abort_ice) called from ice_pio.F90
(abort_ice) line number 223
(abort_ice) error =
(ice_pio_check)Unknown Error, (ice_pio_init) ERROR: Failed to create file ./history/iceh_ic.2021-03-22-21600.nc
@BrianCurtis-NOAA Please let me know if a specific version of the UFS is being used for the testing. I just finished the GSI library task and will start the UFS test. If I still get the failure, I will install a new pnetcdf library; currently, we are using the system-installed pnetcdf library.
@Hang-Lei-NOAA the develop branch of ufs-weather-model has the issue, use:
./rt.sh -a <ACCNR> -n "cpld_control_gfsv17 intel"
but first remove (or comment out): https://github.com/ufs-community/ufs-weather-model/blob/47c00995706380f9827a7e98bd242ffc6e4a7f01/tests/default_vars.sh#L781-L784
@BrianCurtis-NOAA I have checked, and fixed this by rebuilding netcdf/4.9.2 with pnetcdf. /lfs/h2/emc/eib/noscrub/Hang.Lei/works/brianufs/tests/logs/log_wcoss2/run_cpld_control_gfsv17_intel.log
Please use: /lfs/h2/emc/eib/noscrub/Hang.Lei/works/brianufs/modulefiles/ufs_wcoss2.intel.lua
Please copy it soon; this afternoon after 3pm I will run more sensitivity tests on the UFS using the system-installed libs. Thanks.
@Hang-Lei-NOAA I can confirm that your lua file works for that test. Please proceed with getting these adjustments made on WCOSS2 Dev.
@junwang-noaa @BrianCurtis-NOAA Bongi has set up an installation on the Acorn systems; please load the intel environment. For now, here is what is deployed to production on Acorn:
$ module -t avail 2>&1 | grep -- "-C/."
esmf-C/8.6.0
fms-C/2023.04
hdf5-C/1.14.0
mapl-C/2.40.3
netcdf-C/4.9.2
pio-C/2.5.10
pnetcdf-C/1.12.2
Please test them and let me know if any issues are found. Thanks
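For anyone scripting the same check, the grep above can be sketched in Python (this assumes `module -t avail` prints one entry per line, as terse mode does; the module names in the sample are just the ones listed above):

```python
def c_series_modules(module_avail_output: str) -> list[str]:
    """Mimic `module -t avail 2>&1 | grep -- "-C/."`: keep only the -C series entries."""
    return [line.strip()
            for line in module_avail_output.splitlines()
            if "-C/" in line]

# Entries taken from the Acorn listing above.
sample = "esmf-C/8.6.0\nfms-C/2023.04\nnetcdf/4.9.2\npio-C/2.5.10"
print(c_series_modules(sample))  # ['esmf-C/8.6.0', 'fms-C/2023.04', 'pio-C/2.5.10']
```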
I've been able to load those modules and build/compile a test case OK. I am running the full suite now on Acorn using WCOSS2 setup. I will pass along the results as soon as I can.
@Hang-Lei-NOAA Bongi needed Acorn for other things today, so I was only able to run a subset of tests with the -C libraries, but they included tests for cpld, control, regional, 2threads, mpi, restarts, p8, gfsv17, and decomp (the problem case from before), all with success (PASS). I'm comfortable saying the -C libraries are OK to use on WCOSS2.
brian.curtis@alogin03:/lfs/h1/emc/nems/noscrub/brian.curtis/git/ufs-community/ufs-weather-model/tests/logs/log_acorn> grep -ril PASS ./rt*.log
./rt_control_2threads_p8_intel.log
./rt_control_c192_intel.log
./rt_control_c384gdas_intel.log
./rt_control_c384_intel.log
./rt_control_c48_intel.log
./rt_control_c48.v2.sfc_intel.log
./rt_control_CubedSphereGrid_intel.log
./rt_control_CubedSphereGrid_parallel_intel.log
./rt_control_decomp_p8_intel.log
./rt_control_flake_intel.log
./rt_control_iovr4_intel.log
./rt_control_iovr5_intel.log
./rt_control_latlon_intel.log
./rt_control_lndp_intel.log
./rt_control_noqr_p8_intel.log
./rt_control_p8_lndp_intel.log
./rt_control_p8_mynn_intel.log
./rt_control_p8_rrtmgp_intel.log
./rt_control_p8_ugwpv1_intel.log
./rt_control_p8.v2.sfc_intel.log
./rt_control_stochy_intel.log
./rt_control_stochy_restart_intel.log
./rt_control_wrtGauss_netcdf_parallel_intel.log
./rt_cpld_2threads_p8_intel.log
./rt_cpld_control_c48_intel.log
./rt_cpld_control_ciceC_p8_intel.log
./rt_cpld_control_gfsv17_intel.log
./rt_cpld_control_noaero_p8_agrid_intel.log
./rt_cpld_control_noaero_p8_intel.log
./rt_cpld_control_nowave_noaero_p8_intel.log
./rt_cpld_control_p8_faster_intel.log
./rt_cpld_control_p8_intel.log
./rt_cpld_control_p8_mixedmode_intel.log
./rt_cpld_control_p8.v2.sfc_intel.log
./rt_cpld_control_pdlib_p8_intel.log
./rt_cpld_control_qr_p8_intel.log
./rt_cpld_debug_gfsv17_intel.log
./rt_cpld_debug_pdlib_p8_intel.log
./rt_cpld_decomp_p8_intel.log
./rt_cpld_mpi_gfsv17_intel.log
./rt_cpld_mpi_p8_intel.log
./rt_cpld_mpi_pdlib_p8_intel.log
./rt_cpld_restart_gfsv17_intel.log
./rt_cpld_restart_p8_intel.log
./rt_cpld_restart_pdlib_p8_intel.log
./rt_cpld_restart_qr_p8_intel.log
./rt_cpld_s2sa_p8_intel.log
./rt_merra2_thompson_intel.log
./rt_regional_2threads_intel.log
./rt_regional_control_intel.log
./rt_regional_decomp_intel.log
./rt_regional_spp_sppt_shum_skeb_intel.log
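The `grep -ril PASS ./rt*.log` check above can be sketched in Python where post-processing the RT logs is more convenient. This is only an illustration, not part of rt.sh, and the log directory path is whatever a given run produced:

```python
from pathlib import Path

def logs_containing_pass(log_dir: str) -> list[str]:
    """Mimic `grep -ril PASS ./rt*.log`: case-insensitive match, filenames only."""
    hits = []
    for log in sorted(Path(log_dir).glob("rt*.log")):
        if "pass" in log.read_text(errors="replace").lower():
            hits.append(log.name)
    return hits

# e.g. logs_containing_pass("tests/logs/log_acorn")
```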
Okay, let's push forward. I tested the special case and one aerosol case last night; they are fine. We will do full testing once it is set up on wcoss2.
@Hang-Lei-NOAA Do you have an estimate of when you think this might be resolved? I'm asking in the context of efforts to update the global-workflow: https://github.com/NOAA-EMC/global-workflow/pull/2505 and trying to figure out what the fastest path to updating the model in the global-workflow is. Currently the workflow cannot update because HDF5 usage with CICE means you cannot use linked files. I confirmed the same behavior with hdf5 on Hera as well. While there are plans to move away from linked files in the global-workflow, that will take some time. So I'm curious if this will be available relatively soon.
@JessicaMeixner-NOAA We modified the netcdf/pio/ESMF stack, delivered it, and are working closely with GDIT. As of my most recent check, they said it will be ready on wcoss2 Cactus for testing this Thursday. It has been very fast. These updates are already available on Acorn, so you can start testing there.
Thanks for the information @Hang-Lei-NOAA
@BrianCurtis-NOAA @junwang-noaa The lib-C series is available on Cactus for testing. Please test it fully as soon as possible.
@Hang-Lei-NOAA apologies if I missed this information elsewhere, but can you share where exactly this new module file is on Cactus for testing?
I have a modulefile I'm testing; I'll pass it along if all goes well.
@BrianCurtis-NOAA I think it would be worthwhile to be able to confirm that the G-W, using linked files, is functional. I presume that is the testing that @JessicaMeixner-NOAA could do in parallel with yours.
@JessicaMeixner-NOAA It is on prod. It is best to follow Brian's test; he is testing the whole UFS. Just log in to the system and you will see:
module load PrgEnv-intel
module load craype
module load intel
module load cray-mpich
module avail
adcirc/v55.10 esmf/7.1.0r fms-C/2023.04 hdf5-C/1.14.0 ncdiag-A/1.1.2 nemsio/2.5.2 netcdf/4.7.4 (D) pio/2.5.10 upp/8.3.0 adcirc/v55.12 (D) esmf/8.0.1 fms/2022.03 hdf5/1.10.6 (D) ncdiag/1.0.0 nemsio/2.5.4 (D) netcdf/4.9.0 pnetcdf-C/1.12.2 upp/10.0.8 (D) cdo/1.9.8 (D) esmf/8.1.0 fms/2022.04 (D) hdf5/1.12.2 ncdiag/1.1.1 (D) nemsiogfs/2.5.3 pio-A/2.5.10 pnetcdf/1.12.2 w3emc/2.7.3 esmf-A/8.4.2 esmf/8.1.1 (D) fms/2023.02.01 mapl-A/2.35.2-esmf-8.4.2 ncio-A/1.1.2 netcdf-A/4.9.2 pio-B/2.5.10 schism/5.11.0 wgrib2/2.0.8_mpi esmf-B/8.5.0 esmf/8.4.1 hdf5-A/1.14.0 mapl-B/2.40.3 ncio/1.0.0 netcdf-B/4.9.2 pio-C/2.5.10 scotch/7.0.4 wrf_io/1.1.1 esmf-C/8.6.0 fms-A/2023.01 hdf5-B/1.14.0 mapl-C/2.40.3 ncio/1.1.2 (D) netcdf-C/4.9.2 pio/2.5.3 (D) upp/8.2.0 wrf_io/1.2.0 (D)
Here's what I am using for testing.
/lfs/h2/emc/nems/noscrub/brian.curtis/git/BrianCurtis-NOAA/ufs-weather-model/c-libs/modulefiles/ufs_wcoss2.intel.lua
Thanks @BrianCurtis-NOAA @DeniseWorthen and @Hang-Lei-NOAA.
I will test in the g-w this afternoon using the modules from @BrianCurtis-NOAA and will report back how this goes.
This is what I've got from my UFSWM testing; this also includes FMS 2023.04 and ESMF 8.6.0, with MAPL built against it.
in compile_atml_debug_intel:
/lfs/h2/emc/nems/noscrub/brian.curtis/git/BrianCurtis-NOAA/ufs-weather-model/c-libs/FV3/ccpp/physics/physics/Interstitials/UFS_SCM_NEPTUNE/gcycle.F90(236): error #8284: If the actual argument is scalar, the dummy argument shall be scalar unless the actual argument is of type character or is an element of an array that is not assumed shape, pointer, or polymorphic. [SIG1T]
CALL SFCCYCLE (9998, npts, max(lsoil,lsoil_lsm), sig1t, fhcyc, &
-----------^
in cpld_control_gfsv17_iau_intel:
Comparing history/iceh_06h.2021-03-23-43200.nc .....USING NCCMP......NOT IDENTICAL
cpld_restart_pdlib_p8 intel (finished but interrupted?)
control_p8_atmlnd_sbs intel ((wallclock) failed to complete run)
control_p8_atmlnd intel ((wallclock) failed to complete run)
control_restart_p8_atmlnd intel (compare test failed, not run)
control_p8_atmlnd_debug intel (compile failure, not run)
@DeniseWorthen I believe the iceh file not reproducing is expected because it switched to pnetcdf this time, correct?
I've seen the (finished but interrupted) issue before, but it's intermittent and not easily reproduced; rerunning is usually successful.
@junwang-noaa should the p8 atmlnd (& sbs) tests be running out of wallclock? It almost seems like it hung somewhere and hit wallclock vs just not being able to complete in time. I recall some hang issue we've seen before, but I'm unsure if it would be remotely related.
@BrianCurtis-NOAA Just let me know if you need anything from my side. We were not having issues with wall clock time in the land tests in the past, right? So I am not sure why they have issues now. Is this particular to one platform? I could also rework those tests and reduce the simulation length or I/O if we need.
@BrianCurtis-NOAA I would not be surprised if the history file was different. Does the nccmp log give any information about the difference? If you want to place the baseline file and the new run file on hera so I can use cprnc, I can do that.
I can see in the global attributes that the file was created w/ io_pio2 pnetcdf2 vs the previous io_pio2 hdf5, so it is now able to use pnetcdf2. But cprnc and nccmp -d -S -q -f -g -B --Attribute=checksum --warn=format show no differences in the two files you provided, so I'm not sure why the comparison failed in the RT.
@BrianCurtis-NOAA These are the stats for the last commit and the lnd tests. Do you see indications we're near wallclock on WCOSS2?
It seemed more like a hang than an actual test issue, but I didn't look into it further. I don't know why a test took 55:38 without failing; we should look into that at a later time.
@BrianCurtis-NOAA could we first check the results of those sensitivity tests for this set?
The global-workflow test worked for a C48 S2SW forecast only case on Cactus. I forgot to update the model to the top of develop so it's April 17 + the module file @BrianCurtis-NOAA pointed to.
In case anyone is curious g-w clone: /lfs/h2/emc/couple/noscrub/jessica.meixner/testgw/global-workflow COM: /lfs/h2/emc/couple/noscrub/jessica.meixner/testgw/test02/COMROOT
I can update the model to the top of develop (or a branch if someone points me to it) again tomorrow morning if that would be informative.
Thanks for the confirmation @JessicaMeixner-NOAA
@BrianCurtis-NOAA Were you able to complete a full RT using the libraries? If yes, then we need a PR to use those libraries and remove the WCOSS2 specification in default_vars.
I am not able to get the full RT suite to pass. I'm still looking into it.
Your module file has an update for esmf/8.6.0, correct? Could there be a feature in the lnd component model that is impacted by that? @uturuncoglu, what do you think?
Yes, the way that NCO did the C-libs includes ESMF 8.6.0 and FMS 2023.04.
OK, I'm running the atml tests on hera now using SS1.6 which also uses those two. If they also fail on Hera, I think we can turn off the tests to be able to move forward on WCOSS2 w/ the fix for the G-W.
The atml tests run fine on Hera w/ 8.6 and 2023.04
There is an ERROR in the ESMF log file
20240503 142643.018 WARNING PET150 ESMCI_PIO_Handler.C:1404 ESMCI::PIO_Handler::openOneTileF Unable to open existing file: INPUT/oro_data.tile1.nc, (PIO/PNetCDF error = NetCDF: Attempt to use feature that was not turned on when netCDF was built.)
I don't understand the input files. They should be "v2", which means that they have a global attribute
// global attributes:
:file_version = "V2" ;
But I don't see that attribute either in my test hera run or in your run on WCOSS2.
It looks like there may be an extraneous copy of sfc_data into the INPUT run-directory. I see the initial copy (from the v2 directory) and then possibly a later cp from the non-v2 directory.
I'm not sure this is at all related to the hang that is occurring on WCOSS2 however.
UPDATE: I manually copied in the V2 files to a run directory and the test still fails w/ the PET log message above
@uturuncoglu @barlage
The land component tests are failing on WCOSS2 after updating the lib to enable pnetcdf and also ESMF 8.6.0. The log message is:
20240503 162628.213 INFO PET150 (lnd_comp_io): (read_tiled_file) adding land_frac to FB
20240503 162628.220 WARNING PET150 ESMCI_PIO_Handler.C:1404 ESMCI::PIO_Handler::openOneTileF Unable to open existing file: INPUT/oro_data.tile1.nc, (PIO/PNetCDF error = NetCDF: Attempt to use feature that was not turned on when netCDF was built.)
20240503 162628.224 WARNING PET150 ESMCI_PIO_Handler.C:1404 ESMCI::PIO_Handler::openOneTileF Unable to open existing file: INPUT/oro_data.tile2.nc, (PIO/PNetCDF error = NetCDF: Attempt to use feature that was not turned on when netCDF was built.)
20240503 162628.251 WARNING PET150 ESMCI_PIO_Handler.C:1404 ESMCI::PIO_Handler::openOneTileF Unable to open existing file: INPUT/oro_data.tile3.nc, (PIO/PNetCDF error = NetCDF: Attempt to use feature that was not turned on when netCDF was built.)
20240503 162628.259 WARNING PET150 ESMCI_PIO_Handler.C:1404 ESMCI::PIO_Handler::openOneTileF Unable to open existing file: INPUT/oro_data.tile4.nc, (PIO/PNetCDF error = NetCDF: Attempt to use feature that was not turned on when netCDF was built.)
20240503 162628.266 WARNING PET150 ESMCI_PIO_Handler.C:1404 ESMCI::PIO_Handler::openOneTileF Unable to open existing file: INPUT/oro_data.tile5.nc, (PIO/PNetCDF error = NetCDF: Attempt to use feature that was not turned on when netCDF was built.)
20240503 162628.266 WARNING PET150 ESMCI_PIO_Handler.C:1404 ESMCI::PIO_Handler::openOneTileF Unable to open existing file: INPUT/oro_data.tile6.nc, (PIO/PNetCDF error = NetCDF: Attempt to use feature that was not turned on when netCDF was built.)
20240503 162628.268 ERROR PET150 ESMCI_PIO_Handler.C:617 ESMCI::PIO_Handler::arrayReadOne Unable to read from file - file not open
@DeniseWorthen Which test is this? I am testing the land component through CI; it is coupled with the data atmosphere configuration and uses the ESMF develop branch, so I am not expecting any issue on the ESMF side. Maybe there is something in the ESMF installation. The build seems to be complaining with (PIO/PNetCDF error = NetCDF: Attempt to use feature that was not turned on when netCDF was built.). Is there anything different on this platform in terms of the ESMF installation?
@billsacks might have some idea.
"NetCDF: Attempt to use feature that was not turned on when netCDF was built."
What feature wasn't built into netCDF? Once we know, we can talk to NCO and get that adjusted.
This is WCOSS2, and we're testing w/ 8.6.0. The lib team had to rebuild pio because we were not able to access pnetcdf through it. But now these land tests are failing (the only ones). I was able to build and run these same tests using 8.6 on Hera, so I don't think that is the issue. I suspect the issue is in the new library build, but I don't know what exactly. I've seen this exact message from PIO before, but that was when I was trying to read a netcdf-classic format file w/ netcdf4.
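For what it's worth, whether a given input file is netcdf-classic or netCDF-4/HDF5 on disk can be checked from its first few bytes alone, without any netCDF tooling. A stdlib-only Python sketch (the path is hypothetical; this only identifies the container format, it does not prove which feature the installed netCDF lacks):

```python
def netcdf_disk_format(path: str) -> str:
    """Classify a netCDF file by its magic bytes.

    Classic files start with b'CDF' followed by a version byte
    (1 = CDF-1, 2 = 64-bit offset, 5 = CDF-5); netCDF-4 files are
    HDF5 containers and start with the 8-byte HDF5 signature.
    """
    with open(path, "rb") as f:
        magic = f.read(8)
    if magic.startswith(b"CDF"):
        version = magic[3]
        return {1: "classic (CDF-1)", 2: "64-bit offset (CDF-2)",
                5: "CDF-5"}.get(version, f"unknown CDF variant {version}")
    if magic == b"\x89HDF\r\n\x1a\n":
        return "netCDF-4 (HDF5)"
    return "not a recognized netCDF file"

# e.g. netcdf_disk_format("INPUT/oro_data.tile1.nc")
```

Running this on the failing oro_data tiles would show whether the files PIO cannot open are HDF5-based, which the PNetCDF backend cannot read.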
@BrianCurtis-NOAA See this comment https://github.com/JCSDA/spack-stack/issues/991#issuecomment-1932988544
@BrianCurtis-NOAA Have you ever seen this error using the libraries that were built for you on wcoss2, with the specific test that Denise ran?
Yes, I re-tested with those and had the same error. There was some work going on with Acorn, which might be why this was missed.
@BrianCurtis-NOAA This is Bongi's build. I remember that you did confirm with the set that I set up on Dogwood. Does your test include this specific test? Please also let us know what the specific test is and where your testbed is. I will repeat it with my builds on Cactus.
Description
PR #2145 brought in a change where CICE switched to use pnetcdf in PIO instead of hdf5. This worked on all machines except WCOSS2.
This leads us to believe that the PIO install on WCOSS2 was not built with proper pnetcdf support.
Efforts are ongoing to determine the specifics of any build differences between spack-stack and the hpc-stack on WCOSS2.
To Reproduce:
Run cpld_control_gfsv17 intel RT with develop branch of UFSWM (From PR #2145 ) on WCOSS2 dev machine
Needs to be resolved alongside this issue.