ufs-community / ufs-weather-model

model crashed at day 91.75 when compiled with S2S #2320

Open ShanSunNOAA opened 3 months ago

ShanSunNOAA commented 3 months ago

Description

The UFS model crashed at time step 13320 with the following error:

341: Assertion failed in file ../../src/mpid/ch4/src/intel/ch4_shm_coll.c at line 2266: comm->shm_numa_layout[my_numa_node].base_addr
...
345: ufs_model.x  0000000005CA5A13  ice_pio_mp_ice_pi        221  ice_pio.F90
345: ufs_model.x  0000000005C85EFE  ice_history_write        173  ice_history_write.F90
345: ufs_model.x  0000000005A1225A  ice_history_mp_ac       4130  ice_history.F90
345: ufs_model.x  0000000005C7B813  cice_runmod_mp_ci        369  CICE_RunMod.F90
345: ufs_model.x  00000000059BED10  ice_comp_nuopc_mp       1179  ice_comp_nuopc.F90
345: ufs_model.x  00000000006A1B08  Unknown              Unknown  Unknown

See details in /scratch2/BMC/gsd-fv3-dev/sun/emcwf_0610/comrot/tst/logs/2015052100/gfsfcst.log.

To Reproduce:

The experiment used the following hash of https://github.com/ufs-community/ufs-weather-model:

commit 5bec704243286421fc613838fc67a2129e96acd6
Author: Brian Curtis <64433609+BrianCurtis-NOAA@users.noreply.github.com>
Date:   Fri May 31 14:52:06 2024 -0400

Additional context

The model was able to run 9 months successfully with an earlier hash in April:

commit 45c8b2ada88e7243b912deeebf4c8d90d4fdc5c5
Author: JONG KIM <jong.kim@noaa.gov>
Date:   Thu Apr 4 19:49:13 2024 -0400

Output

The log containing the error message is available at /scratch2/BMC/gsd-fv3-dev/sun/emcwf_0610/comrot/tst/logs/2015052100/gfsfcst.log. I'd be happy to provide more details if needed. Thanks.

junwang-noaa commented 3 months ago

@DeniseWorthen may I ask if there is any restriction on CICE file names for long forecast times?

DeniseWorthen commented 3 months ago

@junwang-noaa This doesn't seem to be an issue related to filename length. Files are always named as yyyy-mm-dd-secsofday in the model. The line it points to in ice_pio makes me believe this is more likely related to the previous sym-link issue.

Do you have the associated ice_diag.d file? That is the log that ice writes out.

ShanSunNOAA commented 3 months ago

@DeniseWorthen @junwang-noaa Thanks for the quick responses from both of you. I think I found the cause. It is this line in cice_wrapper_mod.F90: write(filename,'(a,i3.3)')'log.ice.f',int(hour), which needs to be (i4.4) to accommodate 4-digit hours. I am testing it now to make sure this is the only change that is needed.

Thanks!
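For context: a Fortran Iw.m edit descriptor fills the output field with asterisks when the value needs more digits than the field width allows, so once the forecast hour passes 999 (day 91.75 is hour 2202) the i3.3 write produces a file literally named log.ice.f***. A minimal standalone demonstration (not model code):

    program fmt_overflow
      implicit none
      character(len=16) :: filename
      write(filename,'(a,i3.3)') 'log.ice.f', 912    ! fits      -> log.ice.f912
      print *, trim(filename)
      write(filename,'(a,i3.3)') 'log.ice.f', 2202   ! overflows -> log.ice.f***
      print *, trim(filename)
    end program fmt_overflow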

junwang-noaa commented 3 months ago

It might be the cause. I saw a file 'log.ice.f***' in the run directory. We may still run into problems for climate runs if using "i4.4". In fv3atm, we use:

        ndig = max(log10(nf_hours+0.5)+1., 3.)
        write(cform, '("(I",I1,".",I1,")")') ndig, ndig
        write(cfhour, cform) nf_hours

to get the number of digits for the forecast hour.
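For illustration, a sketch (untested) of how that digit-count approach might look inside CICE's ufs_logfhour, assuming the hour argument and the kind/length parameters already used in cice_wrapper_mod.F90:

    character(len=char_len) :: filename
    character(len=8)        :: cform, cfhour
    integer(int_kind)       :: ndig

    ! at least 3 digits, growing with the forecast hour (f003 ... f2040 ...)
    ndig = max(int(log10(hour + 0.5_dbl_kind)) + 1, 3)
    write(cform, '("(I",I1,".",I1,")")') ndig, ndig   ! e.g. "(I4.4)"
    write(cfhour, cform) int(hour)
    write(filename,'(a,a)') 'log.ice.f', trim(cfhour)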

ShanSunNOAA commented 3 months ago

@junwang-noaa Thanks for the info. FV3 uses "max" for the forecast hours, so the model won't crash. Can we borrow the same formula to get the forecast hour in CICE?

DeniseWorthen commented 3 months ago

Both ATM and CICE write to a specific file name. In MOM, this was just added as a write to stdout. We should probably fix that. For MOM and CICE, we'd need to "find" the number of hours from the driver clock since currently fhmax is only in model_configure.
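For reference, a rough sketch (untested; variable names are hypothetical) of how the total forecast length in hours could be obtained from the driver clock via ESMF rather than from model_configure:

    use ESMF, only: ESMF_Clock, ESMF_Time, ESMF_TimeInterval, &
                    ESMF_ClockGet, ESMF_TimeIntervalGet

    type(ESMF_Clock)        :: clock          ! driver clock passed into the cap
    type(ESMF_Time)         :: startTime, stopTime
    type(ESMF_TimeInterval) :: runLength
    integer                 :: rc, fhmax

    call ESMF_ClockGet(clock, startTime=startTime, stopTime=stopTime, rc=rc)
    runLength = stopTime - startTime
    call ESMF_TimeIntervalGet(runLength, h=fhmax, rc=rc)   ! forecast length in hours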

DeniseWorthen commented 3 months ago

@ShanSunNOAA Unless you're actually collecting runtime statistics (i.e., the time for each model advance), I would just set RunTimeLog = false in the allcomp part of ufs.configure for now.
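For reference, the setting lives in the ALLCOMP attributes block of ufs.configure; a minimal sketch (other attributes omitted):

    ALLCOMP_attributes::
      RunTimeLog = false
    ::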

ShanSunNOAA commented 3 months ago

@DeniseWorthen Thanks for the info. I cannot find RunTimeLog in the log file generated by the workflow. Where do I set it in the workflow configuration? Thanks!

junwang-noaa commented 3 months ago

@DeniseWorthen In fv3atm, we compute the number of digits from the forecast hour, not fhmax. Can we use "hour" in ufs_logfhour to compute the format of fhour in the log file name? Also, it looks to me that RunTimeLog is not used to control writing of the CICE log file; I see the code in nuopc/cmeps/CICE_RunMod.F90:

         if (my_task == master_task) then
            do ns = 1,nstreams
               if (write_history(ns) .and. histfreq(ns) .eq. 'h') then
                  call calendar_sec2hms(msec,hh,mm,ss)
                  write(logmsg,'(6(i4,2x))')myear,mmonth,mday,hh,mm,ss
                  call ufs_logfhour(trim(logmsg),timesecs/c3600)
               end if
            end do
         end if
      end subroutine ice_step

ShanSunNOAA commented 3 months ago

Unfortunately, after modifying the log.ice filename, the model still crashed at day 91.75, with the following error:

340: ufs_model.x  0000000005B97903  ice_pio_mp_ice_pi        221  ice_pio.F90
340: ufs_model.x  0000000005B77DEE  ice_history_write        173  ice_history_write.F90
340: ufs_model.x  000000000590414A  ice_history_mp_ac       4130  ice_history.F90
340: ufs_model.x  0000000005B6D703  cice_runmod_mp_ci        369  CICE_RunMod.F90
340: ufs_model.x  00000000058B0C00  ice_comp_nuopc_mp       1179  ice_comp_nuopc.F90

See /scratch2/BMC/wrfruc/Shan.Sun/me_wf/comrot/hydro1/logs/2015052100/gfsfcst.log. The ice_diag.d file is available at /scratch2/BMC/wrfruc/Shan.Sun/me_wf/comrot/hydro1/gfs.20150521/00/model_data/ice/history/. The RUNDIRS is available at /scratch1/BMC/gsd-fv3-dev/NCEPDEV/global/Shan.Sun/stmp2/RUNDIRS/hydro1/gfsfcst.2015052100/fcst.1576533/.

Any suggestions on how to run beyond 3 months would be greatly appreciated. How do I turn off RunTimeLog? Thanks.

DeniseWorthen commented 3 months ago

@ShanSunNOAA Sorry for the confusion. Jun is right. The feature for runtime logging as well as the logging to mark the end of the history write was put in at the same time and I didn't think through my answer.

I don't believe CICE has a concept of hours being anything other than 1-24. Let me take a look. In any case, it doesn't seem that is the issue, since your new run failed also. If it is still pointing to the ice_pio line, then I still suspect it has something to do w/ how pio is trying to write the file.

DeniseWorthen commented 3 months ago

@ShanSunNOAA I've been able to reproduce the error using the low-resolution C48 coupled model in the RTs. In this case there are no sym-linked files, which takes one issue out of the mix. It failed at day 86.

73: Assertion failed in file ../../src/mpid/ch4/src/intel/ch4_shm_coll.c at line 2266: comm->shm_numa_layout[my_numa_node].base_addr
73: /apps/oneapi/mpi/2021.5.1/lib/release/libmpi.so.12(MPL_backtrace_show+0x1c) [0x150a7a430bcc]
73: /apps/oneapi/mpi/2021.5.1/lib/release/libmpi.so.12(MPIR_Assert_fail+0x21) [0x150a79e0adf1]
73: /apps/oneapi/mpi/2021.5.1/lib/release/libmpi.so.12(+0x2b1eb9) [0x150a79ad9eb9]
73: /apps/oneapi/mpi/2021.5.1/lib/release/libmpi.so.12(+0x176584) [0x150a7999e584]
73: /apps/oneapi/mpi/2021.5.1/lib/release/libmpi.so.12(+0x17a9f9) [0x150a799a29f9]
73: /apps/oneapi/mpi/2021.5.1/lib/release/libmpi.so.12(+0x199a60) [0x150a799c1a60]
73: /apps/oneapi/mpi/2021.5.1/lib/release/libmpi.so.12(+0x1717ec) [0x150a799997ec]
73: /apps/oneapi/mpi/2021.5.1/lib/release/libmpi.so.12(+0x2b4387) [0x150a79adc387]
73: /apps/oneapi/mpi/2021.5.1/lib/release/libmpi.so.12(PMPI_Allreduce+0x561) [0x150a799376e1]
73: /apps/oneapi/mpi/2021.5.1/lib/release/libmpi.so.12(MPI_File_open+0x17d) [0x150a7a4492bd]
73: /scratch1/NCEPDEV/nems/role.epic/spack-stack/spack-stack-1.6.0/envs/unified-env-rocky8/install/intel/2021.5.0/parallel-netcdf-1.12.2-cwokdeb/lib/libpnetcdf.so.4(ncmpio_create+0x199) [0x150a73e592c9]
73: /scratch1/NCEPDEV/nems/role.epic/spack-stack/spack-stack-1.6.0/envs/unified-env-rocky8/install/intel/2021.5.0/parallel-netcdf-1.12.2-cwokdeb/lib/libpnetcdf.so.4(ncmpi_create+0x4e7) [0x150a73daf4a7]
73: /scratch1/NCEPDEV/nems/role.epic/spack-stack/spack-stack-1.6.0/envs/unified-env-rocky8/install/intel/2021.5.0/parallelio-2.5.10-2wulfav/lib/libpioc.so(PIOc_createfile_int+0x2e6) [0x150a7c436696]
73: /scratch1/NCEPDEV/nems/role.epic/spack-stack/spack-stack-1.6.0/envs/unified-env-rocky8/install/intel/2021.5.0/parallelio-2.5.10-2wulfav/lib/libpioc.so(PIOc_createfile+0x41) [0x150a7c432451]
73: /scratch1/NCEPDEV/nems/role.epic/spack-stack/spack-stack-1.6.0/envs/unified-env-rocky8/install/intel/2021.5.0/parallelio-2.5.10-2wulfav/lib/libpiof.so(piolib_mod_mp_createfile_+0x25e) [0x150a7c1caabe]

Jun suggested a couple of further tests. I'll keep you updated.

ShanSunNOAA commented 3 months ago

@DeniseWorthen Thank you for your prompt response and the clever approach. Please let me know if there is anything I can assist with.

DeniseWorthen commented 3 months ago

I added the following to the cice_wrapper:

index d0aafbb..4668f03 100644
--- a/cicecore/drivers/nuopc/cmeps/cice_wrapper_mod.F90
+++ b/cicecore/drivers/nuopc/cmeps/cice_wrapper_mod.F90
@@ -72,7 +72,7 @@ subroutine ufs_logfhour(msg,hour)
     real(dbl_kind),    intent(in)    :: hour
     character(len=char_len)          :: filename
     integer(int_kind)                :: nunit
-    write(filename,'(a,i3.3)')'log.ice.f',int(hour)
+    write(filename,'(a,i4.4)')'log.ice.f',int(hour)
     open(newunit=nunit,file=trim(filename))
     write(nunit,'(a)')'completed: cice'
     write(nunit,'(a,f10.3)')'forecast hour:',hour

I then did the following tests:

1) Restarted from the above run at fh=2040. The model completed 100 days, so I don't think any model behaviour near ~90 days is causing the problem.

2) Made a run with CICE history turned off, and the model ran out to 100 days. So the issue is definitely in the CICE history write.

3) Made a run adding the following to the job_card (see https://convergecfd.com/getting-started-guide-v3/environment-configuration-erro.htm):

export I_MPI_SHM_HEAP_VSIZE=16384

and the model ran out to 100 days.

ShanSunNOAA commented 3 months ago

@DeniseWorthen Thanks for your investigation. Where do I adjust or add "I_MPI_SHM_HEAP_VSIZE" when running UFS in the workflow environment, i.e., in config.ice or config.fcst? Thanks.

DeniseWorthen commented 3 months ago

@ShanSunNOAA I don't know where to add this in the workflow; it is not an ice configuration specifically, and I don't know where the equivalent of the job card gets generated. For the RTs, I added it just after OMP_STACKSIZE:

/scratch1/NCEPDEV/stmp2/Denise.Worthen/FV3_RT/rt_2767165/dt900.shmheap/job_card
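For reference, the relevant lines in that job_card look roughly like this (sketch; the OMP_STACKSIZE value shown is illustrative):

    export OMP_STACKSIZE=512M
    export I_MPI_SHM_HEAP_VSIZE=16384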

junwang-noaa commented 3 months ago

Another solution is to use Intel 2023.2:

https://community.intel.com/t5/Intel-MPI-Library/MPI-program-aborts-with-an-quot-Assertion-failed-in-file-ch4-shm/m-p/1370537

The Hera admins confirmed that Intel/2023.2.1 and impi/2023.2.1 are available on Hera; it's worth trying the new version of the Intel compiler to see whether this issue is resolved.

DeniseWorthen commented 3 months ago

@junwang-noaa This requires a spack-stack to be built w/ these compilers, correct?

junwang-noaa commented 3 months ago

Yes, a spack-stack issue has been created.

ShanSunNOAA commented 3 months ago

I forgot to mention that Jessica suggested adding "export I_MPI_SHM_HEAP_VSIZE=16384" in HERA.env. This has been working well for me; using the workflow, I have run UFS for up to 9 months. Thank you all.

DeniseWorthen commented 3 months ago

I updated the model by one commit (staying on the 1.6 spack-stack) and used the updated Intel compiler build provided in https://github.com/JCSDA/spack-stack/issues/1147:

diff --git a/modulefiles/ufs_hera.intel.lua b/modulefiles/ufs_hera.intel.lua
index 993e0537..d27136ce 100644
--- a/modulefiles/ufs_hera.intel.lua
+++ b/modulefiles/ufs_hera.intel.lua
@@ -2,12 +2,12 @@ help([[
 loads UFS Model prerequisites for Hera/Intel
 ]])

-prepend_path("MODULEPATH", "/scratch1/NCEPDEV/nems/role.epic/spack-stack/spack-stack-1.6.0/envs/unified-env-rocky8/install/modulefiles/Core")
+prepend_path("MODULEPATH", "/scratch1/NCEPDEV/nems/role.epic/spack-stack/spack-stack-1.6.0/envs/ue-intel-2023.2.0/install/modulefiles/Core")

-stack_intel_ver=os.getenv("stack_intel_ver") or "2021.5.0"
+stack_intel_ver=os.getenv("stack_intel_ver") or "2023.2.0"
 load(pathJoin("stack-intel", stack_intel_ver))

-stack_impi_ver=os.getenv("stack_impi_ver") or "2021.5.1"
+stack_impi_ver=os.getenv("stack_impi_ver") or "2021.10.0"
 load(pathJoin("stack-intel-oneapi-mpi", stack_impi_ver))

 cmake_ver=os.getenv("cmake_ver") or "3.23.1"

I re-ran the cpld c48 case w/o the export I_MPI_SHM_HEAP_VSIZE=16384. The model completed fh 2400, the same as the shmheap test case.

junwang-noaa commented 3 months ago

Great! So this does resolve the issue! Thanks for testing.

ShanSunNOAA commented 3 months ago

@DeniseWorthen Thanks for testing this solution without increasing "I_MPI_SHM_HEAP_VSIZE". That is great news.

DeniseWorthen commented 3 months ago

I've deleted my previous comment about letting the intel 2023.2 test run as long as possible. I was using the old executable. I will try a clean test.

DeniseWorthen commented 3 months ago

I repeated my test using the 2023.2 intel w/o the heap_vsize variable, and it completed fh3168 (132 days) before failing w/ the same error:

Assertion failed in file ../../src/mpid/ch4/src/intel/ch4_shm_coll.c at line 2255: comm->shm_numa_layout[my_numa_node].base_addr

I'll try a second test adding the vsize variable.

DeniseWorthen commented 3 months ago

The 2023.2 intel case with the addition of the heap_vsize variable completed fh6816 (284 days) before timing out.

junwang-noaa commented 3 months ago

@DeniseWorthen Do we know whether setting "I_MPI_SHM_HEAP_VSIZE=16384" alone will allow the test case to run through 284 days?

DeniseWorthen commented 3 months ago

@junwang-noaa That was my next step. I will re-compile w/ the default SS16 and re-run the test.

DeniseWorthen commented 3 months ago

Note, the tests with the intel 2023.2 also included the addition of a call to pio_syncfile(File) in CICE after the history write:

index b8971a8..22ac8a7 100644
--- a/cicecore/cicedyn/infrastructure/io/io_pio2/ice_history_write.F90
+++ b/cicecore/cicedyn/infrastructure/io/io_pio2/ice_history_write.F90
@@ -1234,6 +1234,8 @@ subroutine ice_write_hist (ns)
       !-----------------------------------------------------------------
       call pio_seterrorhandling(File, PIO_INTERNAL_ERROR)

+      call pio_syncfile(File)
+
       call pio_freedecomp(File,iodesc2d)
       call pio_freedecomp(File,iodesc3dv)
       call pio_freedecomp(File,iodesc3dc)

I had tested this previously with SS16 and w/o the heap variable and it did not resolve the error. I will leave it in for consistency.

DeniseWorthen commented 3 months ago

I recompiled w/ the default SS16 modules and re-ran using only the heap_vsize variable, with the queue wall clock set to the maximum of 8 hours. In this case, the model was unstable and failed with:

0: FATAL from PE     0: NaN in input field of mpp_reproducing_sum(_2d), this indicates numerical instability
 0:
42: Image              PC                Routine            Line        Source
42: fv3_s2s.dev.exe    0000000004FF02A7  Unknown               Unknown  Unknown
42: fv3_s2s.dev.exe    00000000042C6C09  mpp_mod_mp_mpp_er          72  mpp_util_mpi.inc
42: fv3_s2s.dev.exe    000000000458EDE2  mpp_efp_mod_mp_mp         195  mpp_efp.F90
42: fv3_s2s.dev.exe    00000000044D56B1  mpp_domains_mod_m         143  mpp_global_sum.fh
42: fv3_s2s.dev.exe    0000000001E319CF  fv_grid_utils_mod        3077  fv_grid_utils.F90
42: fv3_s2s.dev.exe    0000000001E6BC73  fv_mapz_mod_mp_la         794  fv_mapz.F90

The failure occurred at fh2574 (107 days).

DeniseWorthen commented 2 months ago

Per the conversation in the Scalability meeting today, the previous discussion on locating memory leaks in the system is https://github.com/ufs-community/ufs-weather-model/discussions/779

DeniseWorthen commented 2 months ago

Using the intel 2023.2 compile w/o the heap size variable, I ran for 100 days with the ProfileMemory=true setting and plotted the VmPeak value upon exiting ModelAdvance (for the atm, this was after phase 2) relative to the first value (i.e., vmpeak(t)/vmpeak(t=0)). I plotted every other PE for ATM (top), OCN (mid), and ICE (bottom):

[plots: relative VmPeak vs. forecast time for ATM (top), OCN (mid), and ICE (bottom)]

DeniseWorthen commented 2 months ago

I did two tests, one with the ATM C48 test and one w/ the C96 P8 test. The C48 atm test doesn't use MERRA and uses low-res fix files (all .grb). The C96 atm test uses MERRA and the tiled fix files. Both show the same jump on the 15th of the month. The C48 (left) and C96 (right):

[plots: relative VmPeak vs. forecast time for the C48 (left) and C96 (right) ATM tests]

XiaqiongZhou-NOAA commented 2 months ago

I have a similar problem on WCOSS2. The coupled forecasts always crash after reaching 135 days. export I_MPI_SHM_HEAP_VSIZE=16384 is added in wcoss2.env but it does not help. The following is the error message:

324344 nid003457.cactus.wcoss2.ncep.noaa.gov 0: in atmos_model update, fhzero= 6.00000000000000 fhour= 3240.000
324345 0.0000000E+00
324346 nid003457.cactus.wcoss2.ncep.noaa.gov 0: zeroing coupling accumulated fields at kdt= 9720
324347 zeroing coupling accumulated fields at kdt= 9720
324348 write out restart at n_atmsteps= 9719 seconds= 11664000
324349 integration length= 3239.667
324350 nid003457.cactus.wcoss2.ncep.noaa.gov 0: PASS: fcstRUN phase 2, n_atmsteps = 9719 time is 0.106315
324351 aft fcst run output time= 11664000 FBcount= 8 na= 9720
324352 nid004044.cactus.wcoss2.ncep.noaa.gov 373: MPICH Notice [Rank 373] [job id d90a271d-396d-409c-a11c-1a8f6d64b40b] [Mon Jul 29 20:29:51 2024] [nid004044] - Abort(70347535) (rank 373 in comm 0): Fatal error in PMPI_Allreduce: Other MPI error, error stack:
324353 PMPI_Allreduce(472)..................: MPI_Allreduce(sbuf=0x7ffe305bd28c, rbuf=0x7ffe305bd29c, count=1, datatype=MPI_INT, op=op=0x98000000, comm=comm=0xc4004298) failed
324354 PMPI_Allreduce(457)..................:
324355 MPIR_CRAY_Allreduce(478).............:
324356 MPIDI_CRAY_Setup_Shared_Mem_Coll(838):
324357 MPIR_Comm_dup_impl(51)...............:
324358 MPII_Comm_copy(544)..................:
324359 MPIR_Get_contextid_sparse_group(679).: Too many communicators (0/2048 free on this process; ignore_id=0)

LarissaReames-NOAA commented 2 weeks ago

As a new member of the SFS team, @junwang-noaa has asked me to look in to this issue. With the help of @DeniseWorthen I was able to get a few test runs in with memory profiling.

I first wanted to make sure this memory issue wasn't present in the hash that @ShanSunNOAA reported as working previously (45c8b2a). Because the module files in that branch are no longer available, I used the modules in the current develop branch, along with a test using the same intel 2023.2.0 installation that Denise used above. The same memory behavior is present in the C48 ATM test at this commit with both compilers (2021 first, 2023 second), though the profiles aren't identical.

[plots: C48 memory profiles, intel 2021 (first) and intel 2023 (second)]

These results made me think this is perhaps a library issue associated (somehow) with NetCDF reads. I decided to try a run of the regional_control ATM test, which isn't run long enough to read new climo files on the 15th of the month but does read lateral boundary condition files (also in NetCDF format) every 180 minutes. We see similar memory jumps for these runs at every LBC read. Again, intel 2021 first, 2023 second.

[plots: regional_control memory profiles, intel 2021 (first) and intel 2023 (second)]

Assuming this memory behavior is the source of the long-run failures, it looks like this is perhaps not a code issue, since @ShanSunNOAA was able to use this hash successfully in the past but with different modules (and perhaps on a non-Rocky8 OS on Hera). I have a C48 run going right now with the intel 2023 executable and a full forecast length of 3200 hours to see how far it gets. I'll report back once it's complete and will also open a new issue specific to these memory problems, as they seem more pervasive than a single coding bug.

EDIT: Note I am using more than the standard allocation of 8 processes on a single node for the C48 tests to decrease runtime. I'll make sure to switch back to a single node when I check for the long-run failure.

LarissaReames-NOAA commented 1 week ago

Based on the test from my previous comment, I decided to start investigating other sources of the "Assertion failed" message that @ShanSunNOAA originally reported, by running S2S regression tests. After lots of unstable forecasts with cpld_control_c48, I've been able to run several tests using cpld_control_nowave_noaero_p8 at 8a5f711, the first hash after Shan's working version.

After confirming that the "Assertion failed" error does not occur with Shan's working hash, I can confirm that it does occur at 8a5f711. When I set CICE_IO="NetCDF" instead of the default "PIO" (used in the failed runs, introduced in WM PR #2145), with restart/history_format='hdf5' or 'default' in ice_in, the failure disappears and the model runs to the 8-hour max wall clock time. As additional tests, I ran again with CICE_IO="PIO" and restart/history_format="pnetcdf1" and "pnetcdf5", and both of these runs also fail. This pretty clearly points to either the PIO library or its implementation as the source of the "Assertion failed" errors.
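For reference, CICE_IO (NetCDF vs. the default PIO) is a build-time choice, while the formats are ice_in namelist entries; a sketch of the settings being varied (namelist group name assumed):

    &setup_nml
      history_format = 'hdf5'   ! 'hdf5'/'default' ran; 'pnetcdf1'/'pnetcdf5' failed
      restart_format = 'hdf5'
    /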

I did one last test this afternoon to see whether these are accumulating errors or whether something specific in the various ICs might be triggering the failure. I left the ICE restart frequency the same (every 12 hours) but doubled the ICE history frequency from every 6 hours to every 3 hours. The model failed in 60% of the time (FHR 1210 vs. FHR 2010) after writing 504 files (404 history and 100 restart), whereas the 6-hour-frequency run wrote 501 files (335 history, 167 restart). This suggests there is some accumulating memory issue that compounds every time an ICE history or restart file is written.