ufs-community / ufs-weather-model

UFS Weather Model

HRv4 hangs on orion and hercules #2486

Open RuiyuSun opened 3 weeks ago

RuiyuSun commented 3 weeks ago

George V. noticed that HRv4 does not work on Hercules or Orion. It hangs sometime after WW3 starts, and there are no relevant messages in the log files about the hang.

To Reproduce: Run an HRv4 experiment on Hercules or Orion

Additional context

Output

GeorgeVandenberghe-NOAA commented 1 week ago

We should be able to figure out analytically what resources the write grid components require. I was talking to Jun about that. The model state is on the order of ~120GB, and UPP makes a second copy for.. reasons. That gets us to 240GB, and UPP scratch space may eat up another 100GB or so spread across the nodes used by the write grid component. WCOSS2 has 512GB per node, so that should easily be enough, and I am puzzled that it's not. One thing that helps is to make sure only one write grid component is on a node; to do that, the ranks per I/O group should be an integral multiple of ppn, which is typically 128/4 = 32 for these runs. 120 and 60 don't meet this requirement. 128 and 64 (and 32) do.

On hercules/orion it should be 40/cpus-per-task, typically 20 or 10.
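As a rough sketch of that rule (illustrative values only, not a tested configuration), the relevant model_configure settings on WCOSS2 with 128 cores per node and 4 threads per task (ppn = 128/4 = 32) would be something like

 write_groups:           2
 write_tasks_per_group:  64

and on Hercules/Orion, with ppn = 40/cpus-per-task (20 at 2 threads, 10 at 4 threads), something like

 write_groups:           2
 write_tasks_per_group:  20

so that each write group fills whole nodes and no node hosts more than one write group.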

On Mon, Nov 18, 2024 at 9:19 AM Jessica Meixner @.***> wrote:

Also, an update on my WCOSS2 runs: reducing the total number of write grid components seems to have helped. I'll post more details on Monday.

On WCOSS2 running with a different wave grid, I got a segfault (full log file is on dogwood /lfs/h2/emc/couple/noscrub/jessica.meixner/WaveUglo15km/Test03/COMROOT/Test03/logs/2020021300/gfs_fcst_seg0.log):

zeroing coupling accumulated fields at kdt= 12
nid001107.dogwood.wcoss2.ncep.noaa.gov 0: PASS: fcstRUN phase 2, n_atmsteps = 11 time is 0.792372
nid001652.dogwood.wcoss2.ncep.noaa.gov 9216: d3d_on= F
nid001652.dogwood.wcoss2.ncep.noaa.gov: rank 9280 died from signal 9
nid001553.dogwood.wcoss2.ncep.noaa.gov 2056: forrtl: error (78): process killed (SIGTERM)

Rank 9280 is a write grid component task, given ATM_petlist_bounds: 0 10175, ATM_omp_num_threads: 4, and layout = 24,16.

For this run I had write_groups: 4 and write_tasks_per_group: 60.
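As a sanity check on the rank-9280 claim (my arithmetic, assuming the layout is per cubed-sphere tile and that with ESMF-managed threading the PET list counts threads as well as ranks):

 ATM compute ranks:  24 x 16 x 6 tiles  = 2304
 ATM write ranks:    4 groups x 60      =  240
 ATM ranks total:                         2544
 ATM PETs:           2544 x 4 threads   = 10176  (ATM_petlist_bounds: 0 10175)
 compute PETs:       2304 x 4           = 9216   (PETs 0-9215)

so PETs 9216-10175 belong to the write groups, and 9280 falls in that range. The first write PET, 9216, is also the one printing d3d_on in the log snippet above.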

Changing this to write_groups: 2 and write_tasks_per_group: 120 allowed the run to complete.

The successful log file is here: /lfs/h2/emc/couple/noscrub/jessica.meixner/WaveUglo15km/Test04/COMROOT/Test04/logs/2020021300/gfs_fcst_seg0.log

I suspect the issues we see on WCOSS2 are similar to what we've seen on Hercules/Orion but manifest as segfaults rather than hangs, though I could be wrong.


DeniseWorthen commented 1 week ago

In the job cards I got from Rahul's sandboxes, the nodes are specified as either

#SBATCH --nodes=63-63

for C768 and

#SBATCH --nodes=82-82

for C1152.

I'm not familiar w/ this notation. What does 82-82 mean?

GeorgeVandenberghe-NOAA commented 1 week ago

This looks like a gaea6 job card. Maybe it's also useful on gaea5, or maybe Slurm now supports this everywhere and I just didn't know it.


DeniseWorthen commented 1 week ago

@GeorgeVandenberghe-NOAA Since you seem to know, what does specifying the nodes like this mean?

GeorgeVandenberghe-NOAA commented 1 week ago

I have no idea.


JacobCarley-NOAA commented 6 days ago


Hi @DeniseWorthen. Here's the relevant snippet from the slurm documentation on sbatch:

-N, --nodes=&lt;minnodes&gt;[-maxnodes]|&lt;size_string&gt;: Request that a minimum of minnodes nodes be allocated to this job. A maximum node count may also be specified with maxnodes. If only one number is specified, this is used as both the minimum and maximum node count. Node count can be also specified as size_string. The size_string specification identifies what nodes values should be used. Multiple values may be specified using a comma separated list or with a step function by suffix containing a colon and number values with a "-" separator. For example, "--nodes=1-15:4" is equivalent to "--nodes=1,5,9,13". The partition's node limits supersede those of the job. If a job's node limits are outside of the range permitted for its associated partition, the job will be left in a PENDING state. This permits possible execution at a later time, when the partition limit is changed. If a job node limit exceeds the number of nodes configured in the partition, the job will be rejected. Note that the environment variable SLURM_JOB_NUM_NODES will be set to the count of nodes actually allocated to the job. See the ENVIRONMENT VARIABLES section for more information. If -N is not specified, the default behavior is to allocate enough nodes to satisfy the requested resources as expressed by per-job specification options, e.g. -n, -c and --gpus. The job will be allocated as many nodes as possible within the range specified and without delaying the initiation of the job. The node count specification may include a numeric value followed by a suffix of "k" (multiplies numeric value by 1,024) or "m" (multiplies numeric value by 1,048,576). NOTE: This option cannot be used in with arbitrary distribution.

So, I'm pretty sure it's just specifying the minimum and maximum number of nodes the job can run with. In this case they are the same.
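In sbatch terms, for the job cards above (the second line is a hypothetical contrast, not taken from the sandboxes):

 #SBATCH --nodes=82-82    # min = max = 82: exactly 82 nodes
 #SBATCH --nodes=80-84    # Slurm may allocate anywhere from 80 to 84 nodes

i.e. 82-82 just pins the allocation to exactly 82 nodes, which is equivalent to --nodes=82.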

DeniseWorthen commented 6 days ago

@JacobCarley-NOAA Thanks. That makes sense.

DeniseWorthen commented 5 days ago

I copied Rahul's C768 run directory (and created my own fix subdir) and compiled both the top of develop and the HR4 tag in debug mode using

./compile.sh hercules "-DAPP=S2SW -D32BIT=ON -DCCPP_SUITES=FV3_GFS_v17_coupled_p8_ugwpv1 -DPDLIB=ON -DDEBUG=ON" s2sw.dev.db intel  NO NO 2>&1 | tee s2sw.dev.db.log

When I try c768s2sw_gfsfcst.sh, both dev and the tag give me a seg fault (they don't even start):

 159: [hercules-01-36:826937:0:826937] Caught signal 8 (Floating point exception: floating-point invalid operation)
6082: ==== backtrace (tid: 630246) ====
6082:  0 0x000000000005f14c ucs_callbackq_cleanup()  ???:0
6082:  1 0x000000000005f40a ucs_callbackq_cleanup()  ???:0
6082:  2 0x0000000000054d90 __GI___sigaction()  :0
6082:  3 0x0000000000048f52 ucp_proto_perf_envelope_make()  ???:0
6082:  4 0x0000000000054bbc ucp_proto_select_elem_trace()  ???:0
6082:  5 0x0000000000056261 ucp_proto_select_lookup_slow()  ???:0
6082:  6 0x0000000000056725 ucp_proto_select_short_init()  ???:0
6082:  7 0x000000000004bc1c ucp_worker_add_rkey_config()  ???:0
6082:  8 0x00000000000648ff ucp_proto_rndv_ctrl_init()  ???:0
6082:  9 0x0000000000064aff ucp_proto_rndv_rts_init()  ???:0
6082: 10 0x0000000000054a42 ucp_proto_select_elem_trace()  ???:0
6082: 11 0x0000000000056261 ucp_proto_select_lookup_slow()  ???:0
6082: 12 0x0000000000056725 ucp_proto_select_short_init()  ???:0
6082: 13 0x000000000004b789 ucp_worker_get_ep_config()  ???:0
6082: 14 0x00000000000a159c ucp_wireup_init_lanes()  ???:0
6082: 15 0x00000000000339ce ucp_ep_create_to_worker_addr()  ???:0
6082: 16 0x0000000000034b33 ucp_ep_create()  ???:0
6082: 17 0x00000000000078bb mlx_av_insert()  mlx_av.c:0
6082: 18 0x00000000006595fb fi_av_insert()  /p/pdsd/scratch/Uploads/IMPI/other/software/libfabric/linux/v1.9.0/include/rdma/fi_domain.h:414
6082: 19 0x00000000006595fb insert_addr_table_roots_only()  /build/impi/_buildspace/release/../../src/mpid/ch4/netmod/ofi/ofi_init.c:448
6082: 20 0x00000000006595fb MPIDI_OFI_mpi_init_hook()  /build/impi/_buildspace/release/../../src/mpid/ch4/netmod/ofi/ofi_init.c:1604
6082: 21 0x00000000002296f4 MPID_Init()  /build/impi/_buildspace/release/../../src/mpid/ch4/src/ch4_init.c:1544
6082: 22 0x00000000004ce935 MPIR_Init_thread()  /build/impi/_buildspace/release/../../src/mpi/init/initthread.c:175
6082: 23 0x00000000004ce935 PMPI_Init_thread()  /build/impi/_buildspace/release/../../src/mpi/init/initthread.c:318
6082: 24 0x000000000117376d ESMCI::VMK::init()  /work/noaa/epic/role-epic/spack-stack/hercules/spack-stack-1.6.0/cache/build_stage/spack-stage-esmf-8.6.0-rqrapepmgfb7kpri3ynqlxusquf6npfq/spack-src/src/Infrastructure/VM/src/ESMCI_VMKernel.C:423
6082: 25 0x00000000012f9e3f ESMCI::VM::initialize()  /work/noaa/epic/role-epic/spack-stack/hercules/spack-stack-1.6.0/cache/build_stage/spack-stage-esmf-8.6.0-rqrapepmgfb7kpri3ynqlxusquf6npfq/spack-src/src/Infrastructure/VM/src/ESMCI_VM.C:3200
6082: 26 0x00000000009da3c5 c_esmc_vminitialize_()  /work/noaa/epic/role-epic/spack-stack/hercules/spack-stack-1.6.0/cache/build_stage/spack-stage-esmf-8.6.0-rqrapepmgfb7kpri3ynqlxusquf6npfq/spack-src/src/Infrastructure/VM/interface/ESMCI_VM_F.C:1186
6082: 27 0x0000000000cc6810 esmf_vmmod_mp_esmf_vminitialize_()  /work/noaa/epic/role-epic/spack-stack/hercules/spack-stack-1.6.0/cache/build_stage/spack-stage-esmf-8.6.0-rqrapepmgfb7kpri3ynqlxusquf6npfq/spack-src/src/Infrastructure/VM/interface/ESMF_VM.F90:9321
6082: 28 0x0000000000b1bc47 esmf_initmod_mp_esmf_frameworkinternalinit_()  /work/noaa/epic/role-epic/spack-stack/hercules/spack-stack-1.6.0/cache/build_stage/spack-stage-esmf-8.6.0-rqrapepmgfb7kpri3ynqlxusquf6npfq/spack-src/src/Superstructure/ESMFMod/src/ESMF_Init.F90:711
6082: 29 0x0000000000b2140e esmf_initmod_mp_esmf_initialize_()  /work/noaa/epic/role-epic/spack-stack/hercules/spack-stack-1.6.0/cache/build_stage/spack-stage-esmf-8.6.0-rqrapepmgfb7kpri3ynqlxusquf6npfq/spack-src/src/Superstructure/ESMFMod/src/ESMF_Init.F90:401
6082: 30 0x0000000000431e9c MAIN__()  /work/noaa/nems/dworthen/ufs-weather-model/driver/UFS.F90:97
6082: 31 0x0000000000431abd main()  ???:0
6082: 32 0x000000000003feb0 __libc_start_call_main()  ???:0
6082: 33 0x000000000003ff60 __libc_start_main_alias_2()  :0
6082: 34 0x00000000004319d5 _start()  ???:0
6082: =================================

Run directory (hercules): /work2/noaa/stmp/dworthen/c768s2sw.2

GeorgeVandenberghe-NOAA commented 5 days ago

For Hercules my current snapshot is in /work2/noaa/noaatest/gwv/herc/hr4j/. The rundir is ./dc and the source dir is ./sorc. To build, I cd to ./sorc/ufs_model.fd, load the compilers, and set

 export PREFIX=/work/noaa/noaatest/gwv/herc/simstacks/simstack.1008/netcdf140.492.460.mapl241.fms2301.crtm240
 export NETP=/work/noaa/noaatest/gwv/herc/simstacks/simstack.1008/netcdf140.492.460.mapl241.fms2301.crtm240
 export CMAKE_PREFIX_PATH=/work/noaa/noaatest/gwv/herc/simstacks/simstack.1008/netcdf140.492.460.mapl241.fms2301.crtm240
 export ESMFMKFILE=/work/noaa/noaatest/gwv/herc/simstacks/simstack.1008/netcdf140.492.460.mapl241.fms2301.crtm240/ESMF_8_5_0/lib/esmf.mk

With this done, the following script builds it:

 rm -rf build
 mkdir build
 cd build
 export CMAKE_PREFIX_PATH=$NETP/fms.2024.01:$NETP
 cmake .. -DAPP=S2SWA -D32BIT=ON -DCCPP_SUITES=FV3_GFS_v17_p8_ugwpv1,FV3_GFS_v17_coupled_p8_ugwpv1,FV3_global_nest_v1 -DPDLIB=ON -DMPI=ON -DCMAKE_BUILD_TYPE=Release -DMOM6SOLO=ON
 make -j 8 VERBOSE=1

I am sick and tired of broken stacks and just gave up and built my own :-( However, I do think this would work with the current Hercules spack-stack; I haven't tried it.

DeniseWorthen commented 5 days ago

I checked again that my configuration was a copy of Rahul's C768 run directory. With the debug compile, it fails immediately with the error I posted above. That run directory is /work2/noaa/stmp/dworthen/c768s2sw

I then used Dusan's instructions posted earlier for using traditional threading. He did it by removing all components other than ATM, so I made a similar adjustment with all components included. Using the same executable, it ran for 25 minutes of calendar time. That run directory is /work2/noaa/stmp/dworthen/c768s2sw.2. I used the job_card there, so check the out and err files.

I haven't yet tried the 2nd case w/ a non-debug compile. I did confirm that the first case hangs w/ a release compile.

Also, I made the WW3 points list only 240 long in both cases. (See ww3_shel.nml, which is being used in my tests since it is easy then to point to a different point list.)

GeorgeVandenberghe-NOAA commented 4 days ago

Okay, on Hercules: 24x32 ATM decomposition, two threads per task (ESMF resource control FALSE), 4 I/O groups with 160 MPI ranks per group, 240 OCN tasks, 120 ICE tasks, 1400 WAVE tasks, 32 tasks per node.

On Orion: 24x24 ATM decomposition, 2 I/O groups of 240 tasks each, 240 OCN tasks, 120 ICE tasks, 998 WAVE tasks, 2 threads per task, 16 tasks per node.
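For a rough tally of what the Hercules configuration above adds up to (my arithmetic, assuming the 24x32 ATM layout is per cubed-sphere tile):

 ATM compute: 24 x 32 x 6 tiles  = 4608 ranks
 ATM write:   4 groups x 160     =  640 ranks
 OCN                             =  240 ranks
 ICE                             =  120 ranks
 WAV                             = 1400 ranks
 total                           = 7008 MPI ranks  (219 nodes at 32 tasks/node)

which matches the 7008-rank figure mentioned further down in this thread.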

GeorgeVandenberghe-NOAA commented 4 days ago

The hangs I reported earlier seem to happen at higher decompositions and resource usages. Running that down.

GeorgeVandenberghe-NOAA commented 4 days ago

The problem remains that we can't quickly find WHERE in the various component(s) we are getting stuck.

DeniseWorthen commented 3 days ago

Based on my testing, the issue seems to be fundamentally one w/ using ESMF managed threading. I've been doing all my testing in /work2/noaa/stmp/dworthen/hangtests, with sub-dirs there for ESMF-managed threading (ESMFT) and traditional threading (TRADT).

I can run the test case with traditional threading with the G-W executable (from Rahul's sandbox), with my own compile, and with my own debug compile.

I cannot run with ESMF managed threading with either the G-W executable, my own compile, or my own debug compile. I've tried with and without waves. In all cases I either get a hang or, with the debug compile, the floating point exception error I posted above.
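For reference, the difference between the two modes shows up in ufs.configure roughly as follows (values illustrative, not copied from these run directories). With ESMF-managed threading, the threading is declared there and the PET list counts threads:

 globalResourceControl: true
 ATM_petlist_bounds: 0 10175
 ATM_omp_num_threads: 4

With traditional threading, resource control is off, the PET list counts only MPI ranks (2544 = 10176/4 for the same ATM footprint), and the thread count comes from OMP_NUM_THREADS / --cpus-per-task in the job card instead:

 globalResourceControl: false
 ATM_petlist_bounds: 0 2543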

@JacobCarley-NOAA I think at this point it is not a WAV issue, assuming you reduce the points list to something small. I think others are better suited to debugging it. That will allow me to return my focus to the grid-imprint issue (#2466), which I know is also very high priority.

BrianCurtis-NOAA commented 3 days ago

I wonder if there was a build option missed that is causing managed threading to not work correctly?

BrianCurtis-NOAA commented 3 days ago

What I mean is: with how the ESMF library was built in those stacks.

JessicaMeixner-NOAA commented 3 days ago

@JacobCarley-NOAA - as a near-term workaround to get Orion/Hercules working, I plan to request a global-workflow feature to add traditional threading, unless you'd prefer a different path forward?

GeorgeVandenberghe-NOAA commented 3 days ago

I get the hang at high rank counts (8000 or so) without managed threading on Orion.


JacobCarley-NOAA commented 3 days ago

@DeniseWorthen Thanks so much for your efforts. Please proceed to return to the grid imprint issue (#2466).

@JessicaMeixner-NOAA I think the ability to run with traditional threading (no managed threading) was added to GW earlier this year (see GW Issue 2277). However, I'm not sure if it's working. If it's not, I'd recommend proceeding with opening a new issue for this feature. Since something might already exist, hopefully it's not too much of a lift to get it going. This will hopefully get you working in the short-ish term.

Now, there's still something going on that we need to understand. @GeorgeVandenberghe-NOAA Would you be able to continue digging into this issue?

JessicaMeixner-NOAA commented 3 days ago

@JacobCarley-NOAA a comment from @aerorahul earlier in this thread:

Traditional threading is not yet supported in the global-workflow as an option. We have the toggle for it, but it requires a different set of ufs_configure files and I think we are waiting for that kind of work to be in the ufs-weather-model repo.

I'll open a g-w issue (update: g-w issue: https://github.com/NOAA-EMC/global-workflow/issues/3122)

GeorgeVandenberghe-NOAA commented 3 days ago

I intend to, but if I encounter hangs I need people who know the component codes to figure out where and why the hangs are occurring. Debugging is very slow on Orion, where I have encountered a hang with 7008 MPI ranks, 1400 wave ranks, and a 24x32 ATM decomposition WITHOUT ESMF managed threading. It looks like an issue with large numbers of ranks, which we hit first with ESMF managed threading but eventually, at higher resolution, without it too. This is DIFFERENT from the ESMF bug where we still can't spawn more than 21K ranks without a segfault somewhere in the ESMF code.


JacobCarley-NOAA commented 3 days ago

Thanks @GeorgeVandenberghe-NOAA! Just send me a quick note offline (email is fine) when you need a component expert to jump in and I'll be happy to coordinate accordingly.

GeorgeVandenberghe-NOAA commented 11 hours ago

It looks like the hangs are related to the total number of WAVE tasks but are also related to total resource usage.

I have verified that a 16x16 decomposition (ATM) with traditional threads (two per rank) and 1400 wave ranks does not hang on either Orion or Hercules but a 24x32 decomposition with 1400 wave ranks does. 998 rank runs do get through with a 24x32 decomposition. So it looks like total job resources is a contributing issue. It isn't just a hard barrier that we can't run 1400 wave tasks on orion or hercules.