ufs-community / ufs-weather-model

UFS Weather Model

HRv4 hangs on orion and hercules #2486

Open RuiyuSun opened 3 weeks ago

RuiyuSun commented 3 weeks ago

George V. noticed that HRv4 does not work on Hercules or Orion. It hangs sometime after WW3 starts. There are no relevant messages in the log files about the hang.

To Reproduce: Run an HRv4 experiment on Hercules or Orion

Additional context

Output

GeorgeVandenberghe-NOAA commented 3 weeks ago

This happens at high ATM resolution C1152.

RuiyuSun commented 2 weeks ago

I made an HRv4 test run on Orion as well. As reported previously, it hung at the beginning of the run.

The log file is at /work2/noaa/stmp/rsun/ROTDIRS/HRv4

HOMEgfs=/work/noaa/global/rsun/git/global-workflow.hr.v4 (source)
EXPDIR=/work/noaa/global/rsun/para_gfs/HRv4
COMROOT=/work2/noaa/stmp/rsun/ROTDIRS
RUNDIRS=/work2/noaa/stmp/rsun/RUNDIRS

LarissaReames-NOAA commented 2 weeks ago

@RuiyuSun Denise reports that the privacy settings on your directories are preventing her from accessing them. Could you check on that and report back when it's fixed so others can look at your forecast?

RuiyuSun commented 2 weeks ago

@DeniseWorthen I made the changes. Please try again.

JessicaMeixner-NOAA commented 2 weeks ago

I've made a few test runs on my end and here are some observations:

Consistently, all of the runs I have made (the same as @RuiyuSun's runs) stall out here:

    0:  fcst_initialize total time:    200.367168849800
    0:  fv3_cap: field bundles in fcstComp export state, FBCount=            8
    0:  af allco wrtComp,write_groups=           4
 9216: NOTE from PE     0: MPP_DOMAINS_SET_STACK_SIZE: stack size set to    32768.
 9216:  &MPP_IO_NML
 9216:  HEADER_BUFFER_VAL       =       16384,
 9216:  GLOBAL_FIELD_ON_ROOT_PE = T,
 9216:  IO_CLOCKS_ON    = F,
 9216:  SHUFFLE =           0,
 9216:  DEFLATE_LEVEL   =          -1,
 9216:  CF_COMPLIANCE   = F
 9216:  /
 9216: NOTE from PE     0: MPP_IO_SET_STACK_SIZE: stack size set to     131072.
 9216: NOTE from PE     0: MPP_DOMAINS_SET_STACK_SIZE: stack size set to 16000000.
 9216:  num_files=           2
 9216:  num_file=           1 filename_base= atm output_file= netcdf_parallel
 9216:  num_file=           2 filename_base= sfc output_file= netcdf_parallel
 9216:  grid_id=            1  output_grid= gaussian_grid
 9216:  imo=        4608 jmo=        2304
 9216:  ideflate=           1
 9216:  quantize_mode=quantize_bitround quantize_nsd=           5
 9216:  zstandard_level=           0
    0:  af wrtState reconcile, FBcount=           8
    0:  af get wrtfb=output_atm_bilinear rc=           0

With high-resolution runs (C768 & C1152) we've had to use different numbers of write grid tasks on various machines. I've tried a few and all are stalling, though. This is using ESMF managed threading, so one thing to try might be moving away from that?

To run a high res test case:

git clone --recursive https://github.com/NOAA-EMC/global-workflow
cd global-workflow/sorc
./build_all.sh
./link_workflow.sh
cd ../../
mkdir testdir 
cd testdir 
source ../global-workflow/workflow/gw_setup.sh 
HPC_ACCOUNT=marine-cpu pslot=C1152t02 RUNTESTS=`pwd` ../global-workflow/workflow/create_experiment.py --yaml ../global-workflow/ci/cases/hires/C1152_S2SW.yaml

Change C1152 to C768 to run that resolution, and also change HPC_ACCOUNT and pslot as desired. Lastly, if you want to turn off waves, change that in C1152_S2SW.yaml. If you want to change resources, look in global-workflow/parm/config/gfs/config.ufs in the C768/C1152 section.
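As a rough illustration, the resource settings in that section of config.ufs look something like the sketch below; the variable names follow those discussed in this thread, but the values are only examples, not recommendations:

# hypothetical excerpt of the C1152 block in parm/config/gfs/config.ufs
export layout_x_gfs=24                               # FV3 horizontal decomposition per tile
export layout_y_gfs=16
export WRITE_GROUP_GFS=4                             # number of write groups
export WRTTASK_PER_GROUP_PER_THREAD_PER_TILE_GFS=20  # write tasks per group, per thread, per tile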

If you want to run S2S only, change the app in global-workflow/ci/cases/hires/C1152_S2SW.yaml

My latest run log files can be found at: /work2/noaa/marine/jmeixner/wavesforhr5/test01/C1152t0/COMROOT/C1152t0/logs/2019120300/gfs_fcst_seg0.log (several runs are in progress, but they've all been running for over an hour and all hung at the same spot, despite changing write grid tasks).

JessicaMeixner-NOAA commented 2 weeks ago

@GeorgeVandenberghe-NOAA suggested trying 2 write groups with 240 tasks each. I meant to try that but unintentionally ran 2 write groups with 360 tasks per group instead; I did turn on all PET files, as @LarissaReames-NOAA thought they might have helpful info.

The run directory is here: /work2/noaa/marine/jmeixner/wavesforhr5/test01/STMP/RUNDIRS/C1152t06/gfs.2019120300/gfsfcst.2019120300/fcst.272800

The log file is here: /work2/noaa/marine/jmeixner/wavesforhr5/test01/C1152t06/COMROOT/C1152t06/logs/2019120300/gfs_fcst_seg0.log

The PET logs to me also point to write group issues. Any help with this would be greatly appreciated.

Tagging @aerorahul for awareness.

JacobCarley-NOAA commented 1 week ago

Thanks to everyone for the work on this. Has anyone tried this configuration with the write component off? That might help isolate where the problem is (hopefully), and then we can direct this accordingly for further debugging.

JessicaMeixner-NOAA commented 1 week ago

I have not tried this without the write component.

DusanJovic-NOAA commented 1 week ago

@JessicaMeixner-NOAA and others, I grabbed the run directory from the last experiment you ran (/work2/noaa/marine/jmeixner/wavesforhr5/test01/STMP/RUNDIRS/C1152t06/gfs.2019120300/gfsfcst.2019120300/fcst.272800), changed it to run just the ATM component, and converted it to run with traditional threading. It is currently running in /work2/noaa/stmp/djovic/stmp/fcst.272800, and it passed the initialization phase and finished writing the 000 and 003 hour outputs successfully. I submitted the job with just a 30 min wall-clock time limit, so it will fail soon.

I suggest you try running the full coupled version with traditional threading, if it's easy to reconfigure.

jiandewang commented 1 week ago

Some good news: I tried the HR4 tag; the only thing I changed was WRTTASK_PER_GROUP_PER_THREAD_PER_TILE_GFS from 20 to 10, and the model is running. Note my run is S2S. See the log file at /work/noaa/marine/Jiande.Wang/HERCULES/HR4/work/HR4-20191203/COMROOT/2019120300/HR4-20191203/logs/2019120300/gfsfcst_seg0.log
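A one-line sketch of that change, assuming it is made in the C1152 section of the config.ufs file mentioned earlier (the path and the sed pattern are illustrative, not an exact diff):

# hypothetical edit in the workflow clone used for the experiment
sed -i 's/WRTTASK_PER_GROUP_PER_THREAD_PER_TILE_GFS=20/WRTTASK_PER_GROUP_PER_THREAD_PER_TILE_GFS=10/' \
    global-workflow/parm/config/gfs/config.ufs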

jiandewang commented 1 week ago

My 48-hour run finished.

JessicaMeixner-NOAA commented 1 week ago

@DusanJovic-NOAA I tried running without ESMF threading, but I am struggling to get it set up correctly and running through. @aerorahul, is it expected that turning off ESMF managed threading in the workflow should work?

I'm also trying to replicate @jiandewang's success on Hercules, but with S2SW.

jiandewang commented 1 week ago

I also launched an S2SW run, but it's still in pending status.

JessicaMeixner-NOAA commented 1 week ago

WRTTASK_PER_GROUP_PER_THREAD_PER_TILE_GFS=10 with S2S did not work on orion: /work2/noaa/marine/jmeixner/wavesforhr5/test01/C1152t03/COMROOT/C1152t03/logs/2019120300/gfs_fcst_seg0.log

jiandewang commented 1 week ago

mine is on hercules

jiandewang commented 1 week ago

@JessicaMeixner-NOAA my gut feeling is that the issue is related to memory per node; Hercules has more than Orion. Maybe you can try 5 on Orion.

aerorahul commented 1 week ago

@DusanJovic-NOAA I tried running without ESMF threading, but I am struggling to get it set up correctly and running through. @aerorahul, is it expected that turning off ESMF managed threading in the workflow should work?

I'm also trying to replicate @jiandewang's success on Hercules, but with S2SW.

Traditional threading is not yet supported in the global-workflow as an option. We have the toggle for it, but it requires a different set of ufs_configure files, and I think we are waiting for that kind of work to be in the ufs-weather-model repo.

@DusanJovic-NOAA To run w/ traditional threading, what else did you update in the test case borrowed from @JessicaMeixner-NOAA?

DusanJovic-NOAA commented 1 week ago

I only changed ufs.configure:

  1. remove all components except ATM
  2. change globalResourceControl: from true to false
  3. change ATM_petlist_bounds: to 0 3023 - these numbers are the lower and upper bounds of the MPI ranks (0 based) used by the ATM model, in this case 24*16*6 + 2*360 = 3024, where 24 and 16 are the layout values from input.nml and 2*360 are the write component values from model_configure
  4. change ATM_omp_num_threads: from 4 to 1

And I added a job_card by copying one of the job_cards from a regression test run and changed:

  1. export OMP_NUM_THREADS=4 - where 4 is the number of OMP threads
  2. srun --label -n 3024 --cpus-per-task=4 ./ufs_model.x - here 3024 is the number of MPI ranks and 4 is the number of threads
  3. #SBATCH --nodes=152
     #SBATCH --ntasks-per-node=80

80 is the number of cores on Hercules compute nodes; 152 is the minimal number of nodes such that 152*80 >= 3024*4 (3024 MPI ranks, 4 cpus each).
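For reference, here is a minimal sketch of what such a converted job_card could look like, assuming Slurm on Hercules and using the values quoted above; the job name, account, and queue are placeholders rather than values taken from this thread:

#!/bin/bash
#SBATCH --job-name=c1152_atm_trad       # placeholder name
#SBATCH --account=<your_account>        # placeholder; use your allocation
#SBATCH --time=00:30:00                 # 30-min wall-clock limit, as in the test above
#SBATCH --nodes=152                     # minimal node count: 152*80 cores >= 3024 ranks * 4 cpus
#SBATCH --ntasks-per-node=80            # 80 cores per Hercules compute node

set -eux
# ufs.configure is assumed to already have globalResourceControl: false,
# ATM_petlist_bounds: 0 3023, and ATM_omp_num_threads: 1, as described above.
export OMP_NUM_THREADS=4                # traditional threading: 4 OpenMP threads per rank

# 3024 MPI ranks = 24*16*6 ATM compute tasks + 2*360 write-component tasks
srun --label -n 3024 --cpus-per-task=4 ./ufs_model.x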

aerorahul commented 1 week ago

I only changed ufs.configure:

  1. remove all components except ATM
  2. change globalResourceControl: from true to false
  3. change ATM_petlist_bounds: to 0 3023 - these numbers are the lower and upper bounds of the MPI ranks used by the ATM model, in this case 24*16*6 + 2*360 = 3024, where 24 and 16 are the layout values from input.nml and 2*360 are the write component values from model_configure

And I added a job_card by copying one of the job_cards from a regression test run and changed:

  1. export OMP_NUM_THREADS=4 - where 4 is the number of OMP threads
  2. srun --label -n 3024 --cpus-per-task=4 ./ufs_model.x - here 3024 is the number of MPI ranks and 4 is the number of threads
  3. #SBATCH --nodes=152
     #SBATCH --ntasks-per-node=80

80 is the number of cores on Hercules compute nodes; 152 is the minimal number of nodes such that 152*80 >= 3024*4.

Ok. Yes. That makes sense for the atm-only. Does your ufs.configure have a line for

ATM_omp_num_threads:            @[atm_omp_num_threads]

@[atm_omp_num_threads] would have been 4. Did you remove it? Or does it not matter since globalResourceControl is set to false?

The original value for ATM_petlist_bounds must have been 0 755, which you changed to 0 3023, I am assuming.

GeorgeVandenberghe-NOAA commented 1 week ago

OMP_NUM_THREADS performance is inconsistent and generally poor if

ATM_omp_num_threads: @[atm_omp_num_threads]

is not removed when esmf managed threading is set to false.

DusanJovic-NOAA commented 1 week ago

I just fixed my comment about ATM_omp_num_threads:. I set it to 1 from 4; I'm not sure whether it's ignored when globalResourceControl is set to false.

The original value for ATM_petlist_bounds was something like 12 thousand, which included the MPI ranks times 4 threads.

GeorgeVandenberghe-NOAA commented 1 week ago

Yes, ESMF managed threading requires several times more ranks, and ESMF fails when the rank count goes above 21000 or so. This is a VERY serious issue for resolution increases unless it is fixed; it was reported in February.

aerorahul commented 1 week ago

@JessicaMeixner-NOAA I think the global-workflow is coded to use the correct ufs_configure template and set the appropriate values for PETLIST_BOUNDS and OMP_NUM_THREADS in the ufs_configure file. The default in the global-workflow is to use ESMF_THREADING = YES. I am pretty sure one could use traditional threading as well, but that is unconfirmed, as there was still work being done to verify that traditional threading will work on WCOSS2 with the Slingshot updates and whatnot. Details on that are fuzzy to me at the moment.

BLUF, you/someone from the applications team could try traditional threading and we could gain some insight on performance at those resolutions. Thanks~

GeorgeVandenberghe-NOAA commented 1 week ago

I have MANY test cases that use traditional threading and have converted others from managed to traditional threading. It's generally needed at high resolution to get decent run rates.

aerorahul commented 1 week ago

Ok, @GeorgeVandenberghe-NOAA. Do we employ traditional threading at C768 and up? If so, we can set a flag in the global-workflow for those resolutions to use traditional threading. It should be easy enough to set that up.

GeorgeVandenberghe-NOAA commented 1 week ago

I don't know, because I usually get CWD test cases from others and work from there, but yes, that's an excellent idea. We probably should also use a multiple-stanza MPI launcher for the different components to minimize core wastage for components that don't thread, particularly WAVE.

JessicaMeixner-NOAA commented 1 week ago

Unfortunately I was unable to replicate @jiandewang's Hercules success for the HR4 tag with the top of develop. Moreover, 10 write tasks per group was not a lucky number for Orion either.

JessicaMeixner-NOAA commented 1 week ago

Unfortunately I was unable to replicate @jiandewang's Hercules success for the HR4 tag with the top of develop. Moreover, 10 write tasks per group was not a lucky number for Orion either.

Note this was with added waves, so this might have also failed for @jiandewang if he had used waves.

jiandewang commented 1 week ago

A summary of more tests I did on HERCULES: (1) S2S, fv3 layout=8x16, write tasks per group=10: runs fine; I repeated 3 more cases, all fine. (2) Same as (1) but layout=24x16: hangs. (3) Repeats of (1) and (2) but S2SW: all hang.

GeorgeVandenberghe-NOAA commented 1 week ago

Does anyone know WHY the coupled configurations are hanging? Does anyone have the knowledge to drill into all the components and find where, in which routine, one is stuck? Our older forecast models had this property: the MPI "timbers" were clearly exposed and we could see which line of code was stuck or silently failing (silent failures of one rank can also cause hangs).

JessicaMeixner-NOAA commented 1 week ago

On orion, turning off the write grid component means that we're now hanging during wave initialization. The log file can be found here: /work2/noaa/marine/jmeixner/wavesforhr5/test01/nowritegridt2/COMROOT/test01/logs/2019120300/gfs_fcst_seg0.log

Instructions from @aerorahul on running the workflow without the write grid component are:

In config.base, set QUILTING=.false., and in ush/parsing_model_configure_FV3.sh, set QUILTING_RESTART=".false.".
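Spelled out, a sketch of what those two settings look like once applied (the surrounding contents of each file are not shown and may differ by workflow version):

# in the experiment's config.base
export QUILTING=".false."

# in HOMEgfs/ush/parsing_model_configure_FV3.sh
QUILTING_RESTART=".false."

# after rerunning the forecast job, the run directory's model_configure should
# then show quilting set to .false. (an assumption about the template's key name)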

FYI @JacobCarley-NOAA

JacobCarley-NOAA commented 1 week ago

On orion, turning off the write grid component means that we're now hanging during wave initialization. The log file can be found here: /work2/noaa/marine/jmeixner/wavesforhr5/test01/nowritegridt2/COMROOT/test01/logs/2019120300/gfs_fcst_seg0.log

Instructions from @aerorahul on running the workflow without the write grid component are:

In config.base, set QUILTING=.false., and in ush/parsing_model_configure_FV3.sh, set QUILTING_RESTART=".false.".

FYI @JacobCarley-NOAA

Thanks @aerorahul for the instructions and @JessicaMeixner-NOAA for running the test! This is helpful. Give me a moment and I'll figure out next steps.

DeniseWorthen commented 1 week ago

At the risk of muddying the waters, I copied the WW3 mod_def and mesh from Jessica's run directory and was able to start up and run the model on Hercules using the DATM-S2SW configuration I've built. This is WW3 at the top of the current dev/ufs-weather-model, and WAV is on 592 tasks.

LarissaReames-NOAA commented 1 week ago

At the risk of muddying the waters, I copied the WW3 mod_def and mesh from Jessica's run directory and was able to start up and run the model on Hercules using the DATM-S2SW configuration I've built. This is WW3 at the top of the current dev/ufs-weather-model, and WAV is on 592 tasks.

So should the only difference between your experiment and Jessica's be Hercules vs. Orion? Or is the compile/run config different? Trying to narrow down potential sources of difference.

DeniseWorthen commented 1 week ago

My case uses a DATM with MOM6+CICE6 on the 1/4 deg tripole and a given WW3 configuration (structured, unstructured, etc.), so it essentially eliminates any fv3atm-related issues.

DeniseWorthen commented 1 week ago

If someone can provide me a canned run directory on either Hercules or Orion, I can see if I can figure out what is going on. But I'll need to pause work on issue #2466 to do so.

RuiyuSun commented 6 days ago

@DeniseWorthen @JacobCarley-NOAA A canned case was created on Orion. Please see below for the information:
Jobcard is located in the experiment dir: /work/noaa/global/rsun/para_gfs/HRv4
RUNDIR is at /work/noaa/stmp/ruiyusun/ORION/RUNDIRS/HRv4/gfs.2020072500/gfsfcst.2020072500/fcst.665610

This is my first time creating a canned case. Please let me know if anything is missing.

DeniseWorthen commented 6 days ago

Thanks @RuiyuSun. Just to be sure, this is the UFS HRv4 tag (fcc9f84)?

aerorahul commented 6 days ago

On Hercules, I have done my best to set up 2 canned cases, both with APP=S2SW. Both cases use 592 tasks for waves (as opposed to 1000, as is the case in HR4); I used this number based on comments from @DeniseWorthen earlier in this issue. This is not the HR4 tag of the model, but a more recent hash, 6a4e09e9.

Both cases are set to run out to 12 hours on Hercules, in the debug queue to have faster throughput.

The cases have job cards, module files, and environment variables set as they would have been via the workflow. These directories have no dependencies on any space in my area. The cases can be copied and run by editing a few paths in the respective job cards.

C768: /work/noaa/stmp/rmahajan/HERCULES/RUNDIRS/sandbox/c768s2sw
C1152: /work/noaa/stmp/rmahajan/HERCULES/RUNDIRS/sandbox/c1152s2sw

cp -R /work/noaa/stmp/rmahajan/HERCULES/RUNDIRS/sandbox/c1152s2sw ./c1152s2sw.run
cd ./c1152s2sw.run
sbatch c1152s2sw_gfsfcst.sh

This should work as long as the user has access to the fv3-cpu allocation.

My sample run directories are in: /work/noaa/stmp/rmahajan/HERCULES/RUNDIRS/sandbox/RUN

DeniseWorthen commented 6 days ago

@aerorahul Thanks. One question I did have: which options are used to compile? I'm assuming what we are compiling is 32-bit?

-DAPP=S2SW -D32BIT=ON -DCCPP_SUITES=FV3_GFS_v17_coupled_p8_ugwpv1 -DPDLIB=ON
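For reference, a sketch of how those options are typically passed when building the standalone ufs-weather-model with its build.sh, assuming its CMAKE_FLAGS/CCPP_SUITES environment-variable interface; whether the global-workflow's build_all.sh forwards them in exactly this form is an assumption here:

# after loading the machine's Intel module environment from ./modulefiles
cd ufs-weather-model
export CMAKE_FLAGS="-DAPP=S2SW -D32BIT=ON -DPDLIB=ON"
export CCPP_SUITES="FV3_GFS_v17_coupled_p8_ugwpv1"
./build.sh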

RuiyuSun commented 6 days ago

Thanks @RuiyuSun. Just to be sure, this is the UFS HRv4 tag (fcc9f84)?

I just checked; it is the HRv3 tag.

/work/noaa/global/rsun/git/global-workflow.hr.v4/sorc/ufs_model.fd$ git branch

DusanJovic-NOAA commented 6 days ago

I grabbed @aerorahul's c1152s2sw run directory, converted it to traditional threading (in fact I converted the atm to use only one thread), and ran the job script. The atm initialization finished in about 4 minutes, and then the rest of the 30 min wall-clock time was spent in ww3 initialization. I see hundreds if not thousands of lines in log.ww3 like:

*** WAVEWATCH-III WARNING :
     OUTPUT POINT OUT OF GRID :   -161.118    21.337  HNL61
     POINT SKIPPPED

 *** WAVEWATCH-III WARNING :
     OUTPUT POINT OUT OF GRID :   -160.707    22.494  HNL62
     POINT SKIPPPED

 *** WAVEWATCH-III WARNING :
     OUTPUT POINT OUT OF GRID :   -160.295    23.650  HNL63
     POINT SKIPPPED

 *** WAVEWATCH-III WARNING :
     OUTPUT POINT OUT OF GRID :   -159.275    23.198  HNL64
     POINT SKIPPPED

 *** WAVEWATCH-III WARNING :
     OUTPUT POINT OUT OF GRID :   -158.254    22.746  HNL65
     POINT SKIPPPED

 *** WAVEWATCH-III WARNING :
     OUTPUT POINT OUT OF GRID :   -158.759    21.799  HNL66
     POINT SKIPPPED

printed every second or few seconds, very slowly. So I just removed almost all the lines from ww3_shel.inp in order to speed up the ww3 init phase. After that, the model started the run phase and finished a little more than 3 hours of forecast within the 30-min wall-clock limit.

My run directory is /work2/noaa/stmp/djovic/c1152s2sw

@aerorahul the climatological fixed files (specified in input.nml) are still pointing to your directory, so strictly speaking this is not a self-contained (canned case) run directory. If you remove your gwWork directory, people will not be able to run this. Consider moving all the fixed files into, for example, a fix subdirectory and updating the paths in input.nml to have a truly self-contained run directory.
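A hypothetical sketch of that suggestion, run from inside the canned-case directory; the source path and the namelist entries to update depend on the particular case and are placeholders here:

# hypothetical: copy the climatological fixed files into the run directory ...
mkdir -p fix
cp -R /path/to/current/fix/files/* ./fix/     # whatever input.nml currently points to
# ... and repoint input.nml at the local copies
sed -i 's|/path/to/current/fix/files|./fix|g' input.nml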

DusanJovic-NOAA commented 6 days ago

Interestingly, c768 uses more cores than c1152:

$ diff /work/noaa/stmp/rmahajan/HERCULES/RUNDIRS/sandbox/c1152s2sw/ufs.configure /work/noaa/stmp/rmahajan/HERCULES/RUNDIRS/sandbox/c768s2sw/ufs.configure
23c23
< ATM_petlist_bounds:             0 4031
---
> ATM_petlist_bounds:             0 5567
34c34
< OCN_petlist_bounds:             4032 4251
---
> OCN_petlist_bounds:             5568 5787
48c48
< ICE_petlist_bounds:             4252 4371
---
> ICE_petlist_bounds:             5788 5907
64c64
< WAV_petlist_bounds:             4372 4963
---
> WAV_petlist_bounds:             5908 6499

Is this intended?

JessicaMeixner-NOAA commented 6 days ago

@DusanJovic-NOAA - That's a known issue that I'm working to fix in the wave model. We should probably change the buoy list in this test, particularly if you are using the debug queue, as the point initialization in WW3 is known to take a long time.

For @RuiyuSun's test case, here is a new ww3_shel.inp file which should reduce the known initialization issues: /work2/noaa/marine/jmeixner/hercules/NewShel20241116/Ruiyu_New_ww3_shel.inp

For @aerorahul's test case for C1152, a new ww3_shel.inp is here: /work2/noaa/marine/jmeixner/hercules/NewShel20241116/Rahul_C1152_New_ww3_shel.inp
For @aerorahul's test case for C768, a new ww3_shel.inp is here: /work2/noaa/marine/jmeixner/hercules/NewShel20241116/Rahul_C768_New_ww3_shel.inp

@DusanJovic-NOAA - The two different node counts for C768 and C1152 in Rahul's test cases are there because those are the defaults in the g-w. We usually have to change the defaults to get good throughput.

FYI @DeniseWorthen

DeniseWorthen commented 6 days ago

I can confirm that with my DATM-S2SW test, using the points list in Dusan's run directory, I see an initialization time of ~27 minutes:

20111001       0       0 WW3 InitializeRealize time:  1631.157

/work2/noaa/stmp/dworthen/stmp/dworthen/ww3pio/datm.hr4

My previous test had used the points list I got from Jiande, which contains only 611 points vs Dusan's 4264.

I had actually looked into this long point-finding time a bit, because George had mentioned it during the scalability meetings. It looks to me like every DE is searching the global domain to determine whether the global point list is contained w/in the grid. But each DE has a copy of the same ntri triangle list, as well as the global point list. So why does every DE need to do the search? And, for that matter, only the nappnt processor outputs the points, which are all retrieved from global arrays (either global input fields or va). So maybe only nappnt should do the search. I could be wrong; I haven't spent a lot of time on it.

EDIT: Also note that because of the 'negative longitude' problem, WW3 actually only ends up finding ~40% of the available points anyway:

Point output requested for  1624 points

JessicaMeixner-NOAA commented 6 days ago

@DeniseWorthen - We have two issues with the long point initialization: 1. some mismatch between grid and points that is not properly taken care of (https://github.com/NOAA-EMC/WW3/issues/1273), and 2. the search algorithm is not the fastest, and it is even slower when it has to go through every point because of the mismatch issue (https://github.com/NOAA-EMC/WW3/issues/1179). Once issue 1 is solved, if there's still a problem, my plan is to pre-process this part. This is a top priority for me to get fixed ASAP. While the slow initialization needs to be avoided for this debugging work, I believe it's a separate issue, unrelated to the hanging we're seeing on orion/hercules.

Also, not to muddy the waters here, but one of the things I'm trying to work on is running a different grid for the wave model in the g-w and getting everything set up to do some testing. I'm running into issues with the other grid where things are segfaulting in the write grid component. This is a guess, but my guess is that we found magic combinations of settings that worked, and now that we've slightly changed things those combinations no longer work; that is likely related to the hanging issues, but I cannot prove it. I'm trying various combinations of write grid component tasks to see if I can't find something that works. I am doing this today because I hadn't been able to make a successful run post Cactus maintenance (I didn't get enough of a chance to say anything with certainty or rule out user error), but I wanted to get things in before the maintenance on Dogwood in case WCOSS2 runs also became an issue for C1152 post maintenance.

DeniseWorthen commented 6 days ago

@JessicaMeixner-NOAA I agree the "hang" is almost certainly due to the point output search.

For the point output search, is the is_in_ungrid call you reference in your issue (https://github.com/NOAA-EMC/WW3/blob/7705171721e825d58e1e867e552e328fc812bfdd/model/src/w3triamd.F90#L1604) the one that may need to be called only by the nappnt processor in w3init? Each DE has a copy of the ntri array that is being searched.

    IF ( FLOUT(2) ) CALL W3IOPP ( NPT, XPT, YPT, PNAMES, IMOD )
#ifdef W3_PDLIB
    CALL DEALLOCATE_PDLIB_GLOBAL(IMOD)
#endif

EDIT --- oops, I just re-read your post. You don't think the hang is related to the point search. Hmmm... I guess it depends on how long people have waited before declaring the job "hung"?

JessicaMeixner-NOAA commented 6 days ago

@DeniseWorthen - Let me rephrase a little. I agree that the hangs you are seeing are due to the wave initialization, which is in urgent need of a solution. However, we can get around the wave model initialization hang by reducing the number of points and ensuring the points we're looking for are from 0 to 360. If you do that, I still think we're going to get model hangs on orion/hercules (for example, we still have hangs on orion with S2S; we did find a combo that worked for S2S on hercules, but I think we'll still have that hang if we add waves back in with a different point list, like I had in my cases with different sets of points). That's why I want to make sure we update ww3_shel.inp so that the known wave initialization issue is not causing problems.

JessicaMeixner-NOAA commented 5 days ago

Also - an update on my WCOSS2 runs: reducing the total number of write grid component tasks seems to have helped. I'll post more details on Monday.

JessicaMeixner-NOAA commented 3 days ago

Also - an update on my WCOSS2 runs: reducing the total number of write grid component tasks seems to have helped. I'll post more details on Monday.

On WCOSS2 running with a different wave grid, I got a segfault (full log file is on dogwood /lfs/h2/emc/couple/noscrub/jessica.meixner/WaveUglo15km/Test03/COMROOT/Test03/logs/2020021300/gfs_fcst_seg0.log):

zeroing coupling accumulated fields at kdt=           12
nid001107.dogwood.wcoss2.ncep.noaa.gov 0: PASS: fcstRUN phase 2, n_atmsteps =               11 time is         0.792372
nid001652.dogwood.wcoss2.ncep.noaa.gov 9216:   d3d_on= F
nid001652.dogwood.wcoss2.ncep.noaa.gov: rank 9280 died from signal 9
nid001553.dogwood.wcoss2.ncep.noaa.gov 2056: forrtl: error (78): process killed (SIGTERM)

Rank 9280 is a write grid component task, as ATM_petlist_bounds: 0 10175, ATM_omp_num_threads: 4, and layout = 24,16.

For this run I had: write_groups: 4, write_tasks_per_group: 60.

Changing this to: write_groups: 2, write_tasks_per_group: 120.

The successful log file is here: /lfs/h2/emc/couple/noscrub/jessica.meixner/WaveUglo15km/Test04/COMROOT/Test04/logs/2020021300/gfs_fcst_seg0.log

I suspect the issues we see on WCOSS2 are similar to what we've seen on hercules/orion but manifesting in segfaults versus hanging, but I could be wrong.