ufs-community / ufs-weather-model


Scalability of the Coupled Model #1367

Open DeniseWorthen opened 2 years ago

DeniseWorthen commented 2 years ago

This EPIC covers the scalability issues and solutions in the coupled P8 runs. It includes the following tasks:

1) Create a scalability profile for each component used in the coupled P8 runs in standalone mode and identify scalability issues.

2) Identify scalability issues in coupled mode.

3) Identify scalability issues in high resolution coupled runs (e.g. C768mx025).

DeniseWorthen commented 2 years ago

MOM6: @jiandewang
1) Changing the IO layout to (4,2) instead of (1,1) in the test case resulted in a ~4% speedup (for both history and restart files).
2) MOM6 can read the additional restart files, but this will require changes downstream. A combining utility is available for history and restart files; combining history files would need to be implemented in the post workflow.
3) Land block elimination has been tested in MOM6 standalone mode but not in the coupled model. A small fix file needs to be generated to specify the land domain.
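
For reference, the IO-layout change amounts to a MOM_override entry along these lines (IO_LAYOUT is the standard MOM6 parameter; the value shown is the one tested above, not a tuned recommendation):

```
! MOM_override (sketch)
IO_LAYOUT = 4, 2   ! write history/restart files through a 4x2 set of IO PEs instead of a single one
```

The resulting per-PE files can then be stitched together offline, e.g. with FMS's mppnccombine, until combining is added to the post workflow.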

CICE6: @DeniseWorthen No scalability analysis has been performed. However, CICE6 is cheap to run and is unlikely to impact coupled model performance significantly.

WW3: George Vandenberg and Matt Masarik have used gprof to identify bottlenecks in WW3. The subroutine init_get_jsea_isproc has been identified as an issue. Solutions include removing a call which uses the _isproc routine (and which scales as num_sea_points * (num_PEs)^2) and inlining it at other locations. The inlining impact is not large; now testing either removing the w3nmin call or OpenMP-threading it.
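
For context, a hypothetical sketch (not the actual WW3 source; routine and variable names are made up) of the cost pattern described above and of the quick OpenMP option being tested: every rank counts the sea points owned by every rank, which is O(NSEA x NAPROC) work per rank and O(NSEA x NAPROC^2) in total.

```fortran
! Illustrative only: counts how many sea points each rank owns by scanning
! all sea points once per rank, i.e. the O(NSEA*NAPROC) pattern per task.
subroutine count_points_per_rank(nsea, naproc, iaproc_of, nmin)
  implicit none
  integer, intent(in)  :: nsea, naproc
  integer, intent(in)  :: iaproc_of(nsea)   ! owning rank of each sea point
  integer, intent(out) :: nmin(naproc)      ! number of points owned by each rank
  integer :: ip, jsea

  nmin = 0
  ! Quick-fix option: thread the outer loop over ranks; each thread writes a
  ! disjoint slice of nmin, so no reduction clause is needed.
  !$omp parallel do private(jsea) schedule(static)
  do ip = 1, naproc
     do jsea = 1, nsea
        if (iaproc_of(jsea) == ip) nmin(ip) = nmin(ip) + 1
     end do
  end do
  !$omp end parallel do
end subroutine count_points_per_rank
```

Threading hides the cost but does not remove the quadratic scaling across ranks, which is presumably why removing or replacing the call entirely is also being tested.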

GOCART: @bbakernoaa reports that GOCART currently has no threading capability; adding OMP calls to the NUOPC cap is being examined. There are also downstream issues in UPP (problems computing all the diagnostic fields in UPP).

GOCART and FV3 share nodes because of shared-memory considerations and to avoid additional communication. ESMF-managed threading can run different threading levels on the same nodes (DE-sharing). NASA reports that MPI scaling is good for GOCART, so they have not examined threading options. DE-sharing may be less invasive and would leverage GOCART's existing MPI scalability.
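
For reference, ESMF-managed threading with DE-sharing is driven from nems.configure along these lines (attribute names follow the existing UFS templates; the PET ranges and thread counts below are purely illustrative):

```
# nems.configure (sketch): ATM and CHM (GOCART) share the same PETs,
# but can request different OpenMP thread counts under ESMF-managed threading.
ATM_model:            fv3
ATM_petlist_bounds:   0 287
ATM_omp_num_threads:  4

CHM_model:            gocart
CHM_petlist_bounds:   0 287
CHM_omp_num_threads:  1
```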

ATM: George will look at ATM after finishing w/ WW3.

jiandewang commented 2 years ago

Testing of the land block elimination approach in UFS doesn't work. I suspect the mesh creation in the cap requires information for all subdomains. Error information can be found in the ocean PET files, for example in PET503.ESMF_LogFile: MeshCap::meshcreateredistelems() Internal error

run directory can be found at /scratch1/NCEPDEV/climate/Jiande.Wang/working/MOM6-scalability/UFS-land-mask/T1

My testing is based on the latest UFS (hash 5477338bf), using cpld_bmark_p8 as a template. I modified nems.configure for the ocean PE count and added "mask_table.8.10x12" inside the INPUT directory. MOM_override is set up with LAYOUT=10,12 and MASKTABLE=mask_table.8.10x12 (120 - 8 = 112 active ocean PEs); the mask-table layout is sketched below.
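
For anyone reproducing this: in the usual FMS/MOM6 mask-table convention the filename encodes the number of eliminated blocks and the layout, and the file itself looks roughly like the sketch below (first line: number of eliminated land-only blocks; second line: the layout; then one i,j pair per eliminated block; the 2,3 entry is a made-up example):

```
8
10, 12
2, 3
...
```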

DeniseWorthen commented 2 years ago

@jiandewang The CICE cap has the added capability for land block elimination, but the MOM6 cap does not; that will need to be added. How do you set up the mask_table? Can you make one for easy testing with the mx100 ocean?

jiandewang commented 2 years ago

@DeniseWorthen My previous test is based on mx025. I will make an mx100 one for you, along with a README file for setup, etc.

DeniseWorthen commented 2 years ago

cice_perf_cesm_craig_2012.pdf

jiandewang commented 2 years ago

@DeniseWorthen The 1x1-degree ocean (mx100) uses only 20 PEs, so every PE contains ocean points for whatever X-Y layout I tried. Instead, I have a sample mx05 case for you at /scratch1/NCEPDEV/climate/Jiande.Wang/working/MOM6-scalability/mask-PE/05. Inside check_mask there is a generate-mask-table.sh script used to generate the PE mask table; you can see the usage in its comment lines. cpld_control_c192_p8 is the run directory I tried.

DeniseWorthen commented 2 years ago

@jiandewang reports not much gain for C384 with land block elimination. Someone will need to add the feature to the MOM6 NUOPC cap if it is required.

NetCDF compression is not supported in the current FMS2 IO code. GFDL says this option can be turned on but may require a new release.

Matt will have a PR to remove the bottleneck identified via gprof. Three remaining routines account for the bulk of the impact. A first attempt at getting performance gains with OMP gave mixed results; the quick fixes have most likely been exhausted.

DeniseWorthen commented 2 years ago

Denise can talk to Tony about presenting CICE6 results at the Scalability meeting.

DeniseWorthen commented 2 years ago

Gerhard reminds us that MOM6 output is synchronous (written on the forecast tasks), so when MOM6 writes, it holds the system up. Otherwise the MOM6 forecast within the inner time loop normally finishes early enough that it is already done when the next coupling update is needed.

The suggestion is that the valid metric is not the overall run-time impact, but how much IO costs when it happens, relative to a normal cycle (one without IO).
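
One way to express that metric (my phrasing, not from the discussion): if t_write is the wall time of a coupling cycle in which MOM6 writes output and t_0 is the wall time of a cycle without IO, the quantity of interest is (t_write - t_0) / t_0, i.e. the relative slowdown of the cycles that actually do IO, rather than the change in total run time.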

DeniseWorthen commented 2 years ago

Ali provided 10-min and 15-min unstructured meshes for testing the scalability of the coupled model with an unstructured mesh.

junwang-noaa commented 1 year ago

This is an ongoing task. Progress can be tracked in the GFSv17 high-res Google sheet:

https://docs.google.com/spreadsheets/d/1-plAZ7h7iLoCzOH9rkjklKmeN42dE-2-1mdCLugk4xI/edit#gid=1272699869

GFSv17S2S HR1 and HR2 scalability analysis has been conducted. We will look into HR3 when it becomes available.

DeniseWorthen commented 10 months ago

I've created a feature branch to implement the same timer-logging feature in WW3 as was done for CICE. It includes the ability to override the time steps in the mod_def file via configuration variables, which allows the same mod_def file to be used for either inner-loop or outer-loop coupling. https://github.com/DeniseWorthen/WW3/tree/feature/logtimer-nosync