ufs-community / UFS_UTILS

Utilities for the NCEP models.

Enable OpenMP threading to speed up orog_gsl and filter_topo programs #939

Closed BinLiu-NOAA closed 4 months ago

BinLiu-NOAA commented 5 months ago

Unlike the orog.fd program, the orog_gsl.fd and filter_topo.fd source codes do not currently support OpenMP threading. These steps are very slow when processing large, high-resolution domains.

For example, the HAFS application needs to generate ~1.8-km resolution orog data covering a large ~100x85 degree domain for every forecast cycle, since HAFS currently uses a storm-centric moving-nest configuration. In this case, the filter_topo step takes ~12 minutes and the orog_gsl step takes ~22 minutes.

If possible, it would be beneficial to add omp directives and enable OpenMP threading to speed up these two programs.

Thanks!
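
For reference, the usual pattern for this kind of change (a minimal, hypothetical sketch, not the actual UFS_UTILS code) is to add !$omp directives around the expensive loops and use the !$ conditional-compilation sentinel so the programs still build and run serially when OpenMP is disabled:

    ! Hypothetical illustration of the OpenMP conditional-compilation pattern;
    ! the "!$" lines are compiled only when the compiler's OpenMP flag is on.
    program report_threads
    !$ use omp_lib, only : omp_get_max_threads
      implicit none
      integer :: nthreads
      nthreads = 1                      ! a serial build falls back to one thread
    !$ nthreads = omp_get_max_threads()
      print*, '- Running with ', nthreads, ' thread(s).'
    end program report_threads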

GeorgeGayno-NOAA commented 5 months ago

@mdtoyNOAA - Any idea which parts of orog_gsl could benefit from threading?

GeorgeGayno-NOAA commented 5 months ago

@BinLiu-NOAA - what are the grid specs for the HAFS application? Do you have a work directory where I can capture a sample case?

BinLiu-NOAA commented 5 months ago

@GeorgeGayno-NOAA I saved the configuration, results, and log file from an example HAFS atm_prep_mvnest job (including the regional ESG grid creation, orog, orog_gsl, topo_filter, and sfc_climo steps) here on WCOSS2: /lfs/h2/emc/hur/noscrub/bin.liu/save/ufs_utils_orog_omp_speedup Thanks!

mdtoyNOAA commented 5 months ago

@GeorgeGayno-NOAA I'm not a threading expert, but I think the main loop, i.e.,

 do j = 1,dimY_FV3
   do i = 1,dimX_FV3
   ...
   end do
 end do

which is in both module_gsl_oro_data_lg_scale.f90 and module_gsl_oro_data_sm_scale.f90, could be threaded.
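
For illustration only, an OpenMP parallel-do over that loop might look roughly like the sketch below. This is a hypothetical, self-contained example: the array, the dimX_FV3/dimY_FV3 values, and the loop body are placeholders, and the real loops in module_gsl_oro_data_sm_scale.f90 / module_gsl_oro_data_lg_scale.f90 would also need their work variables declared private.

    ! Hypothetical sketch of threading the main (i,j) loop; the dimensions and
    ! loop body are stand-ins, not the real orog_gsl computation.
    program thread_main_loop
      implicit none
      integer, parameter :: dimX_FV3 = 1011, dimY_FV3 = 810
      real :: oro_stat(dimX_FV3, dimY_FV3)
      integer :: i, j

    !$omp parallel do private(i, j)
      do j = 1, dimY_FV3
        do i = 1, dimX_FV3
          ! each target grid cell is assumed independent of the others
          oro_stat(i, j) = real(i + j)   ! stand-in for the per-cell work
        end do
      end do
    !$omp end parallel do

      print*, 'max value: ', maxval(oro_stat)
    end program thread_main_loop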

GeorgeGayno-NOAA commented 5 months ago

The log file provided by @BinLiu-NOAA shows the following "regional_esg" grid setup:

GeorgeGayno-NOAA commented 5 months ago

For a quicker test of the filter_topo code, I created a smaller regional grid (1711 x 1510). The data and scripts to run it are on Hercules: /work2/noaa/da/ggayno/save/filter.omp

Some timing code was added to filter_topo at 21b68f7. Running the test showed that most of the time is spent in the read_grid_file routine.

timing compute_filter_constants   1.000000003841706E-003
timing read_grid_file    93.0559999999969
timing read_topo_file   3.800000000046566E-002
timing FV3_zs_filter    1.07500000000437
timing write_topo_file   2.399999999761349E-002
timing total    94.1960000000036
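
For context, timings like the ones above can be collected with a few lines of instrumentation around each step. Below is a minimal, hypothetical sketch of that pattern (the code actually added at 21b68f7 may differ); fake_read_grid_file is a stand-in workload, not the real routine.

    ! Hypothetical sketch of wall-clock timing wrapped around a step;
    ! system_clock measures elapsed (wall) time, which is what matters for threading.
    program time_steps
      implicit none
      integer(8) :: t0, t1, rate
      real :: work

      call system_clock(t0, count_rate=rate)
      call fake_read_grid_file(work)
      call system_clock(t1)
      print*, 'timing read_grid_file ', dble(t1 - t0) / dble(rate)

    contains
      subroutine fake_read_grid_file(x)   ! placeholder workload
        real, intent(out) :: x
        integer :: k
        x = 0.0
        do k = 1, 50000000
          x = x + sqrt(real(k))
        end do
      end subroutine fake_read_grid_file
    end program time_steps
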
GeorgeGayno-NOAA commented 4 months ago

The parallelization of the read_grid_file routine of the filter_topo was completed at 15d7f65. A previous test using a 1711 x 1510 regional grid showed that nearly all the time is spent in that routine.

The same test was rerun twice (on Hercules), with 1 and 6 threads. The script used was: /work2/noaa/da/ggayno/save/filter.omp/run.sh

Running with one thread took 109 seconds. Using 6 threads took 20 seconds. Very scalable.

1 thread

 timing read_grid_file    108.677000000003
 timing total    109.925999999999

6 threads

 timing read_grid_file    18.7339999999967
 timing total    19.9030000000057
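
Working out the implied efficiency from those two runs (simple arithmetic on the numbers above):

\[
S_6 = \frac{T_1}{T_6} = \frac{109.9}{19.9} \approx 5.5, \qquad E_6 = \frac{S_6}{6} \approx 0.92,
\]

and solving Amdahl's law \( S_p = 1/\bigl(f_s + (1-f_s)/p\bigr) \) for the serial fraction gives \( f_s \approx 0.02 \), i.e. only about 2% of the runtime remains outside the threaded region.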

The output orography files from both tests were bit identical as expected.

One thing I can't explain: the serial version in 'develop' runs in 94 seconds, while running with one thread takes 109 seconds. Will investigate.

GeorgeGayno-NOAA commented 4 months ago

The grid_gen regression tests were run on Hercules using 15d7f65. The tests used 24 threads. The baseline files were created using the serial version of filter_topo. All tests passed as expected.

GeorgeGayno-NOAA commented 4 months ago

@BinLiu-NOAA - would you like to try the threaded filter_topo code? It scales very well. Use my branch at 15d7f65.

GeorgeGayno-NOAA commented 4 months ago

A timing test using a C768 uniform grid was done on Hercules using 15d7f65 and this script: /work2/noaa/da/ggayno/save/filter.omp/run.768.sh

Using 1 thread took 150 seconds. Using 6 threads took 26.5 seconds. Very scalable. The orog files from each test were bit identical.

BinLiu-NOAA commented 4 months ago

> A timing test using a C768 uniform grid was done on Hercules using 15d7f65 and this script: /work2/noaa/da/ggayno/save/filter.omp/run.768.sh
>
> Using 1 thread took 150 seconds. Using 6 threads took 26.5 seconds. Very scalable. The orog files from each test were bit identical.

@GeorgeGayno-NOAA Great to know. We will try it on our end and get back to you. Thanks!

BinLiu-NOAA commented 4 months ago

> @BinLiu-NOAA - would you like to try the threaded filter_topo code? It scales very well. Use my branch at 15d7f65.

@GeorgeGayno-NOAA For the test from the HAFS side, using the new filter_topo code with 20 OpenMP threads on WCOSS2, the run time for the filter_topo step dropped from the original ~690 s to ~60 s. That is more than a 10x speed-up, which is fantastic!

Here is the related timing printout from the new run:

 timing read_grid_file    53.9399999999996
 timing read_topo_file    1.04899999999998
 Before filter: Max_slope=   1.29018118104086
 After filter: Max_slope=  0.578221088257420
 Before filter: Max_slope=  0.541956742513021
 After filter: Max_slope=  0.571173665748806
 timing FV3_zs_filter    4.02199999999993
 timing write_topo_file   0.253999999999905
 timing total    59.2659999999996

Thanks @GeorgeGayno-NOAA!

GeorgeGayno-NOAA commented 4 months ago

> > @BinLiu-NOAA - would you like to try the threaded filter_topo code? It scales very well. Use my branch at 15d7f65.
>
> @GeorgeGayno-NOAA For the test from the HAFS side, using the new filter_topo code with 20 OpenMP threads on WCOSS2, the run time for the filter_topo step dropped from the original ~690 s to ~60 s. That is more than a 10x speed-up, which is fantastic!
>
> Here is the related timing printout from the new run: timing read_grid_file 53.9399999999996 timing read_topo_file 1.04899999999998 Before filter: Max_slope= 1.29018118104086 After filter: Max_slope= 0.578221088257420 Before filter: Max_slope= 0.541956742513021 After filter: Max_slope= 0.571173665748806 timing FV3_zs_filter 4.02199999999993 timing write_topo_file 0.253999999999905 timing total 59.2659999999996
>
> Thanks @GeorgeGayno-NOAA!

@BinLiu-NOAA - glad to hear it is working for you. I have tried three tests on Hercules - a regional case, a global C768 and a global C1152. I get nearly perfect scalability up to 18 threads. You are getting good results, but I would have expected better.

GeorgeGayno-NOAA commented 4 months ago

@BinLiu-NOAA - I also threaded the orog_gsl program. Try it at 1f9cd42.

GeorgeGayno-NOAA commented 4 months ago

Tested the threaded orog_gsl program using a regional (1011 x 810) grid on Hercules using the scripts in /work2/noaa/da/ggayno/save/orog_gsl.omp:

  • run.dev.sh - Runs the program using the serial version in the 'develop' branch.
  • run.sh - Runs the parallel version of the program from the branch using 1, 5, and 12 threads. The output files are compared to those produced by 'develop' using the 'cmp' command.

The branch tests successfully reproduced the files from the 'develop' branch.

Most of the CPU time is spent in the main loops of calc_gsl_oro_data_sm_scale and calc_gsl_oro_data_lg_scale. The program scales well (times in seconds):

One thread
 timing of main loop in calc_gsl_oro_data_sm_scale    55.7529999999970
 timing of main loop in calc_gsl_oro_data_lg_scale    11.7720000000045

5 threads
 timing of main loop in calc_gsl_oro_data_sm_scale    12.6569999999992
 timing of main loop in calc_gsl_oro_data_lg_scale    2.64899999999761

12 threads
 timing of main loop in calc_gsl_oro_data_sm_scale    5.31500000000233
 timing of main loop in calc_gsl_oro_data_lg_scale    1.11299999999756
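
Working the same arithmetic on these numbers, both loops hold roughly 87-89% parallel efficiency:

\[
\frac{55.75}{12.66} \approx 4.4 \;(\approx 88\% \text{ of ideal on 5 threads}), \qquad \frac{55.75}{5.32} \approx 10.5 \;(\approx 87\% \text{ on 12 threads}),
\]

with nearly identical ratios for the lg_scale loop.
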
BinLiu-NOAA commented 4 months ago

@GeorgeGayno-NOAA, great to know that the orog_gsl program/step is now threading-enabled. We will also test it with the HAFS case and report the timing in a follow-up update. Thanks!

BinLiu-NOAA commented 4 months ago

@GeorgeGayno-NOAA Just a quick follow-up: my HAFS-side test (with your commit at https://github.com/ufs-community/UFS_UTILS/commit/1f9cd420053791be769d474b4ce414b150d36e8b) showed the orog_gsl step's wall-clock time reduced from 1132 s to 86 s with 20 OpenMP threads. Again, a very impressive ~13x speed-up.

BinLiu-NOAA commented 4 months ago

@GeorgeGayno-NOAA, one more item, as I mentioned to you earlier today: the original orog program/step already supports OpenMP threading, yet with 20 threads it still takes ~267 s for the same HAFS cycle/domain (much slower than the sped-up orog_gsl program, which now takes only ~86 s). So, if possible, we would appreciate your help in revisiting the original orog program/step to see whether further OpenMP optimization is feasible. Thanks a lot!

GeorgeGayno-NOAA commented 4 months ago

> @GeorgeGayno-NOAA Just a quick follow-up: my HAFS-side test (with your commit at 1f9cd42) showed the orog_gsl step's wall-clock time reduced from 1132 s to 86 s with 20 OpenMP threads. Again, a very impressive ~13x speed-up.

That's great. I was getting better scaling, but maybe that was a function of the test cases I used. Will the speed-up you are getting satisfy your OPS timelines?

GeorgeGayno-NOAA commented 4 months ago

> @GeorgeGayno-NOAA, one more item, as I mentioned to you earlier today: the original orog program/step already supports OpenMP threading, yet with 20 threads it still takes ~267 s for the same HAFS cycle/domain (much slower than the sped-up orog_gsl program, which now takes only ~86 s). So, if possible, we would appreciate your help in revisiting the original orog program/step to see whether further OpenMP optimization is feasible. Thanks a lot!

I can take a look at the orog.fd code. But I would like to do that under another issue. I will open one.

BinLiu-NOAA commented 4 months ago

> > @GeorgeGayno-NOAA Just a quick follow-up: my HAFS-side test (with your commit at 1f9cd42) showed the orog_gsl step's wall-clock time reduced from 1132 s to 86 s with 20 OpenMP threads. Again, a very impressive ~13x speed-up.
>
> That's great. I was getting better scaling, but maybe that was a function of the test cases I used. Will the speed-up you are getting satisfy your OPS timelines?

@GeorgeGayno-NOAA I think the 11x-13x speed-up with 20 OpenMP threads should be good enough for the HAFS application (at least with the current HAFS domain/resolution configurations). Thanks a lot for working on this and enabling threading support for the filter_topo and orog_gsl programs/steps! Much appreciated!

GeorgeGayno-NOAA commented 4 months ago

> > @GeorgeGayno-NOAA, one more item, as I mentioned to you earlier today: the original orog program/step already supports OpenMP threading, yet with 20 threads it still takes ~267 s for the same HAFS cycle/domain (much slower than the sped-up orog_gsl program, which now takes only ~86 s). So, if possible, we would appreciate your help in revisiting the original orog program/step to see whether further OpenMP optimization is feasible. Thanks a lot!
>
> I can take a look at the orog.fd code. But I would like to do that under another issue. I will open one.

Will optimize orog.fd under #947.