hydroframe / ParFlow_PerfTeam

Parflow is an open-source parallel watershed flow model.
http://inside.mines.edu/~rmaxwell/maxwell_software.shtml

Simply Parallel Parflow implementation #43

Open ian-bertolacci opened 5 years ago

ian-bertolacci commented 5 years ago

Get a very simple parallel Parflow implementation using some form of OpenMP.

ian-bertolacci commented 5 years ago

Attempted implementation: Created GrGeomInLoopParallel, with parallelism enabled inside a new GrGeomOctreeInteriorNodeLoopParallel, which parallelizes the 3D box loops using Kokkos::OpenMP.
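For reference, a minimal, self-contained sketch of what "parallelize the 3D box loops with Kokkos::OpenMP" means in isolation. The function name, box extents, and loop body are hypothetical stand-ins; the real GrGeomOctreeInteriorNodeLoopParallel expands inside ParFlow's octree traversal rather than over a standalone array.

```c++
#include <Kokkos_Core.hpp>
#include <vector>

// Hypothetical stand-in for the work that the new
// GrGeomOctreeInteriorNodeLoopParallel would run over one box;
// the real ParFlow loop body is more involved.
void box_interior_loop(double* data, int nx, int ny, int nz)
{
  using ExecSpace = Kokkos::OpenMP;

  // One MDRangePolicy dispatch per box: every (k, j, i) cell of the
  // box interior becomes a unit of work for the OpenMP threads.
  Kokkos::parallel_for(
      "box_interior_loop",
      Kokkos::MDRangePolicy<ExecSpace, Kokkos::Rank<3>>({0, 0, 0},
                                                        {nz, ny, nx}),
      KOKKOS_LAMBDA(const int k, const int j, const int i) {
        const int idx = (k * ny + j) * nx + i;
        data[idx] += 1.0;  // placeholder for the real stencil work
      });
}

int main(int argc, char* argv[])
{
  Kokkos::initialize(argc, argv);
  {
    // Boxes this small are exactly the fine-grain problem discussed
    // below: very little work per parallel dispatch.
    const int nx = 16, ny = 16, nz = 16;
    std::vector<double> data(nx * ny * nz, 0.0);
    box_interior_loop(data.data(), nx, ny, nz);
    Kokkos::fence();
  }
  Kokkos::finalize();
  return 0;
}
```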

Performance: Not good.

baseline_72hrx3hr_1de045: Unmodified Parflow at commit 1de045

| Time Class | Runtime (seconds) ± 1 standard deviation |
| --- | --- |
| real | 332.3808 ± 1.7961 |
| user | 331.4917 ± 1.7961 |
| sys | 0.7596 ± 0.0166 |

split_loop_72hrx3hr_f2f7d6: Parflow modified to split NFE loop 551 into forall and reduction loops (see pull #13). (A small performance degradation was already known in this implementation.)

| Time Class | Runtime (seconds) ± 1 standard deviation |
| --- | --- |
| real | 334.5446 ± 1.3524 |
| user | 333.6424 ± 1.3506 |
| sys | 0.7657 ± 0.0225 |

basic_kokkos_omp_72hrx3hr_fe7c39: Experimental Kokkos implementation using the Kokkos::OpenMP execution space

| Time Class | Runtime (seconds) ± 1 standard deviation |
| --- | --- |
| real | 368.4230 ± 1.1382 |
| user | 367.9963 ± 1.1425 |
| sys | 0.3733 ± 0.0362 |

basic_kokkos_serial_72hrx3hr_based_fe7c39: Sanity-check implementation using Kokkos::Serial

| Time Class | Runtime (seconds) ± 1 standard deviation |
| --- | --- |
| real | 367.7456 ± 1.4505 |
| user | 367.4793 ± 1.4537 |
| sys | 0.2154 ± 0.0207 |

A non-statistical scaling experiment shows a negative scaling pattern. Analysis of individual loop runtimes on a different machine (my laptop) indicates a 10x slowdown in some loops.

Likely cause: parallelism within boxes is too fine-grained. The boxes are small, so there is too little work for too many threads.

Tasks:

ian-bertolacci commented 5 years ago

Attempted implementation: Using Michael's OpenMP pragma-in-macros solution (a true blessing), do the same as before (parallelize the 3D loop nest) using native OpenMP.
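For reference, the general shape of the pragma-in-macros approach (a hedged sketch, not the actual macro from the patch): `#pragma` cannot appear inside a `#define`, so the standard `_Pragma` operator is used instead, letting a BoxLoop-style macro carry an OpenMP `parallel for` over the 3D nest. The macro and function names here are illustrative.

```c++
#include <omp.h>

// Sketch of the pragma-in-macro idea: #pragma cannot appear inside a
// #define, but the C99/C++11 _Pragma operator can. The macro name and
// arguments are illustrative, not ParFlow's actual macro.
#define PARALLEL_BOX_LOOP(i, j, k, nx, ny, nz, body)   \
  _Pragma("omp parallel for collapse(3)")              \
  for (int k = 0; k < (nz); ++k)                       \
    for (int j = 0; j < (ny); ++j)                     \
      for (int i = 0; i < (nx); ++i) {                 \
        body;                                          \
      }

void example_box_loop(double* data, int nx, int ny, int nz)
{
  // The loop body is passed in as a macro argument, the same way the
  // existing BoxLoop-style macros receive their bodies.
  PARALLEL_BOX_LOOP(i, j, k, nx, ny, nz, {
    data[(k * ny + j) * nx + i] += 1.0;  // placeholder loop body
  });
}
```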

Performance: Again, not great.

baseline_72hrx3hr_1de045:

| Time Class | Runtime (seconds) ± 1 standard deviation |
| --- | --- |
| real | 332.3808 ± 1.7961 |
| user | 331.4917 ± 1.7961 |
| sys | 0.7596 ± 0.0166 |

split_loop_72hrx3hr_f2f7d6:

| Time Class | Runtime (seconds) ± 1 standard deviation |
| --- | --- |
| real | 334.5446 ± 1.3524 |
| user | 333.6424 ± 1.3506 |
| sys | 0.7657 ± 0.0225 |

simple_omp_threads-1_72hrx3hr_48f278: OMP_NUM_THREADS=1

| Time Class | Runtime (seconds) ± 1 standard deviation |
| --- | --- |
| real | 343.8862 ± 0.9246 |
| user | 343.4594 ± 0.9295 |
| sys | 0.3714 ± 0.0357 |

simple_omp_threads-2_72hrx3hr_48f278: OMP_NUM_THREADS=2

| Time Class | Runtime (seconds) ± 1 standard deviation |
| --- | --- |
| real | 347.6891 ± 0.9961 |
| user | 345.0150 ± 0.9959 |
| sys | 2.6195 ± 0.1100 |

simple_omp_threads-4_72hrx3hr_48f278: OMP_NUM_THREADS=4

| Time Class | Runtime (seconds) ± 1 standard deviation |
| --- | --- |
| real | 353.5161 ± 1.0830 |
| user | 348.1278 ± 1.0844 |
| sys | 5.3322 ± 0.1635 |

simple_omp_threads-8_72hrx3hr_48f278: OMP_NUM_THREADS=8

| Time Class | Runtime (seconds) ± 1 standard deviation |
| --- | --- |
| real | 362.1221 ± 0.9454 |
| user | 350.9784 ± 1.0036 |
| sys | 11.0904 ± 0.1666 |

Exhibits negative scaling. The system-call overhead also becomes very large as threads are added (sys time grows from ~0.37 s at 1 thread to ~11.1 s at 8 threads).

Likely cause: again, parallelism within boxes is too fine-grained. The boxes are small, so there is too little work for too many threads.
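For what it's worth, a hedged sketch of one way the granularity could be controlled (not something attempted in this issue): OpenMP's `if()` clause can fall back to serial execution when a box interior has too few cells to amortize the fork/join cost. The threshold value is an illustrative guess, not a tuned ParFlow number.

```c++
#include <omp.h>

// Illustrative cutoff: below this many cells, run the box serially.
constexpr long PARALLEL_THRESHOLD = 32 * 1024;

void box_loop(double* data, int nx, int ny, int nz)
{
  const long cells = static_cast<long>(nx) * ny * nz;

  // The if() clause makes OpenMP execute the loop serially when the
  // box is too small for the extra threads to pay for themselves.
  #pragma omp parallel for collapse(3) if(cells >= PARALLEL_THRESHOLD)
  for (int k = 0; k < nz; ++k)
    for (int j = 0; j < ny; ++j)
      for (int i = 0; i < nx; ++i) {
        const long idx = (static_cast<long>(k) * ny + j) * nx + i;
        data[idx] += 1.0;  // placeholder for the real loop body
      }
}
```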

Tasks: