ian-bertolacci opened 5 years ago
Attempted implementation: create GrGeomInLoopParallel, which enables parallelism inside a new GrGeomOctreeInteriorNodeLoopParallel that parallelizes the 3D loop nests using Kokkos::OpenMP.
Performance: not good.
baseline_72hrx3hr_1de045: unmodified ParFlow at commit 1de045
Time Class | Runtime (seconds) ± 1 standard deviation |
---|---|
real | 332.3808 ± 1.7961 |
user | 331.4917 ± 1.7961 |
sys | 0.7596 ± 0.0166 |
split_loop_72hrx3hr_f2f7d6: ParFlow modified to split the NFE loop 551 into separate forall and reduction loops (see pull #13); a small performance degradation is already known in this implementation.
Time Class | Runtime (seconds) ± 1 standard deviation |
---|---|
real | 334.5446 ± 1.3524 |
user | 333.6424 ± 1.3506 |
sys | 0.7657 ± 0.0225 |
basic_kokkos_omp_72hrx3hr_fe7c39: experimental Kokkos implementation using the Kokkos::OpenMP execution space
Time Class | Runtime (seconds) ± 1 standard deviation |
---|---|
real | 368.4230 ± 1.1382 |
user | 367.9963 ± 1.1425 |
sys | 0.3733 ± 0.0362 |
basic_kokkos_serial_72hrx3hr_based_fe7c39: sanity-check implementation using the Kokkos::Serial execution space
Time Class | Runtime (seconds) ± 1 standard deviation |
---|---|
real | 367.7456 ± 1.4505 |
user | 367.4793 ± 1.4537 |
sys | 0.2154 ± 0.0207 |
A non-statistical scaling experiment shows a negative scaling pattern. Analysis of individual loop runtimes on a different machine (my laptop) indicates a 10x slowdown in some loops.
Likely cause: parallelism within boxes is too fine-grained. The boxes are small, leaving too little work for too many threads.
Tasks:
Attempted implementation: using Michael's OpenMP pragma-in-macros solution (a true blessing), do the same as before (parallelize the 3D loop nest) using native OpenMP.
Performance: again, not great.
baseline_72hrx3hr_1de045:
Time Class | Runtime (seconds) ± 1 standard deviation |
---|---|
real | 332.3808 ± 1.7961 |
user | 331.4917 ± 1.7961 |
sys | 0.7596 ± 0.0166 |
split_loop_72hrx3hr_f2f7d6:
Time Class | Runtime (seconds) ± 1 standard deviation |
---|---|
real | 334.5446 ± 1.3524 |
user | 333.6424 ± 1.3506 |
sys | 0.7657 ± 0.0225 |
simple_omp_threads-1_72hrx3hr_48f278: OMP_NUM_THREADS=1
Time Class | Runtime (seconds) ± 1 standard deviation |
---|---|
real | 343.8862 ± 0.9246 |
user | 343.4594 ± 0.9295 |
sys | 0.3714 ± 0.0357 |
simple_omp_threads-2_72hrx3hr_48f278: OMP_NUM_THREADS=2
Time Class | Runtime (seconds) ± 1 standard deviation |
---|---|
real | 347.6891 ± 0.9961 |
user | 345.0150 ± 0.9959 |
sys | 2.6195 ± 0.1100 |
simple_omp_threads-4_72hrx3hr_48f278: OMP_NUM_THREADS=4
Time Class | Runtime (seconds) ± 1 standard deviation |
---|---|
real | 353.5161 ± 1.0830 |
user | 348.1278 ± 1.0844 |
sys | 5.3322 ± 0.1635 |
simple_omp_threads-8_72hrx3hr_48f278: OMP_NUM_THREADS=8
Time Class | Runtime (seconds) ± 1 standard deviation |
---|---|
real | 362.1221 ± 0.9454 |
user | 350.9784 ± 1.0036 |
sys | 11.0904 ± 0.1666 |
Exhibits negative scaling. System-call overhead also becomes very large, with sys time growing from roughly 0.37 s at one thread to over 11 s at eight.
Likely cause: again, parallelism within boxes is too fine-grained; the boxes are small, leaving too little work for too many threads.
Tasks:
Get a very simple parallel ParFlow implementation using some form of OpenMP.