ian-bertolacci opened 5 years ago
Attempted implementation: create GrGeomInLoopParallel, which enables parallelism inside a new GrGeomOctreeInteriorNodeLoopParallel that parallelizes the 3D loop nests using Kokkos::OpenMP.
Performance: not good.
baseline_72hrx3hr_1de045: unmodified ParFlow at commit 1de045
Time Class | Runtime (seconds) ± 1 standard deviation |
---|---|
real | 332.3808 ± 1.7961 |
user | 331.4917 ± 1.7961 |
sys | 0.7596 ± 0.0166 |
split_loop_72hrx3hr_f2f7d6: ParFlow modified to split the NFE loop 551 into separate forall and reduction loops (see pull #13); a small performance degradation is already known in this implementation.
Time Class | Runtime (seconds) ± 1 standard deviation |
---|---|
real | 334.5446 ± 1.3524 |
user | 333.6424 ± 1.3506 |
sys | 0.7657 ± 0.0225 |
basic_kokkos_omp_72hrx3hr_fe7c39: experimental Kokkos implementation using the Kokkos::OpenMP execution space
Time Class | Runtime (seconds) ± 1 standard deviation |
---|---|
real | 368.4230 ± 1.1382 |
user | 367.9963 ± 1.1425 |
sys | 0.3733 ± 0.0362 |
basic_kokkos_serial_72hrx3hr_based_fe7c39: sanity-check implementation using the Kokkos::Serial execution space
Time Class | Runtime (seconds) ± 1 standard deviation |
---|---|
real | 367.7456 ± 1.4505 |
user | 367.4793 ± 1.4537 |
sys | 0.2154 ± 0.0207 |
A non-statistical scaling experiment shows a negative scaling pattern. Analysis of individual loop runtimes on a different machine (my laptop) indicates a 10x slowdown in some loops.
Likely cause: parallelism within boxes is too fine-grained. The boxes are small, leaving too little work for too many threads.
Tasks:
Attempted implementation: using Michael's OpenMP pragma-in-macros solution (a true blessing), do the same as before (parallelize the 3D loop nest) using native OpenMP.
Performance: again, not great.
baseline_72hrx3hr_1de045:
Time Class | Runtime (seconds) ± 1 standard deviation |
---|---|
real | 332.3808 ± 1.7961 |
user | 331.4917 ± 1.7961 |
sys | 0.7596 ± 0.0166 |
split_loop_72hrx3hr_f2f7d6:
Time Class | Runtime (seconds) ± 1 standard deviation |
---|---|
real | 334.5446 ± 1.3524 |
user | 333.6424 ± 1.3506 |
sys | 0.7657 ± 0.0225 |
simple_omp_threads-1_72hrx3hr_48f278: OMP_NUM_THREADS=1
Time Class | Runtime (seconds) ± 1 standard deviation |
---|---|
real | 343.8862 ± 0.9246 |
user | 343.4594 ± 0.9295 |
sys | 0.3714 ± 0.0357 |
simple_omp_threads-2_72hrx3hr_48f278: OMP_NUM_THREADS=2
Time Class | Runtime (seconds) ± 1 standard deviation |
---|---|
real | 347.6891 ± 0.9961 |
user | 345.0150 ± 0.9959 |
sys | 2.6195 ± 0.1100 |
simple_omp_threads-4_72hrx3hr_48f278: OMP_NUM_THREADS=4
Time Class | Runtime (seconds) ± 1 standard deviation |
---|---|
real | 353.5161 ± 1.0830 |
user | 348.1278 ± 1.0844 |
sys | 5.3322 ± 0.1635 |
simple_omp_threads-8_72hrx3hr_48f278: OMP_NUM_THREADS=8
Time Class | Runtime (seconds) ± 1 standard deviation |
---|---|
real | 362.1221 ± 0.9454 |
user | 350.9784 ± 1.0036 |
sys | 11.0904 ± 0.1666 |
Exhibits negative scaling. System-call overhead also becomes very large, with sys time growing from roughly 0.37 s at one thread to over 11 s at eight.
Likely cause: again, parallelism within boxes is too fine-grained; the boxes are small, leaving too little work for too many threads.
Tasks:
Get a very simple parallel ParFlow implementation using some form of OpenMP.