modern-fortran / tsunami

A parallel shallow water equations solver
https://modern-fortran.github.io/tsunami
MIT License
149 stars 47 forks

Parallel version slower than sequential #9

Closed FractalArt closed 2 years ago

FractalArt commented 3 years ago

Hi,

I am reading your book, have arrived at chapter 7, and was wondering whether you have any timings available for comparison. I timed the code and was surprised to find that it runs faster on a single core than on four:

tsunami_git/src/ch07 on 🌱 master (34af56e)  
> hyperfine "cafrun -n 1 tsunami"
Benchmark #1: cafrun -n 1 tsunami
  Time (mean ± σ):     646.2 ms ±  51.0 ms    [User: 339.5 ms, System: 104.0 ms]
  Range (min … max):   566.3 ms … 719.4 ms    10 runs

tsunami_git/src/ch07 on 🌱 master (34af56e) 
> hyperfine "cafrun -n 4 tsunami"
Benchmark #1: cafrun -n 4 tsunami
  Time (mean ± σ):     751.9 ms ±  56.0 ms    [User: 1.473 s, System: 0.212 s]
  Range (min … max):   671.5 ms … 863.1 ms    10 runs

I think my coarray installation works since for the weather-buoy example, I do indeed see a speedup. Could it be that there is so much synchronization going on that it ruins the benefits of parallelization?

I am running on Ubuntu 20.04.1 LTS, with OpenCoarrays 2.9.0 (installed as described in Appendix A) on an Intel i7-8565U CPU @ 1.80GHz × 8 processor, and I use gfortran 9.3.0 as the compiler.

milancurcic commented 3 years ago

Hi @FractalArt, you're right: the communication overwhelms the computation, so the code as-is doesn't scale.

Try two things in tsunami.f90:

  1. Increase grid size:

    integer(int32), parameter :: grid_size = 1000

    which was 100 originally, but you can try even larger values. We kept it small in the book so that the example runs fast on a single core. Increasing the grid size increases the computation, which decreases the communication/computation ratio (you want this ratio as small as possible for parallel scaling).
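
    As a rough back-of-the-envelope illustration (the halo width of 2 points per image is an assumption for this sketch, not taken from the book's exact code): with the 1-D grid split evenly over 4 images, each image computes grid_size/4 points per step but exchanges roughly the same fixed number of boundary points regardless of grid size, so

    grid_size = 100,  4 images: ~2 halo points / 25 interior points  = 0.08
    grid_size = 1000, 4 images: ~2 halo points / 250 interior points = 0.008

    i.e. a tenfold increase in grid size cuts the communication/computation ratio roughly tenfold, which is why the larger problem scales better.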

  2. Comment out the gather + print lines inside the time loop:

    ...
    ! gather to image 1 and write current state to screen
    !gather(is:ie)[1] = h(ils:ile) ! there is an all-to-one communication here
    sync all ! this sync is still important before we move to the next time step
    !if (this_image() == 1) print *, n, gather
    
    end do time_loop

    This reduces the communication in each step. The gather operation is only for diagnostic purposes, so removing it doesn't affect the result. Alternatively, to preserve some diagnostics, you can do the gather + print only every 10th or 100th time step.
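
    A minimal sketch of that alternative, guarding the gather + print with a modulo check on the time step (the diag_interval parameter is hypothetical, added here for illustration):

    integer(int32), parameter :: diag_interval = 100 ! hypothetical: report every 100th step

      ...
      ! gather to image 1 and print only every diag_interval-th step
      if (mod(n, diag_interval) == 0) then
        gather(is:ie)[1] = h(ils:ile) ! all-to-one communication, now infrequent
      end if
      sync all ! still required every step before advancing
      if (this_image() == 1 .and. mod(n, diag_interval) == 0) print *, n, gather

    end do time_loop

    The sync all stays outside the if block because every image must synchronize on every step; only the diagnostic communication is skipped.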

Let me know how this works out.

Ideally, this should have been explained in the book. We had a section about it, but it didn't make the cut. However, it should at least be explained in the README of this repo. Do you agree?

Thank you for reading and reporting this.

FractalArt commented 3 years ago

Hi @milancurcic,

Thanks a lot for your quick reply. You're right: if I crank up the grid size I see the improvement, and, as you say, the larger the grid size, the bigger the benefit of parallelization.

Regarding the explanation in the README, I have to say that I did not find it.

milancurcic commented 3 years ago

Regarding the explanation in the README, I have to say that I did not find it.

Yes, there isn't one right now; I meant that I should make an effort to write an explanation there.