MPI barriers

I separated this into two commits.

The first commit adds an MPI barrier before each kernel call. This moves the barriers from config granularity to run granularity: the default number of runs is 10, and we probably want every rank synced up before each call to the actual kernel, not just once per config. This was done for each kernel in all 3 backends.

The second commit is the suggested load-balance fix: it adds an MPI barrier before the call to sg_get_time_ms that follows each kernel (Serial and OpenMP backends only). As a result, the measured time corresponds to the slowest rank. This time is captured for each run (10 by default), and each rank then reports the minimum runtime (maximum bandwidth).
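For reviewers, here is a rough sketch of the timing pattern the two commits add up to. It is not the actual Spatter code: `now_ms` is a hypothetical stand-in for sg_get_time_ms, `sync_ranks`, `min_time_ms`, and `counting_kernel` are made-up names, and the `USE_MPI` guard and the placement of the start timer after the first barrier are assumptions for illustration.

```c
#include <assert.h>
#include <float.h>
#include <time.h>
#ifdef USE_MPI          /* assumed build flag; without it the sketch runs serially */
#include <mpi.h>
#endif

/* Hypothetical stand-in for sg_get_time_ms: current time in milliseconds.
 * Uses C11 timespec_get, which is wall-clock time, not a monotonic clock. */
static double now_ms(void) {
    struct timespec ts;
    timespec_get(&ts, TIME_UTC);
    return (double)ts.tv_sec * 1e3 + (double)ts.tv_nsec / 1e6;
}

/* Synchronize all ranks; a no-op when built without MPI. */
static void sync_ranks(void) {
#ifdef USE_MPI
    MPI_Barrier(MPI_COMM_WORLD);
#endif
}

typedef void (*kernel_fn)(void *arg);

/* Hypothetical kernel used only to exercise the sketch: counts its calls. */
static void counting_kernel(void *arg) {
    (*(int *)arg)++;
}

/* Time nruns runs of kernel and return the minimum elapsed time in ms. */
static double min_time_ms(kernel_fn kernel, void *arg, int nruns) {
    assert(nruns > 0);
    double best = DBL_MAX;
    for (int r = 0; r < nruns; r++) {
        sync_ranks();                 /* commit 1: every rank starts this run together */
        double start = now_ms();      /* assumption: timer starts after the barrier */
        kernel(arg);
        sync_ranks();                 /* commit 2: wait for the slowest rank ... */
        double elapsed = now_ms() - start;  /* ... so elapsed covers that rank */
        if (elapsed < best) {
            best = elapsed;           /* minimum runtime corresponds to maximum bandwidth */
        }
    }
    return best;
}
```

Calling `min_time_ms(counting_kernel, &calls, 10)` mirrors the default of 10 runs. Because of the second barrier, every rank should report roughly the same minimum, which is the slowest rank's time for its best run.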

hpcgarage / spatter

Mpi barriers #181