I separated this into two commits:
The first commit adds an MPI barrier before each kernel call. This moves the barriers from config granularity to run granularity: the default number of runs is 10, and we probably want every rank synced up before each call to the actual kernel. This was done for each kernel in all 3 backends.
The second commit is the suggested load-balance fix: it adds an MPI barrier before the call to sg_get_time_ms after each kernel has finished (for the Serial and OpenMP backends). As a result, the measured time corresponds to the slowest rank. This time is captured for each run (the default number of runs is 10), and each rank then reports the minimum runtime (maximum BW).
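For reviewers, here is a rough sketch of the timing pattern the two commits aim for. It is not the code in the diff: `sync_ranks`, `now_ms`, `time_runs`, `min_ms`, and `count_calls` are hypothetical names, `now_ms` stands in for sg_get_time_ms, and the `USE_MPI` switch is an assumption added so the sketch also builds without MPI (single rank, no-op barrier).

```c
#include <assert.h>
#include <time.h>

#ifdef USE_MPI
#include <mpi.h>
/* All ranks wait here until every rank has arrived. */
void sync_ranks(void) { MPI_Barrier(MPI_COMM_WORLD); }
double now_ms(void) { return MPI_Wtime() * 1e3; }
#else
/* Assumption: without MPI there is one rank, so the barrier is a no-op. */
void sync_ranks(void) {}
double now_ms(void)
{
    struct timespec ts;
    timespec_get(&ts, TIME_UTC); /* C11 wall-clock time */
    return (double)ts.tv_sec * 1e3 + (double)ts.tv_nsec / 1e6;
}
#endif

typedef void (*kernel_fn)(void *arg);

/* Hypothetical stand-in kernel: counts how often it was called. */
void count_calls(void *arg) { ++*(int *)arg; }

/* Runs `kernel` nruns times and stores each run's elapsed time in times_ms. */
void time_runs(kernel_fn kernel, void *arg, int nruns, double *times_ms)
{
    assert(nruns > 0);
    for (int run = 0; run < nruns; run++) {
        sync_ranks();                /* first commit: sync before every run */
        double start_ms = now_ms();
        kernel(arg);
        sync_ranks();                /* second commit: wait for the slowest rank */
        times_ms[run] = now_ms() - start_ms;
    }
}

/* Minimum over the runs; the minimum runtime gives the maximum bandwidth. */
double min_ms(const double *times_ms, int nruns)
{
    assert(nruns > 0);
    double best = times_ms[0];
    for (int i = 1; i < nruns; i++)
        if (times_ms[i] < best)
            best = times_ms[i];
    return best;
}
```

With MPI this would be built with something like `mpicc -DUSE_MPI`, with `MPI_Init` and `MPI_Finalize` around the calls. Because the second barrier sits inside the timed region, every rank records roughly the slowest rank's time for each run.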