Open gareth-ferneyhough opened 8 years ago
I see increased performance with the MaxFlops & QTC benchmarks.
In my runs, at least, the Stencil2D GFLOPS metric holds steady, and so do the Scan cases. It may be that the runs on each GPU are done in sequence, so the total time increases in proportion to the number of GPUs and the time-normalized performance metrics average out the same.
I am expecting to observe a speedup when I run either an EP or TP benchmark on multiple devices, but that is not the case. The Stencil2D benchmark does show a speedup when I use multiple devices:
./shocdriver -d 0 -cuda -s 4 -benchmark Stencil2D
result for stencil: 141.2280 GFLOPS
vs.
./shocdriver -d 0,1,2,3 -cuda -s 4 -benchmark Stencil2D
result for stencil: 406.1190 GFLOPS
However, this is the only benchmark I have found (so far) that shows a speedup. For example:
./shocdriver -d 0 -cuda -s 4 -benchmark Scan
result for scan: 46.8924 GB/s
vs.
./shocdriver -d 0,1,2,3 -cuda -s 4 -benchmark Scan
result for scan: 46.8561 GB/s
Similarly, Reduction and GEMM show no improvement. Am I missing something here? I am running version 1.1.5.
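To put the Stencil2D and Scan numbers above on a common scale, here is a small Python sketch that computes the speedup and the per-device efficiency from the reported results. It is not part of SHOC; the helper names `speedup` and `efficiency` are made up for illustration, and it assumes the reported figure for the 4-device run is meant to be an aggregate rate.

```python
# Illustrative helpers (not part of SHOC) for comparing 1-GPU and 4-GPU results.

def speedup(single, multi):
    """Ratio of the multi-device result to the single-device result."""
    return multi / single

def efficiency(single, multi, n_devices):
    """Speedup divided by device count; 1.0 would be perfect linear scaling."""
    return speedup(single, multi) / n_devices

N_DEVICES = 4  # devices 0,1,2,3 in the runs above

# (single-device result, four-device result), copied from the output above
results = {
    "Stencil2D (GFLOPS)": (141.2280, 406.1190),
    "Scan (GB/s)": (46.8924, 46.8561),
}

for name, (one, four) in results.items():
    s = speedup(one, four)
    e = efficiency(one, four, N_DEVICES)
    print(f"{name}: speedup {s:.2f}x, efficiency {e:.0%}")
```

With these inputs, Stencil2D comes out at roughly 2.88x (about 72% efficiency), while Scan stays at about 1.00x (about 25% efficiency), which is what you would see if the reported Scan rate were effectively a per-device figure.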