Improve all-to-all benchmark

This PR improves the all-to-all benchmark so that it performs a series of runs over a range of sizes.

Sample output:

Size (MB): 1, Elasped time (s): 0.000176531, Bandwidth per GPU (GB/s): 16.9942
Size (MB): 2, Elasped time (s): 0.000267493, Bandwidth per GPU (GB/s): 22.4305
Size (MB): 4, Elasped time (s): 0.000302923, Bandwidth per GPU (GB/s): 39.614
Size (MB): 8, Elasped time (s): 0.000512558, Bandwidth per GPU (GB/s): 46.824
Size (MB): 16, Elasped time (s): 0.000820293, Bandwidth per GPU (GB/s): 58.5157
Size (MB): 32, Elasped time (s): 0.00148969, Bandwidth per GPU (GB/s): 64.4429
Size (MB): 64, Elasped time (s): 0.0027744, Bandwidth per GPU (GB/s): 69.2042
Size (MB): 128, Elasped time (s): 0.00543221, Bandwidth per GPU (GB/s): 70.6895
Size (MB): 256, Elasped time (s): 0.0106805, Bandwidth per GPU (GB/s): 71.9069
Size (MB): 512, Elasped time (s): 0.0212418, Bandwidth per GPU (GB/s): 72.3102
Size (MB): 1024, Elasped time (s): 0.0423483, Bandwidth per GPU (GB/s): 72.5413
Size (MB): 2048, Elasped time (s): 0.0845411, Bandwidth per GPU (GB/s): 72.6748
Size (MB): 4096, Elasped time (s): 0.168941, Bandwidth per GPU (GB/s): 72.7353
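
For readers who want to sanity-check the figures above, here is a minimal sketch of one formula that reproduces them. The PR text does not state how the bandwidth is computed; the factor of 3 and the convention 1 MB = 10^6 bytes are assumptions inferred from the sample numbers (a factor of 3 would be consistent with, for example, each of 4 GPUs exchanging data with 3 peers, but the PR does not say so). The function name `bandwidth_gb_per_s` and the parameter `peer_factor` are hypothetical and do not come from the benchmark source.

```python
def bandwidth_gb_per_s(size_mb: float, elapsed_s: float, peer_factor: int = 3) -> float:
    """Estimate per-GPU bandwidth in GB/s from a reported size and elapsed time.

    Assumptions (inferred from the sample output, not from the benchmark code):
      * 1 MB is 10**6 bytes and 1 GB is 10**9 bytes
      * the bytes moved per GPU are peer_factor * size
    """
    bytes_moved = peer_factor * size_mb * 1e6
    return bytes_moved / elapsed_s / 1e9


# (size in MB, elapsed time in s, reported bandwidth in GB/s), copied from the sample output
samples = [
    (1, 0.000176531, 16.9942),
    (64, 0.0027744, 69.2042),
    (4096, 0.168941, 72.7353),
]

for size_mb, elapsed_s, reported in samples:
    computed = bandwidth_gb_per_s(size_mb, elapsed_s)
    print(f"size={size_mb} MB  computed={computed:.4f} GB/s  reported={reported} GB/s")
```

Under these assumptions the computed values agree with the reported ones up to the rounding of the printed elapsed times.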

This PR is built on top of #49.

rapidsai / distributed-join

Improve all-to-all benchmark #50