It will be interesting to see whether we observe similar speeds, and the code is probably useful too.
Note that their benchmark measures only raw all-reduce communication, with no learning step. This is relevant when training is communication-bound, and we will likely hit that scenario soon when training linear models.
We found this benchmark here: https://github.com/diux-dev/cluster/tree/master/pytorch_distributed_benchmark
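To make concrete what "raw all-reduce, no learning" means, here is a minimal sketch of such a benchmark using `torch.distributed` (this is not the linked repository's code; the gloo backend, tensor size, and iteration count are assumptions):

```python
# Minimal raw all-reduce timing sketch, assuming launch via `torchrun`,
# which sets RANK / WORLD_SIZE / LOCAL_RANK environment variables.
import time

import torch
import torch.distributed as dist


def main():
    dist.init_process_group(backend="gloo")  # backend choice is an assumption
    rank = dist.get_rank()

    # ~100 MB of float32 data, roughly gradient-sized (assumed size).
    tensor = torch.ones(25_000_000)

    # Warm-up so timings exclude one-off setup cost.
    for _ in range(5):
        dist.all_reduce(tensor)

    iters = 20
    start = time.time()
    for _ in range(iters):
        dist.all_reduce(tensor)  # raw communication only, no model / learning step
    elapsed = time.time() - start

    if rank == 0:
        gb = tensor.numel() * tensor.element_size() * iters / 1e9
        print(f"all-reduce: {elapsed / iters * 1e3:.1f} ms/iter, ~{gb / elapsed:.2f} GB/s")

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```

Run with e.g. `torchrun --nproc_per_node=2 allreduce_bench.py` to time the collective on a single machine; the loop contains no forward/backward pass, so the numbers reflect communication cost alone.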