Open janjust opened 3 years ago
@manjugv @Sergei-Lebedev @vspetrov Hey guys - this is a PR that adds the DPU team, developed during the hackathon by @artpol84, @Sergei-Lebedev, and me.
It's the first attempt that successfully runs, but it obviously needs strong vetting. We did preliminary data checks with the xccl allreduce tests, which seem to pass, and it successfully runs the pytorch param/comms bench.
@janjust please change the commit message as follows:
Co-authored-by: Artem Polyakov <artpol84@gmail.com>
Co-authored-by: Sergey Lebedev <sergeyle@nvidia.com>
I tried it out of curiosity and it works as expected: https://github.com/artpol84/xccl/commit/91a6466ba984109480c51fc7125559fdcc0b97d6
done
This PR adds the new DPU team as well as a contrib directory with the accompanying DPU daemon app.
This is a first but comprehensive attempt that successfully runs the pytorch param/comms benchmark. Tested on 32 BlueField-enabled nodes.
There are several configury options to keep in mind when running.
new config options:
--with-dpu=yes
Signed-off-by: Tomislavj Janjusic <tomislavj@nvidia.com>
Co-authored-by: Artem Polyakov <artpol84@gmail.com>
Co-authored-by: Sergey Lebedev <sergeyle@nvidia.com>
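
For anyone trying this out, a minimal sketch of how the `--with-dpu=yes` option listed above might be passed at configure time. It assumes a generated `./configure` script already exists in the source tree; the `DPU_FLAG` variable and the fallback message are illustrative only and not part of this PR.

```shell
#!/bin/sh
# Sketch only: enable the DPU team when configuring the build.
# --with-dpu=yes is the option named in this PR; everything else here
# (variable name, guard, message) is a hypothetical convenience.
DPU_FLAG="--with-dpu=yes"

if [ -x ./configure ]; then
    # Run the project's configure script with the DPU team enabled.
    ./configure "$DPU_FLAG"
else
    # No configure script in the current directory; just show the command.
    echo "no ./configure here; would run: ./configure $DPU_FLAG"
fi
```

Any other options the build needs (install prefix, dependency paths) would be added to the same `./configure` line.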