The commit is needed to avoid GPU 0 being set as the compute stream via torch.cuda.current_stream() during initialization across all GPUs.
The perf RunningAvgSamplesPerSec metrics improves on a multi gpu node, tested on AMD GPU with ROCm stack.
As number of GPUs increases; without this commit, GPU 0 takes in more load compared to other GPUs.
The commit is needed to avoid GPU 0 being set as the compute stream via torch.cuda.current_stream() during initialization across all GPUs. The perf RunningAvgSamplesPerSec metrics improves on a multi gpu node, tested on AMD GPU with ROCm stack. As number of GPUs increases; without this commit, GPU 0 takes in more load compared to other GPUs.