When doing distributed training, FFCV allocates an additional CUDA context on GPU 0 for every rank besides rank 0. This happens because the call to `pin_memory` requires a CUDA context, and that context gets allocated on the current device, which defaults to GPU 0 in every process. This PR calls `torch.cuda.set_device` in the loader thread so the context is created on the rank's own GPU instead.
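For reference, here is a minimal sketch of the pattern the fix uses. The `loader_thread` function and the hard-coded `rank` are illustrative stand-ins, not FFCV's actual loader code:

```python
import threading

import torch

def loader_thread(device: int) -> None:
    # Bind this thread to the rank's own GPU before doing any CUDA work.
    # Without this call, pin_memory() below would initialize its CUDA
    # context on the process-wide default device, i.e. GPU 0.
    torch.cuda.set_device(device)
    batch = torch.empty(8, 3, 224, 224)
    # Allocating pinned (page-locked) memory requires a CUDA context;
    # it is now created on `device` instead of GPU 0.
    batch = batch.pin_memory()

rank = 1  # e.g. this process's local rank in distributed training
t = threading.Thread(target=loader_thread, args=(rank,))
t.start()
t.join()
```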
**Before** (screenshot)
**After** (screenshot)
(Please ignore the slight differences in memory usage; I took the Before screenshot slightly before all of the CUDA memory was allocated.)