marius-team / marius

Large scale graph learning on a single machine.
https://marius-project.org
Apache License 2.0

Use multiple CUDA streams for true overlap of compute and data transfer on the GPU. #115

Open · JasonMoho opened this issue 1 year ago

JasonMoho commented 1 year ago

Is your feature request related to a problem? Please describe.
Currently, H2D and D2H data transfers occur on the default CUDA stream, which is also used by the model forward/backward pass. Because transfers and compute share a stream, they serialize, which limits GPU utilization and reduces training throughput by a small amount.

By moving data transfer onto CUDA streams separate from the compute stream, the processing pipeline can overlap transfers with the forward/backward pass and make better use of the GPU.

SALIENT is an example of a system with a well-optimized pipeline for training GNNs for node classification. Its separate-CUDA-stream transfer implementation is located here: https://github.com/MITIBMxGraph/SALIENT/blob/master/fast_trainer/transferers.py#L22
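
For reference, a minimal PyTorch-level sketch of the idea, not the Marius or SALIENT implementation: a side stream performs the H2D copy of the next batch while the default stream computes on the current one. The `loader`, `model`, `optimizer`, and `loss_fn` names here are placeholders, and the loader is assumed to yield pinned CPU tensors (`pin_memory=True`) so that `non_blocking=True` copies are truly asynchronous.

```python
import torch

def prefetch(x_cpu, stream):
    # Start an asynchronous H2D copy of a pinned CPU tensor on the transfer stream
    # and record an event so the compute stream can wait on just this copy.
    with torch.cuda.stream(stream):
        x_gpu = x_cpu.to("cuda", non_blocking=True)
        done = torch.cuda.Event()
        done.record(stream)
    return x_gpu, done

def train(loader, model, optimizer, loss_fn):
    transfer_stream = torch.cuda.Stream()  # side stream dedicated to H2D copies
    it = iter(loader)
    x, done = prefetch(next(it), transfer_stream)
    for next_cpu in it:
        cur_x, cur_done = x, done
        # Kick off the copy of the next batch before computing on the current one.
        x, done = prefetch(next_cpu, transfer_stream)
        # Compute waits only for the copy of the batch it is about to use.
        torch.cuda.current_stream().wait_event(cur_done)
        # Mark the tensor as used on the compute stream so the caching allocator
        # does not recycle its memory while work on it is still queued.
        cur_x.record_stream(torch.cuda.current_stream())
        loss = loss_fn(model(cur_x))  # forward/backward run on the default stream
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    # (A full implementation would also process the final prefetched batch
    # and handle D2H transfers of results on another side stream.)
```

The `record_stream` call is needed because the GPU tensor is allocated while the transfer stream is current; marking it as used on the compute stream prevents the caching allocator from reusing its memory before the queued compute finishes.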