Is your feature request related to a problem? Please describe.
Currently, H2D (host-to-device) and D2H (device-to-host) data transfers run on the default CUDA stream, which is also used for the model's forward and backward passes. Because operations on a single stream execute in order, transfers cannot overlap with computation, which limits GPU utilization and reduces training throughput by a small amount.
Moving data transfers to a separate CUDA stream would let them overlap with computation and make the processing pipeline more efficient.
SALIENT is an example of a system that uses a well-optimized pipeline for training GNNs on node classification. Its implementation of separate CUDA streams is here: https://github.com/MITIBMxGraph/SALIENT/blob/master/fast_trainer/transferers.py#L22
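
To make the request concrete, here is a minimal PyTorch sketch of the idea. It is not SALIENT's code and not a proposed final design. It covers only the H2D direction and assumes each batch is a single CPU tensor. The names `prefetch_batches` and `copy_stream` are hypothetical. The generator starts the copy of the next batch on a dedicated stream before yielding the current one, so the copy can overlap with the forward pass on the default stream. On a machine without CUDA it falls back to a plain synchronous copy.

```python
import torch


def prefetch_batches(batches, device):
    """Yield batches on `device`, copying batch i+1 while batch i is in use.

    Hypothetical helper for illustration; assumes each batch is one CPU tensor.
    """
    if device.type != "cuda":
        # No CUDA streams available: plain synchronous copy.
        for batch in batches:
            yield batch.to(device)
        return

    copy_stream = torch.cuda.Stream(device=device)  # dedicated H2D stream

    def start_copy(batch):
        # Pinned host memory is required for a truly asynchronous copy.
        with torch.cuda.stream(copy_stream):
            return batch.pin_memory().to(device, non_blocking=True)

    it = iter(batches)
    first = next(it, None)
    if first is None:
        return
    pending = start_copy(first)

    while pending is not None:
        compute_stream = torch.cuda.current_stream(device)
        # Make the compute stream wait until the pending copy has finished.
        compute_stream.wait_stream(copy_stream)
        ready = pending
        # The tensor was allocated on copy_stream; tell the caching allocator
        # that compute_stream also uses it, so its memory is not reused early.
        ready.record_stream(compute_stream)
        # Queue the next copy before handing control back to the caller.
        nxt = next(it, None)
        pending = start_copy(nxt) if nxt is not None else None
        yield ready


# Usage: the model runs on the default stream while the next batch is copied.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = torch.nn.Linear(8, 2).to(device)
batches = [torch.randn(4, 8) for _ in range(3)]
outputs = [model(x) for x in prefetch_batches(batches, device)]
```

A D2H path (for example, copying predictions back to the host) could follow the same pattern in reverse, with the copy stream waiting on the compute stream before the transfer starts.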