marius-team / marius

Large scale graph learning on a single machine.
https://marius-project.org
Apache License 2.0

Use multiple CUDA streams for true overlap of compute and data transfer on the GPU. #115

Open · JasonMoho opened this issue 1 year ago

JasonMoho commented 1 year ago

Is your feature request related to a problem? Please describe.
Currently, H2D and D2H data transfers occur on the default CUDA stream, which is also used by the model forward/backward pass. Because transfers and compute share a stream, they serialize, which limits GPU utilization and reduces training throughput by a small amount.

By moving data transfer onto CUDA streams separate from the compute stream, the processing pipeline can overlap transfers with the forward/backward pass and make better use of the GPU.

SALIENT is an example of a system with a well-optimized pipeline for training GNNs for node classification. Its separate-CUDA-stream transfer implementation is located here: https://github.com/MITIBMxGraph/SALIENT/blob/master/fast_trainer/transferers.py#L22
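
For reference, a minimal PyTorch-level sketch of the idea, not the Marius or SALIENT implementation: a side stream performs the H2D copy of the next batch while the default stream computes on the current one. The `loader`, `model`, `optimizer`, and `loss_fn` names here are placeholders, and the loader is assumed to yield pinned CPU tensors (`pin_memory=True`) so that `non_blocking=True` copies are truly asynchronous.

```python
import torch

def prefetch(x_cpu, stream):
    # Start an asynchronous H2D copy of a pinned CPU tensor on the transfer stream
    # and record an event so the compute stream can wait on just this copy.
    with torch.cuda.stream(stream):
        x_gpu = x_cpu.to("cuda", non_blocking=True)
        done = torch.cuda.Event()
        done.record(stream)
    return x_gpu, done

def train(loader, model, optimizer, loss_fn):
    transfer_stream = torch.cuda.Stream()  # side stream dedicated to H2D copies
    it = iter(loader)
    x, done = prefetch(next(it), transfer_stream)
    for next_cpu in it:
        cur_x, cur_done = x, done
        # Kick off the copy of the next batch before computing on the current one.
        x, done = prefetch(next_cpu, transfer_stream)
        # Compute waits only for the copy of the batch it is about to use.
        torch.cuda.current_stream().wait_event(cur_done)
        # Mark the tensor as used on the compute stream so the caching allocator
        # does not recycle its memory while work on it is still queued.
        cur_x.record_stream(torch.cuda.current_stream())
        loss = loss_fn(model(cur_x))  # forward/backward run on the default stream
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    # (A full implementation would also process the final prefetched batch
    # and handle D2H transfers of results on another side stream.)
```

The `record_stream` call is needed because the GPU tensor is allocated while the transfer stream is current; marking it as used on the compute stream prevents the caching allocator from reusing its memory before the queued compute finishes.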