https://github.com/snap-stanford/ogb/blob/master/examples/nodeproppred/mag/sampler.py
When I profiled this program with Nsight, I found many Memcpy DtoH operations scattered throughout the forward and backward stages of training, each followed by a small gap. Why do these Memcpy DtoH operations appear? A device-to-host copy in the middle of computation seems very strange. These Memcpy DtoH operations also appear right after DeviceReduceKernel and DeviceReduceSingleTileKernel, and the same operations show up inside a Linear layer. What is the connection between Memcpy DtoH and DeviceReduceSingleTileKernel, and why do the two always appear together?