Closed by tp-nan 6 months ago
Hi, thanks for the valuable feedback. I've thought it over: CUDA streams are typically used to overlap memory operations with kernels, and neither branch in the current code has that pattern, so I feel it's better not to use a CUDA stream here.
I see~
The copy here uses the default stream (CUDA stream 0). My suggestion is to use the same stream as the kernel: keeping the kernel and the copy on one stream avoids synchronizing with the default stream. Note that PyTorch itself uses the default stream by default: https://torchpipe.github.io/zh/docs/preliminaries/pytorch_libtorch
Also, CV-CUDA puts the kernel and the copy on the same stream. It does not perform stream synchronization internally, i.e. it is asynchronous.
OK, I see what you mean. In the multi-threaded, single-GPU/multi-GPU case, this synchronization issue does indeed need to be considered.
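The suggestion above (kernel and copy on one non-default stream, no internal synchronization, like CV-CUDA) could look roughly like this. This is a minimal sketch, not the repo's actual code; the kernel name `process` and the function signature are hypothetical:

```cpp
// Hypothetical sketch: enqueue the H2D copy and the kernel on the SAME
// non-default stream, so neither operation implicitly syncs with stream 0.
#include <cuda_runtime.h>

__global__ void process(unsigned char* img, size_t pitch, int w, int h) {
    // ... per-pixel work ...
}

void run(unsigned char* d_img, size_t d_pitch,
         const unsigned char* h_src, size_t h_pitch,
         int width, int height, cudaStream_t stream) {
    // Async 2D copy, ordered within `stream`:
    cudaMemcpy2DAsync(d_img, d_pitch, h_src, h_pitch,
                      width, height, cudaMemcpyHostToDevice, stream);
    // Kernel on the same stream runs only after the copy completes:
    dim3 block(32, 32);
    dim3 grid((width + 31) / 32, (height + 31) / 32);
    process<<<grid, block, 0, stream>>>(d_img, d_pitch, width, height);
    // Intentionally no cudaStreamSynchronize here: like CV-CUDA, the
    // function stays asynchronous and the caller decides when to sync.
}
```

The design point is that stream ordering replaces explicit synchronization: operations on one stream execute in enqueue order, so no sync with the default stream is needed.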
Use cudaMemcpy2DAsync to avoid unnecessary stream synchronization, and to make streams created with cudaStreamNonBlocking work correctly. If the blocking cudaMemcpy2D is kept, a cudaStreamSynchronize(stream) should be added before it whenever the stream was created with cudaStreamNonBlocking.
Note that cudaMemcpy includes a synchronization on the default stream. It does not include the equivalent of cudaDeviceSynchronize(). Normally, a synchronization on the default stream synchronizes all other created streams on that device. However, if you create a stream with the cudaStreamNonBlocking flag, that stream will not be synchronized by a synchronization in the default stream.
see here
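A small sketch of the caveat described above, assuming a kernel has written into a device buffer `d_buf` on a non-blocking stream (buffer names and sizes are hypothetical):

```cpp
// Hypothetical sketch: a stream created with cudaStreamNonBlocking is NOT
// synchronized by the implicit default-stream sync inside cudaMemcpy.
#include <cuda_runtime.h>

void copy_back(unsigned char* h_dst, size_t h_pitch,
               const unsigned char* d_buf, size_t d_pitch,
               int width, int height, cudaStream_t stream) {
    // Assume a kernel writing d_buf was already launched on `stream`,
    // and `stream` was created with:
    //   cudaStreamCreateWithFlags(&stream, cudaStreamNonBlocking);

    // UNSAFE here: cudaMemcpy2D only syncs with the default stream, so it
    // could read d_buf before the kernel on the non-blocking stream finishes.
    // cudaMemcpy2D(h_dst, h_pitch, d_buf, d_pitch,
    //              width, height, cudaMemcpyDeviceToHost);

    // Option A: explicitly drain the non-blocking stream first.
    // cudaStreamSynchronize(stream);
    // cudaMemcpy2D(h_dst, h_pitch, d_buf, d_pitch,
    //              width, height, cudaMemcpyDeviceToHost);

    // Option B (suggested above): keep the copy on the same stream, so
    // stream ordering guarantees it runs after the kernel.
    cudaMemcpy2DAsync(h_dst, h_pitch, d_buf, d_pitch,
                      width, height, cudaMemcpyDeviceToHost, stream);
    cudaStreamSynchronize(stream);  // sync only when the result is needed
}
```

Option B matches the cudaMemcpy2DAsync suggestion: the copy is ordered after the kernel by the stream itself, and the single synchronization point is chosen by the caller.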