openppl-public / ppl.cv

ppl.cv is openPPL's high-performance image-processing library, supporting various platforms.
Apache License 2.0
484 stars 108 forks

Avoid unnecessary stream synchronization and fix streams of type cudaStreamNonBlocking #131

Closed tp-nan closed 6 months ago

tp-nan commented 6 months ago

Use cudaMemcpy2DAsync to avoid unnecessary stream synchronization and to make streams created with the cudaStreamNonBlocking flag work correctly. Otherwise, a cudaStreamSynchronize(stream) has to be added before cudaMemcpy2D whenever the stream is of type cudaStreamNonBlocking.
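
For concreteness, a minimal sketch of the proposed change, assuming the copy's stream is passed in by the caller (the function and parameter names here are hypothetical, not ppl.cv's actual call site):

```cuda
#include <cuda_runtime.h>

// Sketch only: enqueue the 2D copy on the caller's stream instead of
// issuing a synchronous cudaMemcpy2D ordered on the default stream.
void copyOnCallerStream(void* dst, size_t dpitch, const void* src,
                        size_t spitch, size_t width_bytes, size_t height,
                        cudaStream_t stream) {
  // cudaMemcpy2D would block the host and synchronize with the default
  // stream; cudaMemcpy2DAsync stays on the same stream as the kernels.
  cudaMemcpy2DAsync(dst, dpitch, src, spitch, width_bytes, height,
                    cudaMemcpyHostToDevice, stream);
}
```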

Note that cudaMemcpy includes a synchronization on the default stream, though not the equivalent of a cudaDeviceSynchronize(). Normally, a synchronization on the default stream also synchronizes all other streams created on that device. However, a stream created with the cudaStreamNonBlocking flag will not be synchronized by a synchronization on the default stream.
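
A minimal sketch of the workaround that would otherwise be required before a synchronous copy (function and parameter names are hypothetical):

```cuda
#include <cuda_runtime.h>

// Sketch: assume `stream` was created with
// cudaStreamCreateWithFlags(&stream, cudaStreamNonBlocking), so the
// default-stream ordering inside cudaMemcpy2D will not wait for it.
void syncThenCopy(void* dst, size_t dpitch, const void* src, size_t spitch,
                  size_t width_bytes, size_t height, cudaStream_t stream) {
  cudaStreamSynchronize(stream);  // explicit sync, needed for non-blocking streams
  cudaMemcpy2D(dst, dpitch, src, spitch, width_bytes, height,
               cudaMemcpyDeviceToHost);
}
```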

see here

jimurk commented 6 months ago

Hi, thanks for the valuable suggestion. I thought it over: CUDA streams are usually used to overlap memory operations with kernels, and neither of the two branches in the existing code has that situation, so it seems better not to use a CUDA stream here.

tp-nan commented 6 months ago

> it seems better not to use a CUDA stream here

I see~

The copy here uses the default stream (CUDA stream 0); my suggestion is to issue it on the same stream as the kernel. That keeps the kernel and the copy on the same stream and avoids synchronizing with the default stream. PyTorch also uses the default stream by default: https://torchpipe.github.io/zh/docs/preliminaries/pytorch_libtorch

In addition, CVCUDA also puts the kernel and the copy on the same stream. It does not handle stream synchronization internally, i.e., it is asynchronous; see the sketch below.
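
A hypothetical sketch of that pattern (demoKernel and process are invented for illustration, not CVCUDA's actual code):

```cuda
#include <cuda_runtime.h>

// Hypothetical kernel, used only to illustrate the pattern.
__global__ void demoKernel(float* data, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) data[i] *= 2.0f;
}

// Kernel and copy share one caller-provided stream; nothing here touches
// the default stream, and no synchronization is performed internally.
void process(float* d_data, float* h_out, int n, cudaStream_t stream) {
  demoKernel<<<(n + 255) / 256, 256, 0, stream>>>(d_data, n);
  // h_out should be pinned (cudaMallocHost) for the copy to be truly async.
  cudaMemcpyAsync(h_out, d_data, n * sizeof(float),
                  cudaMemcpyDeviceToHost, stream);
  // The caller decides when to call cudaStreamSynchronize(stream).
}
```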

jimurk commented 6 months ago

OK, I see what you mean now. In the multi-threaded, single-GPU/multi-GPU case, this synchronization issue does indeed need to be considered.