mtazzari closed 7 years ago
Yes, we should. But it's not a top priority, as it doesn't change the speed of a single transform.
As suggested by Richard: try compiling with the `--default-stream per-thread` option!
Here are the docs:
https://devblogs.nvidia.com/parallelforall/gpu-pro-tip-cuda-7-streams-simplify-concurrency/
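For reference, a minimal sketch of what the flag changes, assuming a toy kernel and a hypothetical file name `streams_demo.cu` (neither is from our code base). Each host thread launches without an explicit stream argument. Built normally, all launches share the legacy default stream and serialize; built with `--default-stream per-thread`, each thread gets its own default stream, so the launches are allowed to overlap.

```cuda
// Build (file name is an assumption):
//   nvcc --default-stream per-thread -std=c++11 streams_demo.cu -o streams_demo
#include <cstdio>
#include <thread>
#include <vector>
#include <cuda_runtime.h>

// Toy kernel standing in for real work (not the library's transform).
__global__ void scale(float *x, int n, float a)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= a;
}

void worker(int id)
{
    const int n = 1 << 20;
    float *d_x = nullptr;
    cudaMalloc(&d_x, n * sizeof(float));
    cudaMemset(d_x, 0, n * sizeof(float));

    // No stream argument: this goes to the default stream. With the
    // per-thread flag, that is this host thread's own default stream.
    scale<<<(n + 255) / 256, 256>>>(d_x, n, 2.0f);

    // With the per-thread flag, stream 0 here refers to this thread's stream.
    cudaStreamSynchronize(0);
    cudaFree(d_x);
    std::printf("thread %d done\n", id);
}

int main()
{
    std::vector<std::thread> threads;
    for (int id = 0; id < 4; ++id) threads.emplace_back(worker, id);
    for (auto &t : threads) t.join();
    cudaDeviceReset();
    return 0;
}
```

Whether the kernels actually run concurrently is best checked in a profiler timeline, since it also depends on the GPU having free resources.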
Oh man, such an easy solution. I will create a PR for @mtazzari to test on his GPU. But we have to check that this also helps with multiple processes, as that's the actual use case with emcee.
An important point is raised in this blog post:
https://devblogs.nvidia.com/parallelforall/how-overlap-data-transfers-cuda-fortran/
If we use the default stream, neither we nor anybody calling our code can overlap copy and execution. As library authors, we should not use the default stream!
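A rough sketch of what that could look like, assuming a hypothetical entry point `lib_scale_async` and a toy kernel (both made up for illustration, not our actual API). The library function takes a `cudaStream_t` from the caller and issues every copy and launch on it, so the caller decides what may overlap with what.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Toy kernel standing in for the real computation.
__global__ void scale(float *x, int n, float a)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= a;
}

// Hypothetical library entry point: all work goes to the caller's stream,
// nothing touches the default stream.
cudaError_t lib_scale_async(float *h_x, float *d_x, int n, float a,
                            cudaStream_t stream)
{
    const size_t bytes = n * sizeof(float);
    cudaError_t err = cudaMemcpyAsync(d_x, h_x, bytes,
                                      cudaMemcpyHostToDevice, stream);
    if (err != cudaSuccess) return err;

    scale<<<(n + 255) / 256, 256, 0, stream>>>(d_x, n, a);
    err = cudaGetLastError();
    if (err != cudaSuccess) return err;

    return cudaMemcpyAsync(h_x, d_x, bytes, cudaMemcpyDeviceToHost, stream);
}

int main()
{
    const int n = 1 << 20;
    float *h_x = nullptr, *d_x = nullptr;
    // Pinned host memory, needed for copies to be truly asynchronous.
    cudaMallocHost(&h_x, n * sizeof(float));
    cudaMalloc(&d_x, n * sizeof(float));
    for (int i = 0; i < n; ++i) h_x[i] = 1.0f;

    // Non-blocking stream: does not synchronize with the legacy default stream.
    cudaStream_t stream;
    cudaStreamCreateWithFlags(&stream, cudaStreamNonBlocking);

    cudaError_t err = lib_scale_async(h_x, d_x, n, 2.0f, stream);
    cudaStreamSynchronize(stream);
    std::printf("status: %s, h_x[0] = %f\n", cudaGetErrorString(err), h_x[0]);

    cudaStreamDestroy(stream);
    cudaFree(d_x);
    cudaFreeHost(h_x);
    return 0;
}
```

A caller with several independent inputs could create one stream per input and call the function once per stream, which gives the copy of one input a chance to overlap with the kernel of another.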