Open mfbalin opened 4 years ago
@mfbalin Could you please share the reproducer? Are these queues bound to the same context?
They should be bound to the same context; I assume that is the CUDA context. You can verify that from the code, though.
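One way to verify this is to compare the contexts of the queues directly. A minimal sketch, assuming SYCL 2020 and a system with GPU devices (the queue names are placeholders, not from the reproducer):

```cpp
#include <sycl/sycl.hpp>
#include <iostream>

int main() {
  // Two queues constructed independently; depending on the implementation,
  // each may get its own default context.
  sycl::queue q1{sycl::gpu_selector_v};
  sycl::queue q2{sycl::gpu_selector_v};
  std::cout << std::boolalpha
            << "same context: " << (q1.get_context() == q2.get_context())
            << "\n";

  // To guarantee a shared context, construct both queues from one
  // explicitly created sycl::context spanning the devices of interest.
  sycl::context ctx{sycl::device::get_devices(sycl::info::device_type::gpu)};
  sycl::queue q3{ctx, ctx.get_devices().front()};
}
```

Note that whether independently constructed queues share a context is implementation-defined, which is exactly why checking `get_context()` equality in the reproducer is worthwhile.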
Using multiple queues bound to different CUDA devices doesn't decrease the runtime at all, since the different GPUs are never actually used in parallel.
This issue is stale because it has been open 180 days with no activity. Remove stale label or comment or this will be automatically closed in 30 days.
I have some code that launches multiple kernels and distributes them across multiple queues bound to different CUDA devices. When only one GPU is used, we get the following dependency graph:
When the kernels are distributed among different devices, then we get the following graph:
I would expect the graph not to change much, and in particular I would expect no dependencies between the different "tri_kernel" kernels: the buffers they access are shared between the kernel launches, but they are only ever accessed in read-only mode. Yet in the latter dependency graph, even though I am using multiple devices, I observe that only one device runs at a time because of these dependencies.
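The pattern above can be sketched as follows. This is a hedged reconstruction, not the actual reproducer: the kernel body, buffer size, and queue setup are assumptions; only the read-only sharing of one buffer across queues on different devices reflects the report. Since all accesses are read-after-read, the runtime should in principle be free to run the submissions concurrently:

```cpp
#include <sycl/sycl.hpp>
#include <vector>

int main() {
  auto gpus = sycl::device::get_devices(sycl::info::device_type::gpu);

  std::vector<float> host(1024, 1.0f);
  sycl::buffer<float> shared{host.data(), sycl::range{1024}};

  // One queue per CUDA device.
  std::vector<sycl::queue> queues;
  for (auto &d : gpus) queues.emplace_back(d);

  for (auto &q : queues) {
    q.submit([&](sycl::handler &cgh) {
      // Read-only accessor: read-after-read carries no dependency, so
      // these submissions should only be ordered after prior writes to
      // `shared`, not after each other.
      sycl::accessor in{shared, cgh, sycl::read_only};
      cgh.parallel_for(sycl::range{1024}, [=](sycl::id<1> i) {
        float x = in[i];  // placeholder for the real tri_kernel work
        (void)x;
      });
    });
  }
  for (auto &q : queues) q.wait();
}
```

If a graph like this still shows edges between the kernels, the serialization is coming from the runtime's buffer-dependency tracking (or from implicit cross-device data movement) rather than from the access modes requested by the program.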