src-d / kmcuda

Large scale K-means and K-nn implementation on NVIDIA GPU / CUDA

CUDNN_STATUS_MAPPING_ERROR when running directly after TensorRT #116

Open ghost opened 3 years ago

ghost commented 3 years ago

How can I run kmcuda synchronously after a tensorRT model performs inference on the same GPU (in a loop)?

For instance, I'm already allocating page-locked buffers for my tensorRT model, but I don't explicitly allocate anything up front for kmeans_cuda to run on. Doesn't that mean there might be a conflict if both processes are accessing the GPU and don't totally "clean up" after themselves?

The error I get the next time tensorRT runs (only after kmcuda runs):

[TensorRT] ERROR: ../rtSafe/safeContext.cpp (133) - Cudnn Error in configure: 7 (CUDNN_STATUS_MAPPING_ERROR)
[TensorRT] ERROR: FAILED_EXECUTION: std::exception

(reported here: https://github.com/NVIDIA/TensorRT/issues/303)

So I guess, in general, my question is: how should/can I clean up after kmcuda runs? The reason I think preallocating buffers might somehow help is that a very similar SO issue reported that as the solution (for tensorflow and tensorRT on the same GPU).
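To make the setup concrete, here's a rough sketch of the loop I'm describing (the TensorRT side is stubbed out with random features; the shapes, feature dimension and cluster count are placeholders):

```python
import numpy as np
from libKMCUDA import kmeans_cuda  # kmcuda's Python wrapper


def run_tensorrt_inference(batch):
    # Placeholder for the real TensorRT execution (page-locked buffers,
    # engine, execution context, stream) described above.
    return np.random.rand(len(batch), 128).astype(np.float32)


batches = [np.zeros((32, 3, 224, 224), np.float32)] * 4   # placeholder input batches

for batch in batches:
    features = run_tensorrt_inference(batch)
    # kmeans_cuda runs right after inference, in the same thread and on the same GPU:
    centroids, assignments = kmeans_cuda(features, 8, seed=3, verbosity=0)
    # ...on the next iteration, TensorRT raises CUDNN_STATUS_MAPPING_ERROR
```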

Environment:

nvcr.io/nvidia/l4t-base:r32.4.4, cuda-10.2, tensorRT 7.1.3

ghost commented 3 years ago

What I do know is that this problem can be solved by isolating tensorRT from kmeans_cuda.

Here's how I've hackily fixed it: I run the tensorRT inference (with all its page-locked allocation, engine, stream, context, etc.) in one thread and run kmeans_cuda in a separate thread. A thread-safe queue passes the inference results through to the other thread that runs kmeans, roughly as sketched below. There - isolation! No more errors.
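In skeleton form (the TensorRT side is again stubbed out with random features; queue handling, shapes and the cluster count are placeholders):

```python
import queue
import threading

import numpy as np
from libKMCUDA import kmeans_cuda

results = queue.Queue()        # thread-safe hand-off of inference results
STOP = object()                # sentinel to shut the consumer down


def inference_worker(batches):
    # TensorRT lives entirely in this thread: engine, context, stream,
    # page-locked buffers (stubbed here with random features).
    for batch in batches:
        features = np.random.rand(len(batch), 128).astype(np.float32)
        results.put(features)
    results.put(STOP)


def kmeans_worker():
    # kmeans_cuda only ever runs in this thread, isolated from the thread
    # that owns the TensorRT engine, stream and buffers.
    while True:
        features = results.get()
        if features is STOP:
            break
        centroids, assignments = kmeans_cuda(features, 8, seed=3, verbosity=0)


batches = [np.zeros((32, 3, 224, 224), np.float32)] * 4   # placeholder input batches
t1 = threading.Thread(target=inference_worker, args=(batches,))
t2 = threading.Thread(target=kmeans_worker)
t1.start(); t2.start()
t1.join(); t2.join()
```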

But I have no idea why this works, and it feels extremely hacky. Are the devs willing to comment on best practices and caveats for running kmeans_cuda synchronously with other calls to the GPU (via tensorRT or otherwise)?

futureisatyourhand commented 1 year ago

I also encountered the same problem, except that I was loading two TRT models at the same time. My approach: first, map the torch2trt include and lib paths to the include and lib paths of the matching TensorRT version (e.g., TensorRT 8.2.3, TensorRT 7.1); then initialize the two TRT models in two separate classes; finally, use a single class that calls the two model classes. Note: before every forward call you need to add torch.cuda.set_device('cuda:0'). My problem was solved, and the stress test also passed. This approach worked in these environments: TensorRT 7.1.2 with torch2trt 0.3.0, and TensorRT 8.2.3 with torch2trt 0.4.0.
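In code, the structure is roughly this (a sketch only; the class names and weight paths are placeholders, and the model loading follows torch2trt's usual TRTModule save/load pattern):

```python
import torch
from torch2trt import TRTModule


class ModelA:
    def __init__(self, weights_path):
        self.model = TRTModule()
        self.model.load_state_dict(torch.load(weights_path))

    def __call__(self, x):
        torch.cuda.set_device('cuda:0')   # re-select the device before every forward
        return self.model(x)


class ModelB:
    def __init__(self, weights_path):
        self.model = TRTModule()
        self.model.load_state_dict(torch.load(weights_path))

    def __call__(self, x):
        torch.cuda.set_device('cuda:0')
        return self.model(x)


class Pipeline:
    # One class that uses the two model classes, keeping their initialization separate.
    def __init__(self):
        self.a = ModelA('model_a_trt.pth')   # placeholder weight paths
        self.b = ModelB('model_b_trt.pth')

    def __call__(self, x):
        return self.b(self.a(x))
```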

I think this is a resource and GPU contention issue.