Runtime error: Error Code 1: Cuda Runtime (invalid device context)

Tendo33 commented 3 months ago

Environment:

(ocr) ➜  OCRIntegrator git:(main) ✗ pip list | grep tensor     
nvidia-tensorrt           99.0.0
tensorrt                  10.0.1
tensorrt-cu12             10.0.1
tensorrt-cu12-bindings    10.0.1
tensorrt-cu12-libs        10.0.1

(ocr) ➜  OCRIntegrator git:(main) ✗ uvicorn main:app
2024-07-03 03:56:13 - INFO - /workspace/sunjinfeng/github_projet/OCRIntegrator/app/deepdoc/tensorrt_engine.py - line 13 - Start loading engine
[07/03/2024-03:56:13] [TRT] [W] Using an engine plan file across different models of devices is not recommended and is likely to affect performance or even cause errors.
2024-07-03 03:56:13 - INFO - /workspace/sunjinfeng/github_projet/OCRIntegrator/app/deepdoc/tensorrt_engine.py - line 17 - Completed loading /workspace/sunjinfeng/github_projet/OCRIntegrator/res/deepdoc/det.trt engine in 0.0303s
2024-07-03 03:56:13 - INFO - /workspace/sunjinfeng/github_projet/OCRIntegrator/app/deepdoc/tensorrt_engine.py - line 13 - Start loading engine
[07/03/2024-03:56:13] [TRT] [W] Using an engine plan file across different models of devices is not recommended and is likely to affect performance or even cause errors.
2024-07-03 03:56:13 - INFO - /workspace/sunjinfeng/github_projet/OCRIntegrator/app/deepdoc/tensorrt_engine.py - line 17 - Completed loading /workspace/sunjinfeng/github_projet/OCRIntegrator/res/deepdoc/rec.trt engine in 0.0233s
[07/03/2024-03:56:14] [TRT] [E] 1: [defaultAllocator.cpp::deallocate::52] Error Code 1: Cuda Runtime (invalid device context)
[07/03/2024-03:56:14] [TRT] [E] 1: [defaultAllocator.cpp::deallocate::52] Error Code 1: Cuda Runtime (invalid device context)
[07/03/2024-03:56:14] [TRT] [E] 1: [defaultAllocator.cpp::deallocate::52] Error Code 1: Cuda Runtime (invalid device context)
[07/03/2024-03:56:14] [TRT] [E] 1: [scopedCudaResources.cpp::~ScopedCudaStream::43] Error Code 1: Cuda Runtime (invalid device context)
[07/03/2024-03:56:14] [TRT] [E] 1: [scopedCudaResources.cpp::~ScopedCudaEvent::20] Error Code 1: Cuda Runtime (invalid device context)
[07/03/2024-03:56:14] [TRT] [E] 1: [scopedCudaResources.cpp::~ScopedCudaEvent::20] Error Code 1: Cuda Runtime (invalid device context)
[07/03/2024-03:56:14] [TRT] [E] 1: [scopedCudaResources.cpp::~ScopedCudaEvent::20] Error Code 1: Cuda Runtime (invalid device context)
[07/03/2024-03:56:14] [TRT] [E] 1: [scopedCudaResources.cpp::~ScopedCudaEvent::20] Error Code 1: Cuda Runtime (invalid device context)

Tendo33 commented 3 months ago

@peakhell @awesomeboy2 Could you share which pycuda version you are using?

peakhell commented 3 months ago

pycuda version: pycuda==2024.1

peakhell commented 3 months ago

I will add this to my README.

Tendo33 commented 3 months ago

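I still get the error above. My environment is exactly the same, except that my CUDA version is 12.2. Both det and rec load successfully. Could it be that pycuda has not been initialized?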

peakhell commented 3 months ago

You have multiple GPUs, which may affect how CUDA contexts are managed, so you need to specify explicitly which GPU device to use. I do not have a multi-GPU environment to test this. Find the load_model_cuda function and replace its CUDA context setup with the following; this may help:

    import os
    import pycuda.driver as cuda

    cuda.init()  # initialize the driver API (needed before cuda.Device unless already done elsewhere)
    device = cuda.Device(device_id)  # replace device_id with the index of your GPU
    context = device.make_context()  # creates the context and makes it current
    model_file_path = os.path.join(model_dir, nm + ".trt")

    engine = load_engine(model_file_path)
    return TrtModelEngine(engine, cuda_ctx=context), TrtModelEngine.get_input_tensor_names(engine)
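The device selection suggested above could be wrapped in a small helper that checks the requested index before creating the context. This is only a sketch, not code from the repository: `make_device_context` is a hypothetical name, and the driver module is passed in as a parameter so the helper can be tried without a GPU. It uses only `init()`, `Device.count()`, `Device(...)` and `make_context()` from `pycuda.driver`.

```python
def make_device_context(cuda, device_id=0):
    """Create a CUDA context on an explicitly chosen GPU.

    `cuda` is assumed to be the pycuda.driver module (or an object with the
    same init / Device.count / Device(...).make_context interface).
    """
    cuda.init()  # the driver must be initialized before Device is used
    count = cuda.Device.count()
    if not 0 <= device_id < count:
        raise ValueError(
            f"device_id {device_id} is out of range; {count} device(s) visible"
        )
    # make_context() creates the context and makes it current on this thread.
    return cuda.Device(device_id).make_context()


# Hypothetical usage on a machine with pycuda installed:
#   import pycuda.driver as cuda
#   context = make_device_context(cuda, device_id=0)
```

Failing early on a bad index gives a clearer message than a later TensorRT error, but whether explicit device selection fixes the error in this issue is not confirmed.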

Tendo33 commented 3 months ago

I tried this approach, but the same issue persists. In addition, the context stack is now no longer empty and cannot be released.

I wrote a script to manually clear the stack.

import pycuda.driver as cuda
cuda.init()
num_devices = cuda.Device.count()

for device_id in range(num_devices):
    device = cuda.Device(device_id)
    context = device.make_context()
    while True:
        try:
            cuda.Context.pop()
        except cuda.LogicError:
            break
    context.detach()  

print(f"All contexts for {num_devices} devices have been cleaned up.")

It doesn't seem to be working.🤧

peakhell commented 3 months ago

Pop the CUDA context manually once the engine has been loaded. I don't have a multi-GPU environment, but I don't think this is a difficult problem to solve: you just need to specify the GPU device explicitly and manage the CUDA context yourself.

    try:
        model_file_path = os.path.join(model_dir, nm + ".trt")
        engine = load_engine(model_file_path)
        return TrtModelEngine(engine, cuda_ctx=context), TrtModelEngine.get_input_tensor_names(engine)
    finally:
        context.pop()
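The try/finally pattern above can also be written as a reusable context manager, so that every push is paired with a pop even when an exception is raised. This is a sketch under assumptions: `current_context` is a hypothetical helper, and `ctx` is only assumed to expose `push()` and `pop()` the way `pycuda.driver.Context` does. Since `make_context()` already makes the new context current, this helper is meant for a context that has been popped after creation.

```python
from contextlib import contextmanager


@contextmanager
def current_context(ctx):
    """Make `ctx` current for the duration of the block, then pop it again.

    `ctx` is assumed to provide push() and pop(), as pycuda.driver.Context does.
    """
    ctx.push()
    try:
        yield ctx
    finally:
        # Runs on success and on error, so the context stack stays balanced.
        ctx.pop()
```

Engine loading or inference would then go inside `with current_context(context): ...`. Whether this resolves the error reported in this issue has not been tested.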

Tendo33 commented 3 months ago

It seems unrelated to whether it's a multi-GPU environment or not, as I'm still getting the same error even when a specific GPU is designated.😢

peakhell commented 3 months ago

Hmm, sorry that I can't help; it works fine in my environment. In theory, as long as the CUDA context is managed correctly, the errors above should not occur.

PureWaterCatt commented 3 months ago

@Tendo33 I have the same environment, but everything works for me. I don't know whether this will help you, but you could try the 12.0.1-cudnn8-devel-ubuntu20.04 NVIDIA image: I installed Python 3.11 and the repo's dependencies on top of that image, and it works fine. I did not modify the tensorrt_engine file, in particular nothing related to configuring the CUDA context. Based on my testing, it seems to use only my first GPU by default.

Please ignore the Memory-Usage and Volatile GPU-Util values, because those GPUs are running my large model; I just used the third GPU to test it.

peakhell / OCRIntegrator

Runtime error: Error Code 1: Cuda Runtime (invalid device context) #4