ultralytics / ultralytics

NEW - YOLOv8 šŸš€ in PyTorch > ONNX > OpenVINO > CoreML > TFLite
https://docs.ultralytics.com
GNU Affero General Public License v3.0

Cannot run OpenCV cuda operation and YOLO model prediction in different processes #9459

Closed: reaganch closed this issue 5 months ago

reaganch commented 5 months ago

Search before asking

YOLOv8 Component

Predict

Bug

I need to run OpenCV CUDA operations and YOLO TensorRT prediction in separate processes of a Python script. Since the OpenCV operations must run on the GPU, I built OpenCV (v4.9.0) with CUDA (v12.3.2) support, and then uninstalled the OpenCV package that pip installs automatically alongside ultralytics, since that prebuilt package does not support CUDA operations.
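
(As a quick sanity check, something like the following confirms that the custom CUDA-enabled build is the one Python actually imports; it is separate from the failing script below.)

import cv2

# A CUDA-enabled build reports version 4.9.0 and at least one CUDA device
print(cv2.__version__)
print(cv2.cuda.getCudaEnabledDeviceCount())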

When I run the OpenCV CUDA operation and YOLO TensorRT prediction in separate processes, I get the following error:

/home/<user>/.local/lib/python3.10/site-packages/torch/cuda/__init__.py:107: UserWarning: CUDA initialization: Unexpected error from cudaGetDeviceCount(). Did you run some cuda functions before calling NumCudaDevices() that might have already set an error? Error 304: OS call failed or operation not supported on this OS (Triggered internally at ../c10/cuda/CUDAFunctions.cpp:109.)
  return torch._C._cuda_getDeviceCount() > 0
Loading yolov8n.engine for TensorRT inference...
[04/01/2024-16:55:01] [TRT] [W] CUDA initialization failure with error: 304

If, on the other hand, I run the OpenCV CUDA operation and YOLO TensorRT prediction in the same process, it works as expected and I get no error messages:

Loading yolov8n.engine for TensorRT inference...
[04/01/2024-17:00:18] [TRT] [I] Loaded engine size: 19 MiB
[04/01/2024-17:00:18] [TRT] [I] [MemUsageChange] TensorRT-managed allocation in engine deserialization: CPU +0, GPU +19, now: CPU 0, GPU 19 (MiB)
[04/01/2024-17:00:18] [TRT] [I] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +20, now: CPU 0, GPU 39 (MiB)

image 1/1 <path>/bus.jpg: 640x640 4 persons, 1 bus, 4.3ms
Speed: 2.2ms preprocess, 4.3ms inference, 2.6ms postprocess per image at shape (1, 3, 640, 640)
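
For reference, the same-process variant that produces the output above is roughly:

import cv2
from ultralytics import YOLO

# Creating the CUDA stream and running the TensorRT prediction in one process works
stream = cv2.cuda.Stream_Null()
model = YOLO('yolov8n.engine', task='detect')
results = model.predict('bus.jpg')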

Environment

Ultralytics YOLOv8.1.39 šŸš€ Python-3.10.12 torch-2.0.1+cu117 CUDA:0 (NVIDIA GeForce GTX 1070, 8111MiB)
Setup complete āœ… (12 CPUs, 15.4 GB RAM, 72.5/109.0 GB disk)

Minimal Reproducible Example

import cv2
from ultralytics import YOLO
from multiprocessing import Process

# OpenCV CUDA operation in the parent process
stream = cv2.cuda.Stream_Null()

def fn():
    # Creating the stream here instead (same process as the prediction) works
    #stream = cv2.cuda.Stream_Null()

    # YOLO TensorRT prediction in a child process
    model = YOLO('yolov8n.engine', task='detect')
    #model = YOLO('yolov8n.pt', task='detect')
    results = model.predict('bus.jpg')

proc = Process(target=fn)
proc.start()

Additional

I tried switching the YOLO model from yolov8n.engine to yolov8n.pt to see what happens when I don't use TensorRT. In this case, when I run the OpenCV CUDA operation and YOLO prediction in separate processes, the prediction still completes, but I get the following warning:

/home/<user>/.local/lib/python3.10/site-packages/torch/cuda/__init__.py:107: UserWarning: CUDA initialization: CUDA driver initialization failed, you might not have a CUDA gpu. (Triggered internally at ../c10/cuda/CUDAFunctions.cpp:109.)
  return torch._C._cuda_getDeviceCount() > 0

image 1/1 <path>/bus.jpg: 640x480 4 persons, 1 bus, 1 stop sign, 33.0ms
Speed: 1.8ms preprocess, 33.0ms inference, 1.0ms postprocess per image at shape (1, 3, 640, 480)

When I run them in the same process, it again runs fine:

image 1/1 <path>/bus.jpg: 640x480 4 persons, 1 bus, 1 stop sign, 26.3ms
Speed: 1.6ms preprocess, 26.3ms inference, 1.7ms postprocess per image at shape (1, 3, 640, 480)

Are you willing to submit a PR?

github-actions[bot] commented 5 months ago

šŸ‘‹ Hello @reaganch, thank you for your interest in Ultralytics YOLOv8 šŸš€! We recommend a visit to the Docs for new users where you can find many Python and CLI usage examples and where many of the most common questions may already be answered.

If this is a šŸ› Bug Report, please provide a minimum reproducible example to help us debug it.

If this is a custom training ā“ Question, please provide as much information as possible, including dataset image examples and training logs, and verify you are following our Tips for Best Training Results.

Join the vibrant Ultralytics Discord šŸŽ§ community for real-time conversations and collaborations. This platform offers a perfect space to inquire, showcase your work, and connect with fellow Ultralytics users.

Install

Pip install the ultralytics package including all requirements in a Python>=3.8 environment with PyTorch>=1.8.

pip install ultralytics

Environments

YOLOv8 may be run in any of our up-to-date verified environments (with all dependencies including CUDA/CUDNN, Python and PyTorch preinstalled).

Status

Ultralytics CI

If this badge is green, all Ultralytics CI tests are currently passing. CI tests verify correct operation of all YOLOv8 Modes and Tasks on macOS, Windows, and Ubuntu every 24 hours and on every commit.

glenn-jocher commented 5 months ago

@reaganch hey there! It looks like you're encountering CUDA initialization conflicts when running OpenCV with CUDA and YOLO model predictions in separate processes. A couple of suggestions to hopefully resolve this:

  1. Ensure CUDA contexts aren't shared across processes. A CUDA context is tied to the process that created it, so CUDA state initialized in one process cannot be accessed from another.

  2. Try initializing CUDA in each child process after starting it. Instead of initializing CUDA (or any CUDA-related operations) globally or before spawning the new processes, ensure that every necessary CUDA operation (including model loading) occurs inside each child process. Remember to explicitly manage CUDA contexts if needed.

  3. Consider using torch.multiprocessing, a drop-in replacement for Python's multiprocessing module, which handles CUDA inter-process communication better.

Here's a rough idea of how you might adjust your multiprocessing setup:

import cv2
from ultralytics import YOLO
from torch.multiprocessing import Process, set_start_method

def fn():
    model = YOLO('yolov8n.engine', task='detect')
    results = model.predict('bus.jpg')

if __name__ == '__main__':
    set_start_method('spawn')  # Switch to 'spawn' to avoid issues with CUDA
    stream = cv2.cuda_Stream()  # Initialize CUDA operations here if necessary
    proc = Process(target=fn)   # Ensure model loading and CUDA operations are in here
    proc.start()
    proc.join()

Please make sure any CUDA state each operation depends on is created and managed within its own process. And remember to use if __name__ == '__main__': to guard the multiprocessing start-up code in Python scripts. šŸ˜Š

Let us know if this helps or if you encounter further issues!

reaganch commented 5 months ago

@glenn-jocher - Thanks a lot for that prompt response. Your suggested solution works like a charm!

glenn-jocher commented 5 months ago

@reaganch - You're welcome! šŸ˜Š Delighted to hear everything's working smoothly for you now. If any other questions or issues pop up, feel free to reach out. Happy coding!