ultralytics / yolov5

YOLOv5 🚀 in PyTorch > ONNX > CoreML > TFLite
https://docs.ultralytics.com
GNU Affero General Public License v3.0

Multiple threads using yolov5 model concurrent inference failed #13257

Open cucyuan opened 4 weeks ago

cucyuan commented 4 weeks ago


Question

When I use the yolov5 model for concurrent inference in two threads, after running for some time the CUDA utilization suddenly reaches 100% and then the program gets stuck. I've also set inplace=False, but it doesn't help. I hope to get some help, thank you!

Additional

No response

github-actions[bot] commented 4 weeks ago

👋 Hello @cucyuan, thank you for your interest in YOLOv5 🚀! Please visit our ⭐️ Tutorials to get started, where you can find quickstart guides for simple tasks like Custom Data Training all the way to advanced concepts like Hyperparameter Evolution.

If this is a 🐛 Bug Report, please provide a minimum reproducible example to help us debug it.

If this is a custom training ❓ Question, please provide as much information as possible, including dataset image examples and training logs, and verify you are following our Tips for Best Training Results.

Requirements

Python>=3.8.0 with all requirements.txt installed including PyTorch>=1.8. To get started:

git clone https://github.com/ultralytics/yolov5  # clone
cd yolov5
pip install -r requirements.txt  # install

Environments

YOLOv5 may be run in any of the following up-to-date verified environments (with all dependencies including CUDA/CUDNN, Python and PyTorch preinstalled):

Status

YOLOv5 CI

If this badge is green, all YOLOv5 GitHub Actions Continuous Integration (CI) tests are currently passing. CI tests verify correct operation of YOLOv5 training, validation, inference, export and benchmarks on macOS, Windows, and Ubuntu every 24 hours and on every commit.

Introducing YOLOv8 🚀

We're excited to announce the launch of our latest state-of-the-art (SOTA) object detection model for 2023 - YOLOv8 🚀!

Designed to be fast, accurate, and easy to use, YOLOv8 is an ideal choice for a wide range of object detection, image segmentation and image classification tasks. With YOLOv8, you'll be able to quickly and accurately detect objects in real-time, streamline your workflows, and achieve new levels of accuracy in your projects.

Check out our YOLOv8 Docs for details and get started with:

pip install ultralytics
glenn-jocher commented 4 weeks ago

@cucyuan hello,

Thank you for reaching out and providing detailed information about your issue. Running concurrent inferences with YOLOv5 can sometimes lead to resource contention, especially with CUDA.

Here are a few suggestions to help resolve this issue:

  1. Update to the Latest Version: Ensure you are using the latest version of YOLOv5 and all related dependencies. You can update YOLOv5 by pulling the latest changes from the repository:

    git pull
  2. CUDA and cuDNN Versions: Verify that your CUDA and cuDNN versions are compatible with your PyTorch installation. Sometimes mismatches can cause unexpected behavior.

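  3. Environment Variables: The following environment variables can help with debugging. CUDA reads them when it is first initialized, so set them before importing torch (or export them in the shell). CUDA_LAUNCH_BLOCKING=1 makes kernel launches synchronous so that errors are reported at the call that caused them; it slows inference and is a debugging aid rather than a fix. CUDA_VISIBLE_DEVICES limits which GPUs the process can see:

    import os
    os.environ['CUDA_LAUNCH_BLOCKING'] = '1'  # debugging only: synchronous kernel launches
    os.environ['CUDA_VISIBLE_DEVICES'] = '0'  # Adjust based on your GPU setup
    import torch  # import torch only after the variables are set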
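  4. Thread Management: Instead of using threads, consider using multiprocessing with a separate model in each process. Python's Global Interpreter Lock (GIL) prevents threads from running Python code in parallel, and a single model object is not guaranteed to be safe to call from several threads at once. Because CUDA cannot be re-initialized in a child process forked after the parent has used CUDA, use the 'spawn' start method and load the model inside each process. Here's an example using multiprocessing:

    import torch
    import torch.multiprocessing as mp
    
    def inference_process(data):
        # Load the model inside each process instead of passing it from the parent
        model = torch.hub.load('ultralytics/yolov5', 'yolov5s', pretrained=True)
        model.eval()
        results = model(data)
        results.print()  # a Process target's return value is discarded, so report here
    
    if __name__ == '__main__':
        mp.set_start_method('spawn', force=True)  # CUDA needs 'spawn' (or 'forkserver'), not 'fork'
    
        data = ...  # Your input data
        processes = []
        for _ in range(2):  # Number of concurrent processes
            p = mp.Process(target=inference_process, args=(data,))
            p.start()
            processes.append(p)
    
        for p in processes:
            p.join()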
  5. Memory Management: Ensure that your GPU has enough memory to handle multiple inferences. You can monitor GPU memory usage using tools like nvidia-smi.
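If you prefer to keep threads rather than switch to processes as in item 4, a common pattern is to let one dedicated thread own the model and run every inference, while the other threads only submit inputs through a queue and wait on a `concurrent.futures.Future`. The sketch below is a minimal, hedged illustration of that pattern, not part of the YOLOv5 API: `InferenceWorker` and `model_factory` are hypothetical names, and the factory is assumed to wrap something like `torch.hub.load('ultralytics/yolov5', 'yolov5s')`. Calls to the model are serialized, which avoids concurrent calls into one model object at the cost of parallelism.

```python
import queue
import threading
from concurrent.futures import Future


class InferenceWorker:
    """Run every call to one model on a single dedicated thread.

    `model_factory` is a hypothetical zero-argument callable that builds the
    model; it runs on the worker thread, so the model is created and used there.
    """

    def __init__(self, model_factory):
        self._requests = queue.Queue()
        self._ready = threading.Event()
        self._init_error = None
        self._thread = threading.Thread(
            target=self._run, args=(model_factory,), daemon=True
        )
        self._thread.start()
        self._ready.wait()  # block until the model is built (or failed to build)
        if self._init_error is not None:
            raise self._init_error

    def _run(self, model_factory):
        try:
            model = model_factory()
        except Exception as exc:  # report construction errors to __init__
            self._init_error = exc
            self._ready.set()
            return
        self._ready.set()
        while True:
            item = self._requests.get()
            if item is None:  # shutdown sentinel from close()
                break
            data, future = item
            if not future.set_running_or_notify_cancel():
                continue  # the caller cancelled before inference started
            try:
                future.set_result(model(data))
            except Exception as exc:  # hand inference errors back to the caller
                future.set_exception(exc)

    def submit(self, data):
        """Queue one input from any thread; returns a Future for the output."""
        future = Future()
        self._requests.put((data, future))
        return future

    def close(self):
        """Stop the worker after it finishes the requests already queued."""
        self._requests.put(None)
        self._thread.join()
```

Usage, assuming `img` is an input the YOLOv5 hub model accepts: create `worker = InferenceWorker(lambda: torch.hub.load('ultralytics/yolov5', 'yolov5s'))` once, then call `worker.submit(img).result()` from any thread. A simpler variant is to guard a shared model with a `threading.Lock`; the queue version additionally keeps model creation and every call on the same thread.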

If the issue persists, please provide additional details such as your environment setup (e.g., CUDA version, PyTorch version, GPU model) and any error messages you encounter. This will help in diagnosing the problem more accurately.
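To follow up on the nvidia-smi suggestion in item 5, the sketch below logs GPU utilization and memory over time by polling `nvidia-smi --query-gpu` with CSV output, which can show whether utilization jumps to 100% at the moment the program hangs. It assumes `nvidia-smi` is on `PATH`; the field names are standard `--query-gpu` fields (`nvidia-smi --help-query-gpu` lists them all), and `parse_gpu_csv`, `query_gpus` and `log_gpus` are hypothetical helpers. From the shell, `nvidia-smi --query-gpu=utilization.gpu,memory.used --format=csv -l 1` produces a similar once-per-second log.

```python
import csv
import subprocess
import time

# Standard `nvidia-smi --query-gpu` fields; `nvidia-smi --help-query-gpu` lists all.
FIELDS = ("index", "utilization.gpu", "memory.used", "memory.total")


def _to_int(value):
    """Convert one CSV cell to int; cells such as '[N/A]' become None."""
    try:
        return int(value)
    except ValueError:
        return None


def parse_gpu_csv(text):
    """Parse `--format=csv,noheader,nounits` output into one dict per GPU."""
    reader = csv.reader(text.strip().splitlines(), skipinitialspace=True)
    return [dict(zip(FIELDS, map(_to_int, row))) for row in reader]


def query_gpus():
    """Run nvidia-smi once (assumed to be on PATH) and parse its output."""
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=" + ",".join(FIELDS),
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout
    return parse_gpu_csv(out)


def log_gpus(interval_s=1.0, samples=None):
    """Print one timestamped line per GPU every interval_s seconds.

    Runs forever when samples is None; memory values are in MiB.
    """
    taken = 0
    while samples is None or taken < samples:
        stamp = time.strftime("%H:%M:%S")
        for gpu in query_gpus():
            print(stamp, gpu, flush=True)
        taken += 1
        time.sleep(interval_s)
```

Running `log_gpus()` in a separate process or terminal, rather than inside the program being debugged, keeps the log going even if the inference program itself stops responding.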

Thank you for your patience, and I hope this helps! If you have any further questions, feel free to ask.

cucyuan commented 4 weeks ago


Thank you for the solutions you provided. I have tried them all, but the problem is still not solved. My GPU model is A2000, the CUDA version is 11.7, the torch version is 2.0.0+cu117, the torchvision version is 0.15.0+cu117, and GPU memory is never fully occupied while the program runs. In addition, the same program runs in the same software environment on an RTX 4080 without these problems, which is very strange.

glenn-jocher commented 4 weeks ago

Hello @cucyuan,

Thank you for providing additional details about your setup. It's interesting that the issue does not occur with the RTX 4080 but does with the A2000. This suggests that the problem might be related to specific hardware or driver configurations.

Here are a few more steps you can take to troubleshoot and potentially resolve the issue:

  1. Driver Update: Ensure that your NVIDIA drivers are up-to-date. Sometimes, driver updates can resolve compatibility issues with specific hardware models.

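  2. CUDA Toolkit: Verify that the CUDA toolkit is correctly installed and configured. You can check the system toolkit version with nvcc. Note that pip-installed PyTorch wheels bundle their own CUDA runtime, so the CUDA version PyTorch actually uses is reported by torch.version.cuda:

    nvcc --version
    python -c "import torch; print(torch.version.cuda)"  # CUDA version PyTorch was built with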
  3. PyTorch Compatibility: Since you are using PyTorch 2.0.0+cu117, it may be worth trying a different PyTorch version to see if the issue persists. For example, to install PyTorch 1.13.1 built for CUDA 11.7:

    pip install torch==1.13.1+cu117 torchvision==0.14.1+cu117 --extra-index-url https://download.pytorch.org/whl/cu117
  4. Environment Isolation: Create a new virtual environment to ensure there are no conflicting dependencies. You can do this using venv or conda:

    python -m venv yolov5-env
    source yolov5-env/bin/activate  # On Windows use `yolov5-env\Scripts\activate`
    pip install -r requirements.txt
  5. Profiling and Debugging: Use profiling tools to identify where the bottleneck or issue might be occurring. You can use torch.profiler to get detailed insights:

    import torch
    from torch.profiler import profile, record_function, ProfilerActivity
    
    model = torch.hub.load('ultralytics/yolov5', 'yolov5s', pretrained=True)
    model.eval()
    
    data = ...  # Your input data
    
    with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA], record_shapes=True) as prof:
        with record_function("model_inference"):
            results = model(data)
    
    print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))
  6. Community Feedback: Sometimes, specific hardware issues might have been encountered by others in the community. Consider searching or posting on forums like PyTorch Forums or NVIDIA Developer Forums.
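Because the program hangs rather than crashing, it can also help to see where each thread is stuck. Below is a minimal sketch using Python's standard `faulthandler` module: `hang_watchdog` is a hypothetical helper that arms `faulthandler.dump_traceback_later`, which writes the Python stack of every thread if the wrapped block runs longer than `timeout_s`. It only shows Python frames, so a thread blocked inside a CUDA call appears at the line that called the model. faulthandler keeps a single process-wide timer (arming it again replaces the previous timeout), so arm it from one place rather than from both inference threads.

```python
import faulthandler
import sys
from contextlib import contextmanager


@contextmanager
def hang_watchdog(timeout_s=60.0, file=None):
    """Dump the Python stack of every thread if the body exceeds timeout_s.

    `file` must have a real file descriptor; it defaults to sys.stderr.
    faulthandler has one process-wide timer, so nested or concurrent use of
    this helper replaces the earlier timeout.
    """
    out = file if file is not None else sys.stderr
    faulthandler.dump_traceback_later(timeout_s, repeat=False, file=out)
    try:
        yield
    finally:
        # Disarm the timer once the block finishes in time (or raises).
        faulthandler.cancel_dump_traceback_later()
```

For example, wrap a single inference call as `with hang_watchdog(30): results = model(img)`. On Linux, calling `faulthandler.register(signal.SIGUSR1)` at startup also lets you dump all thread stacks on demand with `kill -USR1 <pid>` once the program is stuck.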

If the issue still persists, please update this issue with a minimal reproducible example of your threaded inference code along with the relevant details above. This will allow the community and developers to provide more targeted assistance.

Thank you for your patience and for being a part of the YOLO community! If you have any further questions, feel free to ask. 😊