ultralytics / ultralytics

NEW - YOLOv8 🚀 in PyTorch > ONNX > OpenVINO > CoreML > TFLite
https://docs.ultralytics.com
GNU Affero General Public License v3.0

Modify Yolov8 output size #14015

Open · tjasmin111 opened this issue 3 weeks ago

tjasmin111 commented 3 weeks ago


Question

I trained a YOLOv8 model, and here is the output size when I visualize it with Netron. Is there any way to decrease the number of detections (19320)? Is this just the total number of potential detections? On the device, it takes a long time to parse through these 19320 candidates, and I need to make it faster. I cannot decrease the input resolution.

[Netron screenshot of the model's output tensor, showing the 19320 detection candidates]


glenn-jocher commented 3 weeks ago

@tjasmin111 hello,

Thank you for reaching out with your question regarding the YOLOv8 output size. The number of detections (19320) you are seeing represents the total number of potential detections across all feature map levels. This is indeed a common characteristic of the YOLO architecture, where multiple anchor points are evaluated across different scales.
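
To make that number concrete: with the default strides of 8, 16, and 32, the candidate count is simply the total number of grid cells across the three feature maps. For example, assuming an input resolution of 1280 × 736 (one size consistent with 19320):

    (1280/8 × 736/8) + (1280/16 × 736/16) + (1280/32 × 736/32)
      = 160×92 + 80×46 + 40×23
      = 14720 + 3680 + 920
      = 19320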

To address your concern about the high number of candidates and the associated parsing time, here are a few suggestions:

  1. Post-Processing Optimization: You can optimize the post-processing step by adjusting the confidence threshold and Non-Maximum Suppression (NMS) parameters. This will help in filtering out low-confidence detections early on, reducing the number of candidates that need to be processed further.

    from ultralytics import YOLO

    model = YOLO('path/to/best.pt')
    results = model.predict(source='path/to/image.jpg', conf=0.5, iou=0.4)
  2. Model Pruning: Consider pruning the model to remove less significant weights (PyTorch ships utilities for this in torch.nn.utils.prune), which can help speed up inference without significantly compromising accuracy. This is an advanced technique and may require some experimentation.

  3. Custom Model Architecture: If feasible, you can modify the model architecture to reduce the number of anchor points or feature map sizes. This would involve changing the configuration files and retraining the model.

  4. Hardware Acceleration: Ensure that you are utilizing hardware acceleration options like TensorRT, which can significantly speed up inference times on compatible devices. You can export your model to TensorRT format as follows:

    yolo export model=path/to/best.pt format=engine

If you haven't already, please ensure you are using the latest version of the Ultralytics packages, as updates often include performance improvements and bug fixes.


I hope these suggestions help! If you have any further questions, feel free to ask.

tjasmin111 commented 3 weeks ago

I have this C++ post-processing code that loops over the candidates to extract detection info. Is there a way to optimize it? Is there a better way to do this?


// Prepare containers for detected boxes, confidences, and class IDs
std::vector<cv::Rect> boxes;
std::vector<float> confidences;
std::vector<int> class_ids;

// Iterate over the candidate columns of the output matrix
// (each column holds [x, y, w, h, class scores...] for one candidate)
for (int i = 0; i < cols; i++) {
    cv::Mat classes_scores = output.col(i).rowRange(4, output.rows);

    // Find the class with the maximum score (if one exists above threshold)
    double maxScore;
    cv::Point maxClassLoc;

    // Check for non-zero elements before calling minMaxLoc (optimization)
    if (cv::countNonZero(classes_scores) > 0) {
        // Call minMaxLoc to find the maximum value and its location
        cv::minMaxLoc(classes_scores, nullptr, &maxScore, nullptr, &maxClassLoc);

        if (maxScore >= 0.25) {
            int maxClassIndex = maxClassLoc.y;

            // Pre-fetch bounding box coordinates from output (optimization)
            float x = output.at<float>(0, i);
            float y = output.at<float>(1, i);
            float w = output.at<float>(2, i);
            float h = output.at<float>(3, i);

            // Convert center/size to a scaled top-left rectangle
            int x_scaled = static_cast<int>((x - 0.5f * w) * scale);
            int y_scaled = static_cast<int>((y - 0.5f * h) * scale);
            int w_scaled = static_cast<int>(w * scale);
            int h_scaled = static_cast<int>(h * scale);

            cv::Rect box(x_scaled, y_scaled, w_scaled, h_scaled);
            boxes.push_back(box);
            confidences.push_back(static_cast<float>(maxScore));
            class_ids.push_back(maxClassIndex);
        }
    }
}

// Apply Non-Maximum Suppression to remove redundant overlapping boxes
std::vector<int> indices;
cv::dnn::NMSBoxes(boxes, confidences, conf_thres, nms_thres, indices, 0.5);

glenn-jocher commented 3 weeks ago

Hello @tjasmin111,

Thank you for sharing your C++ post-processing code! Optimizing the parsing of detection candidates can indeed help improve performance. Here are a few suggestions to enhance your current implementation:

  1. Batch Processing: If possible, process multiple candidates in parallel using multi-threading (e.g., OpenMP) or SIMD (Single Instruction, Multiple Data) instructions. This can significantly speed up the loop; see the sketch after the revised code below.

  2. Early Exit for Low Scores: Instead of checking for non-zero elements, you could directly check if the maximum score is above the threshold. This avoids unnecessary operations for low-confidence detections.

  3. Efficient Memory Access: Ensure that memory access patterns are optimized. Accessing memory in a contiguous manner can help improve cache performance.

  4. Reduce Redundant Calculations: Pre-compute values that are used multiple times within the loop to avoid redundant calculations.

Here is a revised version of your code with some of these optimizations:

// Prepare containers for detected boxes, confidences, and class IDs
std::vector<cv::Rect> boxes;
std::vector<float> confidences;
std::vector<int> class_ids;

// Iterate over the candidate columns of the output matrix
for (int i = 0; i < cols; i++) {
    cv::Mat classes_scores = output.col(i).rowRange(4, output.rows);

    // Find the class with the maximum score (if one exists above threshold)
    double maxScore;
    cv::Point maxClassLoc;

    // Directly find the maximum score and its location; this replaces the
    // countNonZero pre-check, which scanned the same column a second time
    cv::minMaxLoc(classes_scores, nullptr, &maxScore, nullptr, &maxClassLoc);

    if (maxScore >= conf_thres) {  // use the same threshold passed to NMSBoxes
        int maxClassIndex = maxClassLoc.y;

        // Pre-fetch bounding box coordinates from output (optimization)
        float x = output.at<float>(0, i);
        float y = output.at<float>(1, i);
        float w = output.at<float>(2, i);
        float h = output.at<float>(3, i);

        // Convert center/size to a scaled top-left rectangle
        int x_scaled = static_cast<int>((x - 0.5f * w) * scale);
        int y_scaled = static_cast<int>((y - 0.5f * h) * scale);
        int w_scaled = static_cast<int>(w * scale);
        int h_scaled = static_cast<int>(h * scale);

        cv::Rect box(x_scaled, y_scaled, w_scaled, h_scaled);
        boxes.push_back(box);
        confidences.push_back(static_cast<float>(maxScore));
        class_ids.push_back(maxClassIndex);
    }
}

// Apply Non-Maximum Suppression to remove redundant overlapping boxes
// (the trailing 0.5 is NMSBoxes' optional eta parameter)
std::vector<int> indices;
cv::dnn::NMSBoxes(boxes, confidences, conf_thres, nms_thres, indices, 0.5);
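
For suggestions 1 and 3 in particular, here is a minimal sketch (not a drop-in replacement) of what a transposed, OpenMP-parallel version of the loop could look like. It assumes output is the same (4 + num_classes) × N CV_32F matrix as above, that scale and conf_thres come from your surrounding code, and that you compile with OpenMP enabled (e.g., -fopenmp):

// Transpose once so each candidate is a contiguous row: column access on a
// cv::Mat strides across memory, while row access is cache-friendly.
cv::Mat outputT;
cv::transpose(output, outputT);  // now N x (4 + num_classes)

const int numCandidates = outputT.rows;
const int numClasses = outputT.cols - 4;

std::vector<cv::Rect> boxes;
std::vector<float> confidences;
std::vector<int> class_ids;

#pragma omp parallel
{
    // Thread-local results avoid locking on every detection
    std::vector<cv::Rect> localBoxes;
    std::vector<float> localConf;
    std::vector<int> localIds;

    #pragma omp for nowait
    for (int i = 0; i < numCandidates; i++) {
        const float* row = outputT.ptr<float>(i);

        // Plain scan for the best class score (no per-candidate cv::Mat overhead)
        float maxScore = row[4];
        int maxClassIndex = 0;
        for (int c = 1; c < numClasses; c++) {
            if (row[4 + c] > maxScore) {
                maxScore = row[4 + c];
                maxClassIndex = c;
            }
        }
        if (maxScore < conf_thres) continue;  // early exit for low scores

        // Convert center/size to a scaled top-left rectangle
        const float x = row[0], y = row[1], w = row[2], h = row[3];
        localBoxes.emplace_back(static_cast<int>((x - 0.5f * w) * scale),
                                static_cast<int>((y - 0.5f * h) * scale),
                                static_cast<int>(w * scale),
                                static_cast<int>(h * scale));
        localConf.push_back(maxScore);
        localIds.push_back(maxClassIndex);
    }

    // Merge thread-local results; the order may vary between runs, which is
    // fine because NMSBoxes sorts candidates by score internally.
    #pragma omp critical
    {
        boxes.insert(boxes.end(), localBoxes.begin(), localBoxes.end());
        confidences.insert(confidences.end(), localConf.begin(), localConf.end());
        class_ids.insert(class_ids.end(), localIds.begin(), localIds.end());
    }
}

The transpose costs one extra pass over the data but pays for itself when scanning all 19320 candidates, and the same row-pointer pattern helps even in a single-threaded build.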

Additionally, you can consider leveraging hardware acceleration libraries such as OpenVINO or TensorRT for faster inference and post-processing. These libraries are optimized for performance on various hardware platforms.
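
For example, exporting to OpenVINO uses the same export interface as the TensorRT command shown earlier:

    yolo export model=path/to/best.pt format=openvino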

For more detailed guidance on optimizing your YOLOv8 implementation, you can refer to the YOLO Common Issues Guide.

I hope this helps! If you have any further questions, feel free to ask. 😊

tjasmin111 commented 2 weeks ago

Thanks. Back to the original question: is there anything I can do to decrease the output vector size (19320)? That is the number of candidates, and I'm thinking we don't need that many.

[Netron screenshot of the model output, again showing the 19320 candidates]

glenn-jocher commented 2 weeks ago

Hello @tjasmin111,

Thank you for your patience and for providing additional context. Reducing the number of candidates in the output vector can indeed help improve performance. Here are a few strategies you can consider:

  1. Anchor Reduction: In anchor-based YOLO variants (e.g., YOLOv5), you can reduce the number of anchors per scale in the model configuration file, and fewer anchors directly mean fewer candidate boxes. Note that YOLOv8's detection head is anchor-free, so its candidate count is determined by the feature-map sizes rather than by anchor settings.

  2. Feature Map Sizes: Adjust the feature maps used for detection. Reducing the resolution of these maps (i.e., increasing the strides) or removing a detection layer decreases the number of candidate boxes; this involves modifying the model YAML and retraining. See the worked numbers after this list.

  3. Custom Model Architecture: Create a custom YOLO model with fewer layers or different configurations that naturally produce fewer candidate boxes. This might involve significant changes to the model architecture and retraining.

  4. Post-Processing Optimization: As mentioned earlier, optimizing the post-processing step by adjusting the confidence threshold and Non-Maximum Suppression (NMS) parameters can help reduce the number of candidates that need to be processed.
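
To put rough numbers on option 2: with strides 8/16/32, the three detection levels contribute candidates in a 16:4:1 ratio, so the stride-8 (P3) level alone accounts for about 76% of them. Assuming the 1280 × 736 input from the arithmetic above, removing the P3 detection head (and retraining) would shrink the output like this:

    P3/8:  160 × 92 = 14720 candidates
    P4/16:  80 × 46 =  3680 candidates
    P5/32:  40 × 23 =   920 candidates
    without P3: 3680 + 920 = 4600 (down from 19320)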

Here is an example of the anchor settings as they appear in an anchor-based (YOLOv5-style) model configuration file:

anchors:
  - [10, 13, 16, 30, 33, 23]  # P3/8
  - [30, 61, 62, 45, 59, 119]  # P4/16
  - [116, 90, 156, 198, 373, 326]  # P5/32

You can reduce the number of anchor pairs in each scale to decrease the number of candidates in those models.

Additionally, if you haven't already, please ensure you are using the latest version of the Ultralytics packages, as updates often include performance improvements and bug fixes.


I hope these suggestions help! If you have any further questions, feel free to ask. 😊

tjasmin111 commented 2 weeks ago

Thanks. As for the anchors, what does each number represent? And how would I reduce them, say, for the Nano scale?

anchors:
  - [10, 13, 16, 30, 33, 23]  # P3/8
  - [30, 61, 62, 45, 59, 119]  # P4/16
  - [116, 90, 156, 198, 373, 326]  # P5/32
glenn-jocher commented 2 weeks ago

Hello @tjasmin111,

Great question! In an anchor-based configuration file, the numbers in the anchors array are (width, height) pairs, in pixels at the network's input resolution, for the anchor boxes used at each feature-map level.

Here's a breakdown for the first row: [10, 13, 16, 30, 33, 23] at P3/8 (the stride-8 feature map) defines three anchors of 10×13, 16×30, and 33×23 pixels; the P4/16 and P5/32 rows follow the same pattern at strides 16 and 32. (As noted above, YOLOv8's own head is anchor-free, so these settings only apply to anchor-based variants such as YOLOv5.)

To reduce the number of anchors, for a nano model or any other size, simply drop one (width, height) pair from each scale. For example, you could modify the anchors as follows:

anchors:
  - [10, 13, 16, 30]  # P3/8
  - [30, 61, 62, 45]  # P4/16
  - [116, 90, 156, 198]  # P5/32

This goes from three anchors per level down to two, which cuts the number of candidate boxes by a third in an anchor-based model and should help performance.

If you encounter any issues or need further assistance, please ensure you are using the latest version of the Ultralytics packages and provide a minimum reproducible example as outlined in the docs.

I hope this helps! If you have any further questions, feel free to ask. 😊