laugh12321 / TensorRT-YOLO

🚀 Your YOLO Deployment Powerhouse. With the synergy of TensorRT Plugins, CUDA Kernels, and CUDA Graphs, experience lightning-fast inference speeds.
https://github.com/laugh12321/TensorRT-YOLO
GNU General Public License v3.0

[Question]: Relationship Between TensorRT-YOLO Inference Precision Discrepancy and EfficientNMS Plugin Exported Engine #40

Closed · timarnoldev closed this issue 2 months ago

timarnoldev commented 2 months ago

Running the exact same model (tested with v8, v9 and v10) with this library results in very bad detection results. I converted them from .pt -> .onnx using trtyolo and the custom yolov10 repo respectively.

Some object classes aren't detected at all. Sometimes there are rare detections of random objects with very low confidence. On the other hand, running the model with TensorRT in Python delivers perfect results.

TensorRT-YOLO

[screenshot: TensorRT-YOLO detection results]

Python TensorRT Inference

[screenshot: Python TensorRT detection results]

Any ideas what the problem might be?

Export command I used:

trtyolo export -w best.pt -v yolov8 -o output --max_boxes 100 --iou_thres 0.45 --conf_thres 0.15 -b -1

laugh12321 commented 2 months ago

@timarnoldev Below is an inference accuracy comparison using YOLOv8s as an example. It can be observed that the TensorRT-YOLO inference engine has a slight accuracy loss compared to Ultralytics inference with the .pt model. The preprocessing methods used in tensorrt_yolo version 3.0 and TensorRT-YOLO are letterbox and gpuBilinearWarpAffine, respectively; their inference results show consistent accuracy, and the letterbox implementation in tensorrt_yolo version 3.0 is derived from Ultralytics' LetterBox. Therefore, the main cause of the accuracy loss is the engine exported by trtexec with the Efficient NMS Plugin. Reference: Efficient NMS Plugin Limitations.

As for the significant accuracy discrepancy you have shown, I find it puzzling. Please check if your processing workflow is consistent with mine.

Ultralytics FP16

from ultralytics import YOLO
model = YOLO("D:/Models/YOLOv8/yolov8s.pt")
model.predict("D:/Downloads/coco128/images/train2017/000000000077.jpg", save=True, imgsz=640, conf=0.25, iou=0.45, max_det=100, half=True, device="0")

000000000077

Ultralytics FP32

from ultralytics import YOLO
model = YOLO("D:/Models/YOLOv8/yolov8s.pt")
model.predict("D:/Downloads/coco128/images/train2017/000000000077.jpg", save=True, imgsz=640, conf=0.25, iou=0.45, max_det=100, device="0")

000000000077

TensorRT-YOLO FP16

trtyolo export -w yolov8s.pt -v yolov8 --imgsz 640 -b 1 --max_boxes 100 --iou_thres 0.45 --conf_thres 0.25 -o ./ -s
trtexec --onnx=yolov8s.onnx --saveEngine=yolov8s-fp16.engine --fp16
xmake run -P . detect -e D:/Models/YOLOv8/yolov8s-fp16.engine -i D:/Downloads/coco128/images/train2017/000000000077.jpg -o ./ -l labels.txt

000000000077-fp16

TensorRT-YOLO FP32

trtyolo export -w yolov8s.pt -v yolov8 --imgsz 640 -b 1 --max_boxes 100 --iou_thres 0.45 --conf_thres 0.25 -o ./ -s
trtexec --onnx=yolov8s.onnx --saveEngine=yolov8s-fp32.engine
xmake run -P . detect -e D:/Models/YOLOv8/yolov8s-fp32.engine -i D:/Downloads/coco128/images/train2017/000000000077.jpg -o ./ -l labels.txt

000000000077-fp32

trtyolo 3.0.2 FP16

pip install tensorrt_yolo
trtyolo export -w yolov8s.pt -v yolov8 --imgsz 640 -b 1 --max_boxes 100 --iou_thres 0.45 --conf_thres 0.25 -o ./ -s
trtexec --onnx=yolov8s.onnx --saveEngine=yolov8s-fp16.engine --fp16
trtyolo infer -e D:/Models/YOLOv8/yolov8s-fp16.engine -i D:/Downloads/coco128/images/train2017/000000000077.jpg -o ./ -l labels.txt

000000000077-py16

trtyolo 3.0.2 FP32

pip install tensorrt_yolo
trtyolo export -w yolov8s.pt -v yolov8 --imgsz 640 -b 1 --max_boxes 100 --iou_thres 0.45 --conf_thres 0.25 -o ./ -s
trtexec --onnx=yolov8s.onnx --saveEngine=yolov8s-fp32.engine
trtyolo infer -e D:/Models/YOLOv8/yolov8s-fp32.engine -i D:/Downloads/coco128/images/train2017/000000000077.jpg -o ./ -l labels.txt

000000000077-py32

timarnoldev commented 2 months ago

Thanks for your detailed reply @laugh12321. I use the engine file for the Python version as well as the C++ version, but with your EfficientNMS plugin as discussed in #38. Is it mandatory to use this plugin? NVIDIA seems to have deprecated it: https://github.com/NVIDIA/TensorRT/blob/release/10.1/plugin/efficientNMSPlugin/README.md

Maybe the problem is in the way preprocessing is handled. I used Roboflow for dataset preparation, which compresses the images down to 640x640. How is image scaling managed in TensorRT-YOLO?

laugh12321 commented 2 months ago

@timarnoldev For the TensorRT-YOLO project, the EfficientNMS plugin is mandatory. It replaces the use of CUDA Kernels for post-processing, thereby improving inference speed. In future versions, we will consider replacing the EfficientNMS plugin with the INMSLayer.
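
For reference, an engine built with the EfficientNMS plugin produces four output tensors (in exports like this they are typically named num_dets, det_boxes, det_scores, and det_classes; check your own engine's bindings). Below is a minimal Python sketch, not part of the TensorRT-YOLO API, showing how those outputs could be read for a single image once copied back to host memory:

def decode_efficient_nms(num_dets, det_boxes, det_scores, det_classes, batch_index=0):
    # num_dets: (batch, 1) int32, det_boxes: (batch, max_boxes, 4),
    # det_scores: (batch, max_boxes), det_classes: (batch, max_boxes)
    n = int(num_dets[batch_index][0])        # number of valid detections for this image
    boxes = det_boxes[batch_index][:n]       # (n, 4) boxes, x1, y1, x2, y2 in network-input space
    scores = det_scores[batch_index][:n]     # (n,) confidence scores
    classes = det_classes[batch_index][:n]   # (n,) class indices
    return boxes, scores, classes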

Regarding the preprocessing operations mentioned for Roboflow, I am not very familiar with them. I use the same preprocessing method as Ultralytics, which involves scaling the images while maintaining the aspect ratio. You can refer to this article for more information: https://medium.com/@mattia.digiusto/optimising-image-pre-processing-in-python-ac9157951bf6
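
As a rough illustration of that preprocessing, here is a CPU letterbox sketch in Python that follows the Ultralytics-style resize-and-pad approach (the 640x640 target and the gray 114 padding value are assumed defaults; this is not the actual gpuBilinearWarpAffine kernel, just the same idea):

import cv2

def letterbox(image, new_shape=(640, 640), color=(114, 114, 114)):
    # Resize while keeping the aspect ratio, then pad to new_shape.
    h, w = image.shape[:2]
    r = min(new_shape[0] / h, new_shape[1] / w)           # scale factor
    new_unpad = (int(round(w * r)), int(round(h * r)))    # resized (width, height)
    dw = (new_shape[1] - new_unpad[0]) / 2                # horizontal padding per side
    dh = (new_shape[0] - new_unpad[1]) / 2                # vertical padding per side
    resized = cv2.resize(image, new_unpad, interpolation=cv2.INTER_LINEAR)
    top, bottom = int(round(dh - 0.1)), int(round(dh + 0.1))
    left, right = int(round(dw - 0.1)), int(round(dw + 0.1))
    padded = cv2.copyMakeBorder(resized, top, bottom, left, right,
                                cv2.BORDER_CONSTANT, value=color)
    return padded, r, (dw, dh)                            # keep r and the padding to map boxes back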

timarnoldev commented 2 months ago

> Regarding the preprocessing operations mentioned for Roboflow, I am not very familiar with them. I use the same preprocessing method as Ultralytics, which involves scaling the images while maintaining the aspect ratio. You can refer to this article for more information: https://medium.com/@mattia.digiusto/optimising-image-pre-processing-in-python-ac9157951bf6

I just checked; I'm also using letterbox for the Python program. By the way, this is how I converted .pt -> .onnx back then: https://github.com/triple-Mu/YOLOv8-TensorRT?tab=readme-ov-file#export-end2end-onnx-with-nms

laugh12321 commented 2 months ago

@timarnoldev The conversion methods for TensorRT-YOLO and YOLOv8-TensorRT models are the same, and the inference results are identical.

Convert with YOLOv8-TensorRT, Inference with TensorRT-YOLO FP16

git clone https://github.com/triple-Mu/YOLOv8-TensorRT.git
cd YOLOv8-TensorRT
python export-det.py --weights D:\Models\YOLOv8\yolov8s.pt --iou-thres 0.45 --conf-thres 0.25 --topk 100 --opset 11 --sim --input-shape 1 3 640 640 --device cuda:0
cd D:\Models\YOLOv8
trtexec --onnx=yolov8s.onnx --saveEngine=yolov8s-fp16.engine --fp16
xmake run -P . detect -e D:/Models/YOLOv8/yolov8s-fp16.engine -i D:/Downloads/coco128/images/train2017/000000000077.jpg -o ./ -l labels.txt

000000000077

In the image below, yolov8s.onnx was exported using YOLOv8-TensorRT, while yolov8s-old.onnx was exported using TensorRT-YOLO. The two models are identical except for the output node names.

[screenshot: comparison of the two exported ONNX models]
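
A quick way to confirm this is to print the output node names of both files with the onnx package (the file names below follow the ones above; adjust the paths to your own exports):

import onnx

for path in ("yolov8s.onnx", "yolov8s-old.onnx"):
    model = onnx.load(path)
    print(path, [output.name for output in model.graph.output])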

timarnoldev commented 2 months ago

Good to know, thank you. Can the TensorRT version be the issue? With Python I used 8.6.1, and now 10.2.0. I'm so confused right now because in both cases I used the exact same .pt file.

laugh12321 commented 2 months ago

@timarnoldev It shouldn't be an issue with the TensorRT version. The precision should be the same between 8.6.1 and 10.2.0. To better diagnose the problem, could you please send me your .pt model and test data in a ZIP archive? This way, I can help you analyze it in more detail.

timarnoldev commented 2 months ago

best.pt.zip YOLO.zip

I also just found out that triple-Mu/YOLOv8-TensorRT has a C++ inference example as well. This works just fine with the expected results, but it only supports YOLOv8.

laugh12321 commented 2 months ago

[screenshot: detection results on the provided model and data]

@timarnoldev I used the provided model and data to perform inference with TensorRT-YOLO, and the accuracy of the results is normal. The vis.zip file contains the visualized results of the inference.

timarnoldev commented 2 months ago

That is strange. This is the code I used for inference. Could the problem be here?

#include <QCoreApplication>
#include <iostream>
#include <memory>   // std::shared_ptr, std::make_shared
#include <random>   // std::random_device, std::mt19937, std::uniform_int_distribution
#include <string>
#include <vector>
#include <opencv2/opencv.hpp>
#include "AIWorker.h"
#include "tensorrt/deploy/vision/detection.hpp"

AIWorker::AIWorker(QObject *parent)
        : QObject(parent) {
    m_running = false;
}

std::vector<std::pair<std::string, cv::Scalar>> generateLabelColorPairs() {
    std::vector<std::pair<std::string, cv::Scalar>> labelColorPairs;

    auto generateRandomColor = []() {
        std::random_device                 rd;
        std::mt19937                       gen(rd());
        std::uniform_int_distribution<int> dis(0, 255);
        return cv::Scalar(dis(gen), dis(gen), dis(gen));
    };

    labelColorPairs.emplace_back("ball", generateRandomColor());
    labelColorPairs.emplace_back("player_red", generateRandomColor());

    return labelColorPairs;
}

// Visualize detection results
void visualize(cv::Mat& image, const deploy::DetectionResult& result, const std::vector<std::pair<std::string, cv::Scalar>>& labelColorPairs) {
    for (size_t i = 0; i < result.num; ++i) {
        const auto& box       = result.boxes[i];
        int         cls       = result.classes[i];
        float       score     = result.scores[i];
        const auto& label     = labelColorPairs[cls].first;
        const auto& color     = labelColorPairs[cls].second;
        std::string labelText = label + " " + cv::format("%.2f", score);

        // Draw rectangle and label
        int      baseLine;
        cv::Size labelSize = cv::getTextSize(labelText, cv::FONT_HERSHEY_SIMPLEX, 0.6, 1, &baseLine);
        cv::rectangle(image, cv::Point(box.left, box.top), cv::Point(box.right, box.bottom), color, 2, cv::LINE_AA);
        cv::rectangle(image, cv::Point(box.left, box.top - labelSize.height), cv::Point(box.left + labelSize.width, box.top), color, -1);
        cv::putText(image, labelText, cv::Point(box.left, box.top), cv::FONT_HERSHEY_SIMPLEX, 0.6, cv::Scalar(255, 255, 255), 1);
    }
}

void AIWorker::Start() {
    m_running = true;

    // Load the detection engine once and build the label/colour table before the loop.
    std::shared_ptr<deploy::BaseDet> model = std::make_shared<deploy::DeployDet>("../ai.engine");
    std::vector<std::pair<std::string, cv::Scalar>> labels = generateLabelColorPairs();

    while (m_running) {
        QCoreApplication::processEvents(QEventLoop::WaitForMoreEvents, 1);

        // Wrap the current frame and run detection on it.
        deploy::Image image(currentImage.data, currentImage.cols, currentImage.rows);
        auto result = model->predict(image);
        visualize(currentImage, result, labels);

        emit imageAnalyzed(currentImage, currentImageID);
        //cv::imshow("AI", currentImage);

        nextid++;

    }
}

int AIWorker::getCurrentImageId() {
    return nextid;
}

void AIWorker::onImageReceivedAr(cv::Mat image, int id) {

    this->currentImage = image.clone();
    this->currentImageID = id;
    this->imagePresent = true;
}

void AIWorker::stop() {
    m_running = false;

}

laugh12321 commented 2 months ago

@timarnoldev You might want to first try using TensorRT-YOLO's demo/detect to verify if the accuracy is correct. All the C++ inference results we discussed earlier were obtained using this demo/detect.

laugh12321 commented 2 months ago

> Thanks for your detailed reply @laugh12321. I use the engine file for the Python version as well as the C++ version, but with your EfficientNMS plugin as discussed in #38. Is it mandatory to use this plugin? NVIDIA seems to have deprecated it: https://github.com/NVIDIA/TensorRT/blob/release/10.1/plugin/efficientNMSPlugin/README.md
>
> Maybe the problem is in the way preprocessing is handled. I used Roboflow for dataset preparation, which compresses the images down to 640x640. How is image scaling managed in TensorRT-YOLO?

@timarnoldev To clarify, NVIDIA/TensorRT has deprecated the EfficientNMSONNXPlugin plugin, not the EfficientNMS_TRT plugin. In fact, the efficientNMSPlugin defines two plugins: EfficientNMS_TRT and EfficientNMSONNXPlugin.

Based on testing, the inference accuracy is consistent whether using EfficientNMS_TRT, EfficientNMSONNXPlugin, or INMSLayer.
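
For anyone attaching EfficientNMS_TRT to an exported ONNX graph by hand, the usual pattern is to append the plugin node with onnx-graphsurgeon. The sketch below is illustrative only: it assumes the graph already exposes decoded tensors named "boxes" (1, N, 4) and "scores" (1, N, num_classes), and the file names and thresholds are placeholders rather than the actual TensorRT-YOLO or YOLOv8-TensorRT export code.

import numpy as np
import onnx
import onnx_graphsurgeon as gs

graph = gs.import_onnx(onnx.load("yolov8s.onnx"))

# Hypothetical existing outputs: decoded boxes and per-class scores.
boxes = next(t for t in graph.outputs if t.name == "boxes")
scores = next(t for t in graph.outputs if t.name == "scores")

max_boxes = 100
num_dets = gs.Variable("num_dets", dtype=np.int32, shape=[1, 1])
det_boxes = gs.Variable("det_boxes", dtype=np.float32, shape=[1, max_boxes, 4])
det_scores = gs.Variable("det_scores", dtype=np.float32, shape=[1, max_boxes])
det_classes = gs.Variable("det_classes", dtype=np.int32, shape=[1, max_boxes])

nms = gs.Node(
    op="EfficientNMS_TRT",
    name="efficient_nms",
    inputs=[boxes, scores],
    outputs=[num_dets, det_boxes, det_scores, det_classes],
    attrs={
        "plugin_version": "1",
        "background_class": -1,
        "max_output_boxes": max_boxes,
        "score_threshold": 0.25,
        "iou_threshold": 0.45,
        "score_activation": 0,  # scores are already in [0, 1]
        "box_coding": 0,        # boxes are x1, y1, x2, y2
    },
)
graph.nodes.append(nms)
graph.outputs = [num_dets, det_boxes, det_scores, det_classes]
graph.cleanup().toposort()
onnx.save(gs.export_onnx(graph), "yolov8s-nms.onnx")

trtexec then builds the engine from this graph exactly as in the commands shown earlier in the thread.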

timarnoldev commented 2 months ago

I just found the problem. I preprocessed the images myself, which didn't line up with the training preprocessing. Thank you very much for your help.