marcoslucianops / DeepStream-Yolo

NVIDIA DeepStream SDK 7.1 / 7.0 / 6.4 / 6.3 / 6.2 / 6.1.1 / 6.1 / 6.0.1 / 6.0 / 5.1 implementation for YOLO models
MIT License

About low accuracy on converted models #339

Open marcoslucianops opened 1 year ago

marcoslucianops commented 1 year ago

I compared the mAP of the get_wts model and the ONNX model, and both show an accuracy drop after TensorRT conversion. My conclusion is that TensorRT loses accuracy when it optimizes the layers.

YOLOv8n ONNX:

 Average Precision  (AP) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.343
 Average Precision  (AP) @[ IoU=0.50      | area=   all | maxDets=100 ] = 0.492
 Average Precision  (AP) @[ IoU=0.75      | area=   all | maxDets=100 ] = 0.373
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.178
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.381
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.471
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=  1 ] = 0.295
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets= 10 ] = 0.488
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.542
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.330
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.599
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.700

YOLOv8n get_wts_yolov8.py:

 Average Precision  (AP) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.343
 Average Precision  (AP) @[ IoU=0.50      | area=   all | maxDets=100 ] = 0.491
 Average Precision  (AP) @[ IoU=0.75      | area=   all | maxDets=100 ] = 0.372
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.178
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.381
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.470
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=  1 ] = 0.295
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets= 10 ] = 0.488
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.542
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.330
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.599
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.699

huytranvan2010 commented 1 year ago

@marcoslucianops could you please share code to evaluate .engine model?

marcoslucianops commented 1 year ago

> @marcoslucianops could you please share code to evaluate .engine model?

I will share it in the future.

huytranvan2010 commented 1 year ago

> @marcoslucianops could you please share code to evaluate .engine model?

Do I need the libnvdsinfer_custom_impl_Yolo.so file generated by the command "CUDA_VER=11.8 make -C nvdsinfer_custom_impl_Yolo" for evaluation, or is the .engine model enough?

marcoslucianops commented 1 year ago

My eval code is based on deepstream_python_apps, with some custom additions (image batch input, pycocotools, etc.). It uses DeepStream to generate the JSON that pycocotools then evaluates.
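For readers wondering what the pycocotools side of such an evaluation can look like, here is a minimal sketch. It is not the eval code described above, which has not been shared. The helper names `to_coco_results` and `evaluate_bbox` and the file paths are hypothetical, and the sketch assumes the detections are already available per image as (category_id, x, y, w, h, score) tuples in original-image pixels.

```python
import json


def to_coco_results(detections_by_image):
    """Flatten {image_id: [(category_id, x, y, w, h, score), ...]} into COCO result dicts.

    COCO results use [x, y, width, height] boxes in original-image pixels.
    """
    results = []
    for image_id, detections in detections_by_image.items():
        for category_id, x, y, w, h, score in detections:
            results.append({
                "image_id": image_id,
                "category_id": category_id,
                "bbox": [round(x, 3), round(y, 3), round(w, 3), round(h, 3)],
                "score": round(score, 5),
            })
    return results


def evaluate_bbox(annotation_json, results_json):
    """Run the standard COCO bbox evaluation (requires pycocotools)."""
    from pycocotools.coco import COCO
    from pycocotools.cocoeval import COCOeval

    coco_gt = COCO(annotation_json)
    coco_dt = coco_gt.loadRes(results_json)
    coco_eval = COCOeval(coco_gt, coco_dt, "bbox")
    coco_eval.evaluate()
    coco_eval.accumulate()
    coco_eval.summarize()  # prints the AP/AR table in the format shown in this thread
    return coco_eval.stats  # stats[0] is AP @[ IoU=0.50:0.95 ]


if __name__ == "__main__":
    # Hypothetical detection for one image; the box values are made up.
    results = to_coco_results({139: [(1, 412.8, 157.6, 53.1, 138.0, 0.91)]})
    print(json.dumps(results))
    # After writing `results` to a file (paths are assumptions):
    # evaluate_bbox("instances_val2017.json", "detections.json")
```

One thing to watch for: COCO category ids are not contiguous, so the 0-79 class index produced by a YOLO model has to be mapped to the COCO ids before evaluation.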

huytranvan2010 commented 1 year ago

> My eval code is based on deepstream_python_apps, with some custom additions (image batch input, pycocotools, etc.). It uses DeepStream to generate the JSON that pycocotools then evaluates.

I run inference on each image in COCO val and collect the labels to generate a JSON file. But I get a low mAP for the yolov7 fp32 .engine model: mAP0.5:0.95 = 0.4, mAP0.5 = 0.538, mAP0.75 = 0.435. That is too low compared to your benchmark, even though you use only the yolov6 fp16 .engine model.

marcoslucianops commented 1 year ago

In the models I've tested, there's no mAP difference between FP32 and FP16 engines. Are you using DeepStream to output the bboxes?

huytranvan2010 commented 1 year ago

> In the models I've tested, there's no mAP difference between FP32 and FP16 engines. Are you using DeepStream to output the bboxes?

Yes. I run the DeepStream app on the images and save the output (labels) to files by setting gie-kitti-output-dir. Then I collect the labels and generate JSON files to evaluate. My mAP is too low.

marcoslucianops commented 1 year ago

In the KITTI output, the bbox coordinates are relative to the streammux resolution you set. You need to rescale them to each validation image's resolution.
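The rescaling described above can be sketched as follows. This is only an illustration under stated assumptions: the frame is stretched to the streammux resolution without padding, each KITTI line ends with 15 numeric fields after the label, and the last of those fields is the confidence (older DeepStream dumps may not write it). The function names are hypothetical.

```python
def parse_kitti_line(line):
    """Parse one KITTI label line into (label, left, top, right, bottom, score).

    Assumes 15 numeric fields follow the label, with the 2D box in the 4th to
    7th of them and the confidence last. Parsing from the right keeps
    multi-word labels such as "traffic light" intact.
    """
    fields = line.split()
    label = " ".join(fields[:-15])
    left, top, right, bottom = (float(v) for v in fields[-12:-8])
    score = float(fields[-1])
    return label, left, top, right, bottom, score


def rescale_box(left, top, right, bottom, mux_size, image_size):
    """Map a box from streammux pixels back to original-image pixels.

    Assumes a plain stretch to the streammux resolution (no padding).
    """
    mux_w, mux_h = mux_size
    img_w, img_h = image_size
    return (left * img_w / mux_w, top * img_h / mux_h,
            right * img_w / mux_w, bottom * img_h / mux_h)


def to_coco_bbox(left, top, right, bottom):
    """Convert corner format to the [x, y, width, height] format COCO expects."""
    return [left, top, right - left, bottom - top]


if __name__ == "__main__":
    # Hypothetical KITTI line, streammux at 1920x1080, original image 640x480.
    line = "traffic light 0.0 0 0.0 960.0 540.0 1440.0 810.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.87"
    label, left, top, right, bottom, score = parse_kitti_line(line)
    box = rescale_box(left, top, right, bottom, (1920, 1080), (640, 480))
    print(label, score, to_coco_bbox(*box))
```

If padding is enabled in the pipeline, the padding offset would also have to be subtracted before scaling, which this sketch does not handle.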

huytranvan2010 commented 1 year ago

> In the KITTI output, the bbox coordinates are relative to the streammux resolution you set. You need to rescale them to each validation image's resolution.

Yes, I noticed that and rescaled the boxes to the image size, but the mAP is still too low.

marcoslucianops commented 1 year ago

Did you set

[class-attrs-all]
nms-iou-threshold=0.65
pre-cluster-threshold=0.001
topk=300

in the config_infer_primary_yoloV7.txt file?

huytranvan2010 commented 1 year ago

> Did you set
>
> [class-attrs-all]
> nms-iou-threshold=0.65
> pre-cluster-threshold=0.001
> topk=300
>
> in the config_infer_primary_yoloV7.txt file?

Did you use the config above to get your benchmark results? I used the default setup:

nms-iou-threshold=0.45
pre-cluster-threshold=0.25
topk=300

marcoslucianops commented 1 year ago

The evaluation uses different NMS and confidence thresholds. Try with the values I sent.
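To illustrate why the confidence threshold matters so much for mAP, here is a small sketch on made-up data. AP integrates precision over recall, so dropping low-confidence detections cuts off the tail of the precision-recall curve. The `average_precision` helper is hypothetical and simplified (single class, matches already decided, all-point interpolation); pycocotools uses a 101-point interpolation averaged over IoU thresholds and classes.

```python
def average_precision(detections, num_gt, conf_threshold):
    """AP for one class from (score, is_true_positive) pairs.

    Detections below conf_threshold are dropped, as pre-cluster-threshold
    would do before the boxes ever reach the evaluator.
    """
    kept = sorted((d for d in detections if d[0] >= conf_threshold), reverse=True)
    precisions, recalls = [], []
    true_positives = 0
    for rank, (_, is_tp) in enumerate(kept, start=1):
        true_positives += int(is_tp)
        precisions.append(true_positives / rank)
        recalls.append(true_positives / num_gt)
    # Interpolate: make precision non-increasing as recall grows.
    for i in range(len(precisions) - 2, -1, -1):
        precisions[i] = max(precisions[i], precisions[i + 1])
    ap, prev_recall = 0.0, 0.0
    for precision, recall in zip(precisions, recalls):
        ap += (recall - prev_recall) * precision
        prev_recall = recall
    return ap


# Made-up detections for a class with 5 ground-truth objects.
TOY_DETECTIONS = [(0.90, True), (0.80, True), (0.60, False),
                  (0.30, True), (0.10, True), (0.05, False)]

if __name__ == "__main__":
    for threshold in (0.25, 0.001):
        ap = average_precision(TOY_DETECTIONS, num_gt=5, conf_threshold=threshold)
        print(f"pre-cluster-threshold={threshold}: AP={ap:.2f}")
```

On this toy data the AP rises from 0.55 at a threshold of 0.25 to 0.72 at 0.001, because the true positive with score 0.10 only counts when the threshold is low enough to keep it.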

huytranvan2010 commented 1 year ago

> The evaluation uses different NMS and confidence thresholds. Try with the values I sent.

Thanks a lot for supporting me. I am going to try it now😍

huytranvan2010 commented 1 year ago

> Did you set
>
> [class-attrs-all]
> nms-iou-threshold=0.65
> pre-cluster-threshold=0.001
> topk=300
>
> in the config_infer_primary_yoloV7.txt file?

I used this setup. The mAP is better, but it is still lower than your YOLOv7 benchmark. Here is my result for the fp32 .engine model:

 Average Precision  (AP) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.449
 Average Precision  (AP) @[ IoU=0.50      | area=   all | maxDets=100 ] = 0.623
 Average Precision  (AP) @[ IoU=0.75      | area=   all | maxDets=100 ] = 0.485

I attached my configs (I have many configs, such as final_config_1.txt): config_infer_primary_yoloV7.txt, final_config_1.txt

marcoslucianops commented 1 year ago

My eval code is fine-tuned to extract the best mAP from DeepStream; that's why I get a slightly higher mAP.

huytranvan2010 commented 1 year ago

> In the models I've tested, there's no mAP difference between FP32 and FP16 engines. Are you using DeepStream to output the bboxes?

@marcoslucianops Do you mean the YOLOv7 model? I saw that your fp16 .engine model has mAP0.5:0.95 = 0.476, which means the fp32 .engine model also has mAP0.5:0.95 = 0.476. That is too low compared with the reference .pt model, which has mAP0.5:0.95 = 0.514: https://github.com/WongKinYiu/yolov7#performance

marcoslucianops commented 1 year ago

There's a drop on TensorRT compared to the PyTorch model. In some models, it's a significant drop. In other models (like PPYOLOE and YOLO-NAS), it's small. The test I did compared the ONNX export method with the wts and cfg export method. There's no drop between those two export methods.

huytranvan2010 commented 1 year ago

> There's a drop on TensorRT compared to the PyTorch model. In some models, it's a significant drop. In other models (like PPYOLOE and YOLO-NAS), it's small. The test I did compared the ONNX export method with the wts and cfg export method. There's no drop between those two export methods.

Thanks a lot. I expected fp32 not to lose much mAP. If the mAP of fp32 or fp16 drops this much, the mAP of int8 will be even lower.

marcoslucianops commented 1 year ago

The FP16 and FP32 mAP are equal.

huytranvan2010 commented 1 year ago

> The FP16 and FP32 mAP are equal.

Yeah, I think so. In your opinion, what is the reason for the big mAP drop of the fp32 and fp16 engines compared with the .pt models? I mean some models, including yolov7. I saw that yolov7 fp16 drops by about 4%.
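For reference, the "about 4%" figure presumably comes from the two YOLOv7 numbers quoted earlier in this thread (0.514 for the .pt model, 0.476 for the FP16 engine). A quick check of the arithmetic shows it is an absolute difference in mAP points, not a relative one:

```python
pt_map = 0.514      # reference .pt YOLOv7 mAP0.5:0.95 quoted in this thread
engine_map = 0.476  # FP16 .engine mAP0.5:0.95 quoted in this thread

absolute_drop = pt_map - engine_map     # difference in mAP
relative_drop = absolute_drop / pt_map  # fraction of the reference mAP

print(f"absolute drop: {absolute_drop:.3f} ({absolute_drop * 100:.1f} points)")
print(f"relative drop: {relative_drop * 100:.1f}%")
```

So the drop is 3.8 mAP points, or about 7.4% relative to the PyTorch result.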

marcoslucianops commented 1 year ago

In my opinion, TensorRT layers are performance-focused, making some tweaks to precision and parameters. So it's faster, but loses some accuracy.

huytranvan2010 commented 1 year ago

> In my opinion, TensorRT layers are performance-focused, making some tweaks to precision and parameters. So it's faster, but loses some accuracy.

Thanks for sharing.

cgrtrifork commented 1 year ago

Could this be related to the inputs being different, not only to TensorRT tweaks? For instance, in YOLOv8 it looks like symmetric padding is done with a gray value rather than with black, as DeepStream's nvstreammux does.
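To make the padding difference concrete, here is a small sketch of symmetric letterbox padding with a configurable fill value. It is pure Python on a tiny grayscale image so that it stays self-contained; the function names are hypothetical, and the value 114 for the gray fill is the one mentioned later in this thread. It is not the preprocessing code of either YOLOv8 or DeepStream.

```python
def letterbox_geometry(src_w, src_h, dst_w=640, dst_h=640):
    """Return (new_w, new_h, pad_left, pad_top) for an aspect-preserving
    resize centred on a dst_w x dst_h canvas (symmetric padding)."""
    scale = min(dst_w / src_w, dst_h / src_h)
    new_w = round(src_w * scale)
    new_h = round(src_h * scale)
    pad_left = (dst_w - new_w) // 2
    pad_top = (dst_h - new_h) // 2
    return new_w, new_h, pad_left, pad_top


def pad_rows(rows, dst_w, dst_h, pad_left, pad_top, pad_value):
    """Place an already resized grayscale image (list of rows) on a canvas
    filled with pad_value."""
    canvas = [[pad_value] * dst_w for _ in range(dst_h)]
    for y, row in enumerate(rows):
        canvas[pad_top + y][pad_left:pad_left + len(row)] = row
    return canvas


if __name__ == "__main__":
    # A 1280x720 frame fits a 640x640 input as 640x360 with bars above and below.
    print(letterbox_geometry(1280, 720))

    # Tiny 4x2 "image" on a 4x4 canvas: same content, different bars.
    image = [[255] * 4 for _ in range(2)]
    _, _, left, top = letterbox_geometry(4, 2, 4, 4)
    gray = pad_rows(image, 4, 4, left, top, pad_value=114)  # gray bars
    black = pad_rows(image, 4, 4, left, top, pad_value=0)   # black bars
    for row_gray, row_black in zip(gray, black):
        print(row_gray, row_black)
```

The image content is identical in both cases; only the bars differ, which is enough to change what the network sees near the borders.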

Edit: I also saw the following warning when running with exported ONNX models. Could this be another reason for the drop in performance? Is it possible to export using INT32 instead of INT64?

WARNING: [TRT]: onnx2trt_utils.cpp:377: Your ONNX model has been generated with INT64 weights, while TensorRT does not natively support INT64. Attempting to cast down to INT32.
WARNING: [TRT]: Tensor DataType is determined at build time for tensors not marked as input or output.
WARNING: [TRT]: onnx2trt_utils.cpp:403: One or more weights outside the range of INT32 was clamped
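As a rough illustration of what the cast in these warnings amounts to, the sketch below clamps a few integers to the INT32 range. It is an assumption, not something verified for this model, that the INT64 tensors in the exported graph are shape constants and indices rather than learned weights. The example values are hypothetical, and `cast_int64_to_int32` is not a TensorRT function.

```python
INT32_MIN = -(2 ** 31)
INT32_MAX = 2 ** 31 - 1


def cast_int64_to_int32(value):
    """Clamp an integer to the INT32 range; returns (clamped_value, was_clamped)."""
    clamped = max(INT32_MIN, min(INT32_MAX, value))
    return clamped, clamped != value


if __name__ == "__main__":
    # Hypothetical examples: a dimension size, a reshape wildcard, and an
    # INT64_MAX sentinel of the kind used to mean "slice to the end".
    for value in (640, -1, 2 ** 63 - 1):
        print(value, "->", cast_int64_to_int32(value))
```

Values inside the INT32 range pass through unchanged; only out-of-range values are altered. If the clamped values are sentinels like the last one, the clamp should not change the result, but that would need to be checked against the actual model.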

In any case, it would be good to have a table of the expected drop for each of the models, as a reference.

WangFengtu1996 commented 9 months ago

@cgrtrifork any update? I get the same warning: "Your ONNX model has been generated with INT64 weights, while TensorRT does not natively support INT64. Attempting to cast down to INT32." How can I solve this problem?

WangFengtu1996 commented 9 months ago

I have a question about running inference with yolov8s in DeepStream 6.3 on an NVIDIA AGX Orin DK. With fp32 on the GPU I get ~30 fps, but with fp16 on GPU + DLA0 I get ~11 fps, which is a significant drop.

Could someone explain this and give me some guidance?

cgrtrifork commented 9 months ago

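I ran the following experiment: I am trying out YOLOv8 object detection on an image that contains an object.

  1. I used this repository to export the model to ONNX. Then, using ffmpeg, I generated a single-frame video that I feed into DeepStream with a confidence threshold pre-cluster-threshold=0.2.
     * When I use NMS clustering (cluster-mode=2, nms-iou-threshold=0.5) the object is not found.
     * If I disable the clustering (cluster-mode=4) then an object is found with confidence 0.77.
  2. I used Triton Inference Server to serve the same TensorRT model that is generated when running DeepStream. Then I ran the inference on the same image. I preprocessed the image to get a 3x640x640 image of float32 between 0 and 1 in RGB format, as expected by the model.
     * When I use a gray background for the padding (pixel value = 114/255), like YOLO does, the max score of the output is 0.87.
     * When I use a black background for the padding (pixel value = 0), like DeepStream does, the max score of the output is 0.90.

Having used the same TensorRT model, this makes me think there is an issue either in the parsing and interpretation of the model output, or deeper, in DeepStream's lower-level preprocessing of the image.

  1. Why does enabling the NMS remove the detection? If the detection has the maximum score, the NMS shouldn't remove it.
  2. Why are the scores different between DeepStream's nvinfer plugin and Triton Inference Server?

For completeness:

  * The image I'm using was originally extracted from a video by doing:

# extract all the frames from the original video into a folder
# frames are enumerated starting from 1
ffmpeg -i original_video.mp4 original_video/%05d.jpg

Then I chose the frame to use (number 84), and I created the single-frame video by doing:

# frames start from 0, that's why we choose 84-1=83
ffmpeg -i original_video.mp4 -vf "select=eq(n\,83)" single_frame_video.mp4

  * The pipeline I'm using in DeepStream is: nvurisrcbin -> videorate -> nvvideoconvert -> capsfilter -> nvstreammux -> queue -> nvvideoconvert -> capsfilter -> nvinfer -> fakesink. I'm adding a probe after the nvinfer to see the detections.

@marcoslucianops have you tried evaluating the engine file outside of DeepStream?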

cgrtrifork commented 9 months ago

> I ran the following experiment: I am trying out YOLOv8 object detection on an image that contains an object.
>
>   1. I used this repository to export the model to ONNX. Then, using ffmpeg, I generated a single-frame video that I feed into DeepStream with a confidence threshold pre-cluster-threshold=0.2.
>      * When I use NMS clustering (cluster-mode=2, nms-iou-threshold=0.5) the object is not found.
>      * If I disable the clustering (cluster-mode=4) then an object is found with confidence 0.77.
>   2. I used Triton Inference Server to serve the same TensorRT model that is generated when running DeepStream. Then I ran the inference on the same image. I preprocessed the image to get a 3x640x640 image of float32 between 0 and 1 in RGB format, as expected by the model.
>      * When I use a gray background for the padding (pixel value = 114/255), like YOLO does, the max score of the output is 0.87.
>      * When I use a black background for the padding (pixel value = 0), like DeepStream does, the max score of the output is 0.90.
>
> Having used the same TensorRT model, this makes me think there is an issue either in the parsing and interpretation of the model output, or deeper, in DeepStream's lower-level preprocessing of the image.
>
>   1. Why does enabling the NMS remove the detection? If the detection has the maximum score, the NMS shouldn't remove it.
>   2. Why are the scores different between DeepStream's nvinfer plugin and Triton Inference Server?
>
> For completeness:
>
>   * The image I'm using was originally extracted from a video by doing:
>
> # extract all the frames from the original video into a folder
> # frames are enumerated starting from 1
> ffmpeg -i original_video.mp4 original_video/%05d.jpg
>
> Then I chose the frame to use (number 84), and I created the single-frame video by doing:
>
> # frames start from 0, that's why we choose 84-1=83
> ffmpeg -i original_video.mp4 -vf "select=eq(n\,83)" single_frame_video.mp4
>
>   * The pipeline I'm using in DeepStream is: nvurisrcbin -> videorate -> nvvideoconvert -> capsfilter -> nvstreammux -> queue -> nvvideoconvert -> capsfilter -> nvinfer -> fakesink. I'm adding a probe after the nvinfer to see the detections.
>
> @marcoslucianops have you tried evaluating the engine file outside of DeepStream?

Following up on this, I found out that the parsing in NvDsInferParseYolo seems to be correct for this case. However, the resulting detection from DeepStream is not the one with the highest confidence. Here are the logs from DeepStream (I added print statements to the library):

[Class 0] Box proposal with confidence 0.750208: x1=185.988, y1=141.614, x2=499.038, y2=417.46 (threshold: 0.2)
[Class 0] BBI with confidence 0.750208: left=185.988, top=141.614, width=313.05, height=275.846
[Class 0] Box proposal with confidence 0.881455: x1=184.819, y1=141.771, x2=497.486, y2=416.067 (threshold: 0.2)
[Class 0] BBI with confidence 0.881455: left=184.819, top=141.771, width=312.667, height=274.296
[Class 0] Box proposal with confidence 0.886627: x1=185.479, y1=141.421, x2=499.159, y2=415.547 (threshold: 0.2)
[Class 0] BBI with confidence 0.886627: left=185.479, top=141.421, width=313.68, height=274.127
[Class 0] Box proposal with confidence 0.877862: x1=185.409, y1=141.396, x2=499.173, y2=415.735 (threshold: 0.2)
[Class 0] BBI with confidence 0.877862: left=185.409, top=141.396, width=313.764, height=274.339
[Class 0] Box proposal with confidence 0.866284: x1=184.766, y1=141.94, x2=497.723, y2=416.012 (threshold: 0.2)
[Class 0] BBI with confidence 0.866284: left=184.766, top=141.94, width=312.958, height=274.072
[Class 0] Box proposal with confidence 0.854519: x1=184.601, y1=141.577, x2=499.699, y2=415.742 (threshold: 0.2)
[Class 0] BBI with confidence 0.854519: left=184.601, top=141.577, width=315.097, height=274.165
[Class 0] Box proposal with confidence 0.856617: x1=185.726, y1=141.448, x2=499.246, y2=415.667 (threshold: 0.2)
[Class 0] BBI with confidence 0.856617: left=185.726, top=141.448, width=313.52, height=274.219
[Class 0] Box proposal with confidence 0.770557: x1=184.458, y1=142.046, x2=498.037, y2=416.368 (threshold: 0.2)
[Class 0] BBI with confidence 0.770557: left=184.458, top=142.046, width=313.579, height=274.322
[Class 0] Box proposal with confidence 0.752778: x1=184.416, y1=141.868, x2=499.955, y2=416.512 (threshold: 0.2)
[Class 0] BBI with confidence 0.752778: left=184.416, top=141.868, width=315.539, height=274.644
[Class 0] Box proposal with confidence 0.725658: x1=185.231, y1=141.948, x2=499.762, y2=416.444 (threshold: 0.2)
[Class 0] BBI with confidence 0.725658: left=185.231, top=141.948, width=314.531, height=274.496
[Class 0] Box proposal with confidence 0.23098: x1=184.02, y1=141.643, x2=500.4, y2=416.408 (threshold: 0.2)
[Class 0] BBI with confidence 0.23098: left=184.02, top=141.643, width=316.38, height=274.764
Objects decoded: 11
ObjectList after assignment: 11
2024-02-12 14:21:33,705 [INFO][root]     Frame number: 0
2024-02-12 14:21:33,705 [INFO][root]     [Class 0] Found object with confidence = 0.7705574035644531: left=332.0246276855469, top=0.0827464759349823, width=564.4418334960938, height=494.5528259277344
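As a cross-check of what NMS should return for these proposals, here is a textbook greedy NMS run on the 11 boxes from the log above. This is a reference sketch, not DeepStream's clustering code; the `Box`, `iou` and `nms` names are hypothetical.

```python
from collections import namedtuple

Box = namedtuple("Box", "score x1 y1 x2 y2")


def iou(a, b):
    """Intersection over union of two corner-format boxes."""
    inter_w = max(0.0, min(a.x2, b.x2) - max(a.x1, b.x1))
    inter_h = max(0.0, min(a.y2, b.y2) - max(a.y1, b.y1))
    inter = inter_w * inter_h
    area_a = (a.x2 - a.x1) * (a.y2 - a.y1)
    area_b = (b.x2 - b.x1) * (b.y2 - b.y1)
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, iou_threshold):
    """Greedy NMS: keep the best box, drop everything overlapping it, repeat."""
    remaining = sorted(boxes, key=lambda b: b.score, reverse=True)
    kept = []
    while remaining:
        best = remaining.pop(0)
        kept.append(best)
        remaining = [b for b in remaining if iou(best, b) <= iou_threshold]
    return kept


# The 11 class-0 proposals printed in the DeepStream log above.
PROPOSALS = [
    Box(0.750208, 185.988, 141.614, 499.038, 417.46),
    Box(0.881455, 184.819, 141.771, 497.486, 416.067),
    Box(0.886627, 185.479, 141.421, 499.159, 415.547),
    Box(0.877862, 185.409, 141.396, 499.173, 415.735),
    Box(0.866284, 184.766, 141.94, 497.723, 416.012),
    Box(0.854519, 184.601, 141.577, 499.699, 415.742),
    Box(0.856617, 185.726, 141.448, 499.246, 415.667),
    Box(0.770557, 184.458, 142.046, 498.037, 416.368),
    Box(0.752778, 184.416, 141.868, 499.955, 416.512),
    Box(0.725658, 185.231, 141.948, 499.762, 416.444),
    Box(0.23098, 184.02, 141.643, 500.4, 416.408),
]

if __name__ == "__main__":
    for box in nms(PROPOSALS, iou_threshold=0.5):
        print(box)
```

All 11 proposals overlap the top-scoring one far above the 0.5 threshold, so this leaves a single box with confidence 0.886627 rather than the 0.770557 box reported in the log.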

The DeepStream version I'm using is 6.2; I will test this in newer versions too.

EDIT: It seems to be fixed after upgrading to DeepStream 6.3: now all the detections are found if NMS is disabled, and only the correct maximum-confidence detection is found when using NMS.