ultralytics / ultralytics

Ultralytics YOLO11 🚀
https://docs.ultralytics.com
GNU Affero General Public License v3.0

YOLOv11 vs SSD performance on 160x120 infrared images #18982

Open BigMuscle85 opened 6 days ago

BigMuscle85 commented 6 days ago

Question

Hello, in our previous project, we successfully implemented object detection on images from a 160x120 infrared camera on a Raspberry Pi 4. We used a SqueezeNet-SSD network trained with the CaffeSSD framework. Its performance was very good for our needs (about 60 ms per frame), with excellent accuracy on normal-sized objects (confidence mostly over 90%) but lower accuracy on smaller objects (mostly detected correctly, but with a very low confidence of around 30%).

Later, we stripped SqueezeNet's fire modules down to simple squeeze-expand blocks, added feature fusion for the first SSD layer, and modified the prior-box ratios to match our dataset. We reached a detection speed of about 30 ms per frame and excellent accuracy for all objects.

In our upcoming project, we are continuing with a similar task, but we would like to use a more modern approach because the Caffe framework has not been maintained for years. We're experimenting with the Ultralytics framework, and it looks very modern to us. We're also thinking about switching to a Raspberry Pi 5, maybe with a Hailo8 kit, which is not supported by CaffeSSD, so Ultralytics seems to be a good way to go.

Our dataset consists of 5000 grayscale training images and 1000 testing images with a resolution of 160x120. Many augmented versions of each training image were added to the training dataset, so it has over 40000 images. We identify 5 types of objects, for example face (about 64x100) and eyes (45x10). It's exactly the same dataset that was used for training our SSD networks. We have now trained several versions of YOLOv11 with a batch size of 128 for 300 epochs. The results are good, but not as good as our original SSD network. Here, I would like to share our benchmarks with others:

Detection speed

Model               RPi5         Rock 4B+     RPi4         RPi 5 + Hailo 8
--------------------------------------------------------------------------
SEnet-FSSD-160x120     7.542 ms    27.478 ms    29.263 ms  -
SqueezeNet-SSD        10.074 ms    32.615 ms    38.491 ms  -
Yolo11n-160           12.317 ms    49.212 ms    45.283 ms     4.252 ms
Yolo11n-320           48.207 ms   177.076 ms   178.268 ms     7.236 ms
Yolo11s-160           30.835 ms   129.767 ms   127.677 ms    10.999 ms
Yolo11m-320          313.738 ms  1121.319 ms  1180.839 ms    24.829 ms

As you can see, even the nano version of YOLOv11 is much slower than the original SqueezeNet-SSD. Although we would prefer better times, it is still usable for our needs, especially when we're thinking about Hailo8.

Detection accuracy

I don't have specific objective statistics here, but it is worse just visually. Even the yolo11m-320 version provides worse results. The rectangles are not as exact, the confidences are lower, and there is a slightly higher number of false results. Just for illustration, on 1000 validation images (mean wIoU is the average IoU over all detections with a threshold of 50, weighted by the confidence):

SEnet-FSSD-160x120 - total detections: 2014, false positives: 6, false negatives: 9, mean wIoU: 0.892
SqueezeNet-SSD - total detections: 2029, false positives: 59, false negatives: 47, mean wIoU: 0.855
Yolo11n-160 - total detections: 2027, false positives: 28, false negatives: 18, mean wIoU: 0.851
Yolo11s-160 - total detections: 2023, false positives: 26, false negatives: 20, mean wIoU: 0.859
Yolo11m-320 - total detections: 2078, false positives: 71, false negatives: 12, mean wIoU: 0.845
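
For readers who want to reproduce a comparable metric, here is a minimal sketch of a confidence-weighted mean IoU. The post does not show the actual implementation, so several things are assumptions: the function names are hypothetical, boxes are taken to be (x1, y1, x2, y2), each detection is assumed to be already matched to a ground-truth box, and the "threshold of 50" is read as a confidence cut-off of 0.5.

```python
def iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def mean_weighted_iou(matches, conf_threshold=0.5):
    """matches: list of (pred_box, confidence, gt_box) tuples.

    Assumption: detections below conf_threshold are ignored, and the
    remaining IoU values are averaged with the confidence as the weight.
    """
    num = den = 0.0
    for pred, conf, gt in matches:
        if conf < conf_threshold:
            continue
        num += conf * iou(pred, gt)
        den += conf
    return num / den if den > 0 else 0.0


# Hypothetical matches, not taken from the dataset in this thread
matches = [
    ((0, 0, 10, 10), 0.9, (0, 0, 10, 10)),  # perfect overlap
    ((0, 0, 10, 10), 0.6, (0, 0, 10, 5)),   # half overlap
]
print(mean_weighted_iou(matches))
```

With this weighting, a model that localizes equally well but reports lower confidence on its well-placed boxes gets a lower score, so the metric mixes box quality and confidence calibration.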

$ ./test image.png senet-fssd-160x120
0: class = 3, confidence = 1.000000, [61, 10, 124, 107]
1: class = 1, confidence = 0.999990, [72, 49, 115, 57]

$ ./test image.png squeezenet-ssd-160x120
0: class = 3, confidence = 1.000000, [61, 10, 123, 108]
1: class = 1, confidence = 0.774772, [72, 49, 114, 57]

$ ./test image.png yolo11n-160.onnx
0: class = 3, confidence = 0.920182, [60, 11, 123, 99]
1: class = 1, confidence = 0.766865, [71, 49, 115, 57]

$ ./test image.png yolo11m-320.onnx
0: class = 3, confidence = 0.895741, [61, 12, 123, 103]
1: class = 1, confidence = 0.745349, [72, 50, 115, 56]

Maybe the problem lies in the training hyperparameters. We just set the batch size to 128 and the number of epochs to 300. I would appreciate any ideas. Thank you!

Meanwhile, I've been trying to reproduce our SEnet-FSSD using a YAML model in Ultralytics. I don't know if it is a good idea; I would just like to see if it changes anything. I made a pure copy of our network, but it is not possible to train it because of a layer size mismatch. MaxPool2d layers don't seem to downscale the resolution in the same way as in the Caffe framework. There is also no Eltwise (element-wise sum) layer, so I had to change it to a Concat layer. Adding padding=1 to all MaxPool2d layers works, but it automatically changes the input resolution to 192. However, the results are practically the same as those of the other YOLO models and not those of our original network.
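
One likely source of the size mismatch is rounding: Caffe rounds pooling output sizes up, while torch.nn.MaxPool2d rounds down unless ceil_mode=True is passed. The sketch below compares both rules for a kernel of 3 and a stride of 2; the helper name pool_out is hypothetical and the feature-map lengths are only illustrative.

```python
import math


def pool_out(n, kernel, stride, pad=0, ceil_mode=False):
    """Output length of a pooling layer along one axis of length n."""
    span = n + 2 * pad - kernel
    if ceil_mode:
        out = math.ceil(span / stride) + 1
        # The last window must start inside the input or the left padding
        if (out - 1) * stride >= n + pad:
            out -= 1
    else:
        out = span // stride + 1
    return out


# Columns: input length, floor rounding, ceil rounding, floor with padding=1
for n in (79, 14, 4):
    print(n, pool_out(n, 3, 2), pool_out(n, 3, 2, ceil_mode=True), pool_out(n, 3, 2, pad=1))
```

For even lengths such as 14 or 4, floor rounding gives one element fewer than ceil rounding, and padding=1 happens to match the ceil result. For an odd length such as 79, padding=1 yields 40 where both other rules give 39, so padding is not an exact substitute for ceil_mode=True.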

Here is the model that is a 1:1 rewrite of our SSD network. Maybe someone will be able to fix it:

nc: 5
activation: nn.ReLU6()

backbone:
  - [-1, 1, Conv, [64, 3, 2, 0]] # 0,conv1
  - [-1, 1, nn.MaxPool2d, [3, 2]] # 1,pool1

  - [-1, 1, Conv, [32, 1, 1, 0] ]  # 2,fire2 squeeze
  - [-1, 1, Conv, [64, 3, 1, 1] ]  # 3,fire2 expand
  - [-1, 1, Conv, [32, 1, 1, 0] ]  # 4,fire3 squeeze
  - [-1, 1, Conv, [64, 3, 1, 1] ]  # 5,fire3 expand

  - [-1, 1, nn.MaxPool2d, [3, 2]] # 6,pool3

  - [-1, 1, Conv, [64, 1, 1, 0] ]  # 7,fire4 squeeze
  - [-1, 1, Conv, [128, 3, 1, 1] ]  # 8,fire4 expand
  - [-1, 1, Conv, [64, 1, 1, 0] ]  # 9,fire5 squeeze
  - [-1, 1, Conv, [128, 3, 1, 1] ]  # 10,fire5 expand

  - [-1, 1, nn.MaxPool2d, [3, 2]] # 11,pool5

  - [-1, 1, Conv, [96, 1, 1, 0] ]  # 12,fire6 squeeze
  - [-1, 1, Conv, [192, 3, 1, 1] ]  # 13,fire6 expand
  - [-1, 1, Conv, [96, 1, 1, 0] ]  # 14,fire7 squeeze
  - [-1, 1, Conv, [192, 3, 1, 1] ]  # 15,fire7 expand

  - [-1, 1, Conv, [64, 1, 1, 0] ]  # 16,fire8 squeeze
  - [-1, 1, Conv, [256, 3, 1, 1] ]  # 17,fire8 expand
  - [-1, 1, Conv, [64, 1, 1, 0] ]  # 18,fire9 squeeze
  - [-1, 1, Conv, [256, 3, 1, 1] ]  # 19,fire9 expand        

head:

  - [-1, 1, nn.MaxPool2d, [3, 2]] # 20,pool9

  - [-1, 1, Conv, [96, 1, 1, 0] ]  # 21,fire10 squeeze
  - [-1, 1, Conv, [384, 3, 1, 1] ]  # 22,fire10 expand
  - [-1, 1, nn.MaxPool2d, [3, 2]] # 23,pool10

  - [-1, 1, Conv, [64, 1, 1, 0] ]  # 24,fire11 squeeze
  - [-1, 1, Conv, [256, 3, 1, 1] ]  # 25,fire11 expand

  #  feature-fusion layers
  - [19, 1, Conv, [128, 1, 1, 0] ]  # 26
  - [-1, 1, nn.Upsample, [None, 2, "nearest"]] # 27

  - [22, 1, Conv, [128, 1, 1, 0] ]  # 28
  - [-1, 1, nn.Upsample, [None, 4, "nearest"]] # 29

  - [[29, 27, 10], 1, Concat, [1]] # 30
  - [-1, 1, nn.BatchNorm2d, []] # 31
  - [-1, 1, Conv, [64, 1, 1, 0] ]  # 32
  - [-1, 1, Conv, [128, 3, 1, 1] ]  # 33

  - [[25, 22, 19, 33], 1, Detect, [nc]]

Additional

EDIT: I forgot to mention that all detections are done using OpenCV in C++.

UltralyticsAssistant commented 6 days ago

👋 Hello @BigMuscle85, thank you for sharing your detailed observations and results regarding YOLOv11 vs SSD performance 🚀!

We appreciate your enthusiastic exploration of the Ultralytics framework and the comprehensive benchmarks you've provided. Diving into on-device performance, especially on platforms like Raspberry Pi and with datasets such as yours, can provide incredibly valuable insights for the community.

If this is a ❓ Question or a request for help with training or custom model implementations, please ensure you've included all relevant details, particularly the full training configuration, logs, and any relevant dataset information.

For 🐛 Bug Reports, please provide a minimum reproducible example to help us replicate and debug the issue — this could include your exact training commands, dataset structure, and any modified configuration files.

Some Suggestions:

  1. Hyperparameters: Adjusting training hyperparameters (e.g., learning rate, image sizes, augmentation settings, etc.) might help improve YOLO's accuracy in your case. I recommend checking out our Tips for Best Training Results.
  2. Custom Architectures: Regarding your YAML implementation of SSD, some manual modifications may be necessary to address layer size mismatches. For guidance on building custom models within Ultralytics, you can refer to our Documentation.
  3. Speed Optimization: Performance on embedded devices can vary, and using the correct export options (e.g., ONNX or TensorRT) or accelerator libraries could help improve your speeds further. Please share more details if you are optimizing for a specific inference runtime.

Upgrade Instructions

Ensure that you’re working with the latest ultralytics package as updates often bring fixes and optimizations. Upgrade your package as follows:

pip install -U ultralytics

YOLO requires an environment with Python>=3.8 and PyTorch>=1.8. Make sure your environment is set up correctly and up-to-date.

This is an automated response 🤖, and an Ultralytics engineer will assist you further with your query as soon as possible. Thank you for being part of the Ultralytics community 🌟!

Y-T-G commented 6 days ago

Formats other than ONNX get better latencies than ONNX on Raspberry Pi.

https://docs.ultralytics.com/guides/raspberry-pi/

And is the accuracy and confidence you're reporting from Ultralytics inference? Or from your custom pipeline?

geoxpert0001 commented 6 days ago

When I trained, imgsz was set to 640. I previously also had dataset images that were only 100 pixels in size, but keeping the setting at 640 gave better results than other sizes. You can try it and see how the accuracy turns out.

BigMuscle85 commented 6 days ago

Formats other than ONNX get better latencies than ONNX on Raspberry Pi.

https://docs.ultralytics.com/guides/raspberry-pi/

And is the accuracy and confidence you're reporting from Ultralytics inference? Or from your custom pipeline?

Good point! My results come from a custom C++ application that uses OpenCV for inference on the exported ONNX file. However, I like that Ultralytics training generates images of the results (val_batch0_pred.jpg etc.), and they are very consistent with what I see in the C++ test application. Also, the results for C++ with ONNX are practically the same as the Python predictions for the "best.pt" model.

As for detection speed, I guess the main difference may be that our original network was trained at a resolution of 160x120, but YOLO accepts square sizes only, thus 160x160, which is about 33% more pixels.
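
The pixel-count difference between the two input sizes works out as follows; the variable names are illustrative only.

```python
native = 160 * 120  # pixels in the original 160x120 frame
square = 160 * 160  # pixels in the 160x160 network input

print(native, square, f"{square / native - 1:.0%}")
```

Pixel count is only a rough proxy for inference time, since the two architectures also differ in depth and layer types.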

@geoxpert0001: As you can see, I also tested models with the input resized to 320x320, and there is no improvement.

Y-T-G commented 6 days ago

So are you using Letterbox padding for resizing with a border of color (114,114,114)?

Y-T-G commented 6 days ago

What are the hyperparameters you used to train your original network?

BigMuscle85 commented 6 days ago

So are you using Letterbox padding for resizing with a border of color (114,114,114)?

If Ultralytics does this by default then probably yes. If I set rect=True and set imgsz=[160,120], it makes no difference. I get the following:

WARNING ⚠️ updating to 'imgsz=160'. 'train' and 'val' imgsz must be an integer, while 'predict' and 'export' imgsz may be a [h, w] list or an integer, i.e. 'yolo export imgsz=640,480' or 'yolo export imgsz=640'

Also, when I train the mentioned 1:1 rewrite, I get this

WARNING ⚠️ imgsz=[160] must be multiple of max stride 64, updating to [192]

The solver for the original network uses the RMSProp optimizer with polynomial decay. When I set optimizer="rmsprop" in Ultralytics, it does not lead to usable results.

type: "RMSProp"
base_lr: 0.0005
max_iter: 500000
lr_policy: "poly"
power: 0.75
batch_size: 128
iter_size: 4
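
For reference, Caffe's "poly" policy scales the base learning rate by (1 - iter / max_iter) ^ power. A minimal sketch using the values from the solver above (the function name poly_lr is hypothetical):

```python
def poly_lr(iteration, base_lr=0.0005, max_iter=500000, power=0.75):
    """Caffe-style "poly" schedule: base_lr * (1 - iter / max_iter) ** power."""
    return base_lr * (1.0 - iteration / max_iter) ** power


# Learning rate at the start, halfway through, and at the end of training
for it in (0, 250000, 500000):
    print(it, poly_lr(it))
```

The rate decays smoothly to zero at max_iter, which differs from the default Ultralytics schedule, so matching only the optimizer name and the starting learning rate does not reproduce the original training setup.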

Y-T-G commented 6 days ago

If Ultralytics does this by default then probably yes. If I set rect=True and set imgsz=[160,120], it makes no difference. I get the following:

I meant in your C++ pipeline

BigMuscle85 commented 6 days ago

Ah, no. Caffe is able to train directly on non-square images. And for inference on ONNX, I resize to 160x160.

Y-T-G commented 6 days ago

Are your images square? Because if they're not, you need to apply LetterBox padding like I mentioned in the preprocessing

Y-T-G commented 6 days ago

Ultralytics also has augmentations enabled by default. You can try turning them off because it would probably distort your augmented images by augmenting them again.

BigMuscle85 commented 6 days ago

Are your images square? Because if they're not, you need to apply LetterBox padding like I mentioned in the preprocessing

If I understand correctly, Ultralytics pads non-square images rather than stretching them to a square size? That may really be one of the problems, because I just tested that predicting on a padded, non-resized image gives better predictions: confidences are still lower, but the boxes are now consistent with the original network. Thanks!

Y-T-G commented 6 days ago

It pads with 114 and resizes with bilinear interpolation. You need to use the same preprocessing, which is why I asked whether you're reporting results from the official Ultralytics inference.
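
A minimal sketch of the letterbox geometry for a custom pipeline, assuming centered padding. It only computes the scale and padding and maps boxes back; it does not touch pixels. The function names are hypothetical, and the exact rounding in Ultralytics may differ by a pixel when the padding is odd.

```python
def letterbox_params(src_w, src_h, dst_w, dst_h):
    """Return (scale, new_w, new_h, pad_left, pad_top) for an aspect-preserving fit."""
    scale = min(dst_w / src_w, dst_h / src_h)
    new_w, new_h = round(src_w * scale), round(src_h * scale)
    # The remaining border is filled with the constant value 114
    pad_left = (dst_w - new_w) // 2
    pad_top = (dst_h - new_h) // 2
    return scale, new_w, new_h, pad_left, pad_top


def unletterbox_box(box, scale, pad_left, pad_top):
    """Map an (x1, y1, x2, y2) box from network-input coordinates to the source image."""
    x1, y1, x2, y2 = box
    return ((x1 - pad_left) / scale, (y1 - pad_top) / scale,
            (x2 - pad_left) / scale, (y2 - pad_top) / scale)


scale, new_w, new_h, pad_left, pad_top = letterbox_params(160, 120, 160, 160)
print(scale, new_w, new_h, pad_left, pad_top)
```

For a 160x120 frame and a 160x160 input, the image is not scaled at all and receives 20 rows of border at the top and bottom. A plain resize would instead stretch the image vertically by a factor of 160/120, which the network never saw during training. In OpenCV, the border can be added with copyMakeBorder using BORDER_CONSTANT and a value of 114.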

The normalization is also different from SSD.

BigMuscle85 commented 5 days ago

Yes, we use scale 1.0 and subtract mean 115 in our SSD model. Then inference uses Mat inputBlob = dnn::blobFromImage(image, 1.0, cv::Size(netInputWidth, netInputHeight), cv::Scalar(115), false, false);

For Ultralytics, we use Mat blob = dnn::blobFromImage(image, 1.0 / 255.0, Size(netInputWidth, netInputHeight), Scalar(), false, false); as I guess no mean is subtracted and only 1/255 scale is used.
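
The two blobFromImage calls above normalize pixels very differently, since blobFromImage subtracts the mean first and then multiplies by the scale factor. A small sketch of the per-pixel arithmetic (the function names are hypothetical):

```python
def ssd_normalize(pixel, mean=115.0, scale=1.0):
    """Mean subtraction followed by scaling, as in blobFromImage(image, 1.0, ..., Scalar(115))."""
    return scale * (pixel - mean)


def yolo_normalize(pixel):
    """Scale 8-bit intensities to [0, 1] with no mean subtraction."""
    return pixel / 255.0


# Compare the resulting input ranges for black, mid-gray and white pixels
for p in (0, 115, 255):
    print(p, ssd_normalize(p), yolo_normalize(p))
```

The SSD input therefore spans roughly -115 to 140, while the YOLO input spans 0 to 1, so feeding one network with the other's preprocessing would give meaningless results.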

Correctly padding the images with (114,114,114) before resizing fixed the problem with misaligned boxes. The weighted mean IoU is now higher. However, confidences are still lower.

SEnet-FSSD-160x120 - total detections: 2014, false positives: 6, false negatives: 9, mean wIoU: 0.892
Yolo11n-160 - total detections: 2011, false positives: 30, false negatives: 36, mean wIoU: 0.885

In our SSD model, we completely disabled data augmentation because we have our own augmented images. I tried disabling augmentation for YOLO11n, but the result is much worse.

Yolo11n-160-noaugment - total detections: 1934, false positives: 19, false negatives: 102, mean wIoU: 0.854

results = model.train(name=name, data="yolo_dataset/dataset.yaml", imgsz=160, epochs=300, cache="ram", batch=128, 
    hsv_h=0.0,  # hue
    hsv_s=0.0,  # saturation
    hsv_v=0.0,  # value
    degrees=0.0,  # rotation
    translate=0.0,  # translate
    scale=0.0,  # scale
    shear=0.0,  # shear
    perspective=0.0,  # perspective
    flipud=0.0,  # flip up-down
    fliplr=0.0,  # flip left-right
    mosaic=False,  # mosaic
    mixup=False  # mixup                      
)

I also tested the "rmsprop" optimizer with the same starting learning rate of 0.0005 as for SSD, but the result is unusable (the training loss decreases, but mAP50 stays near zero the whole time).

geoxpert0001 commented 5 days ago

You misunderstood me. I wasn't telling you to resize the photos; you should keep the original 160-pixel size. I was referring to the training parameter, specifically the yolo task=detect mode=train imgsz=640 setting, not the imgsz=160 setting.

BigMuscle85 commented 5 days ago

You misunderstood me. I wasn't telling you to resize the photos; you should keep the original 160-pixel size. I was referring to the training parameter, specifically the yolo task=detect mode=train imgsz=640 setting, not the imgsz=160 setting.

The 320 networks were trained on exactly the same dataset with a resolution of 160x120. The only parameter changed was imgsz=320.

glenn-jocher commented 4 days ago

@BigMuscle85 thanks for clarifying. For best results with YOLO models, we recommend using the native imgsz parameter that matches your input resolution (160x120) without upscaling, as forcing higher resolutions can degrade performance. The Ultralytics preprocessing pipeline automatically handles letterboxing and normalization (scale=1/255, no mean subtraction), which differs from SSD's 115 mean subtraction. To improve confidence scores, consider retaining YOLO's default augmentations and optimizer settings, as they're optimized for the architecture. For deployment, TensorRT often provides better latency than ONNX on edge devices.

BigMuscle85 commented 3 days ago

Thank you. However, the default settings lead to worse results than the original SSD network, where it's usual to have a confidence of 1.0, while YOLO hardly reaches 0.9. Do I understand correctly that the "cls" parameter should emphasize object classification and thus result in better confidences? Or does it influence class identification only?

This is the result of Yolo11n training with default parameters for 300 epochs with batch size of 128:

Image

I tried implementing an Eltwise layer in Ultralytics to have an exact copy of our SSD network:

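import torch
import torch.nn as nn


class Add(nn.Module):
    """Element-wise sum of a list of equally shaped tensors (like Caffe's Eltwise SUM)."""

    def forward(self, x):
        # x is the list of input tensors from the layers given in the YAML "from" field
        return torch.sum(torch.stack(x), dim=0)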

I don't understand a few things. Why does Ultralytics change the input size to be a multiple of the product of all strides? The included MaxPool2d layers (kernel=3, stride=2) should downsample the input from 160 to 1x1 in the last layer.

WARNING ⚠️ imgsz=[160] must be multiple of max stride 64, updating to [192]

Our network is based on SqueezeNet, which is intended to be a light network with a small number of parameters. Our SSD implementation has 1.92M parameters in total. But the Ultralytics implementation results in a huge number of parameters:

                   from  n    params  module                                       arguments                     
  0                  -1  1      1856  ultralytics.nn.modules.conv.Conv             [3, 64, 3, 2, 0]              
  1                  -1  1         0  torch.nn.modules.pooling.MaxPool2d           [3, 2, 0, 1, False, True]     
  2                  -1  1      2112  ultralytics.nn.modules.conv.Conv             [64, 32, 1, 1, 0]             
  3                  -1  1     18560  ultralytics.nn.modules.conv.Conv             [32, 64, 3, 1, 1]             
  4                  -1  1      2112  ultralytics.nn.modules.conv.Conv             [64, 32, 1, 1, 0]             
  5                  -1  1     18560  ultralytics.nn.modules.conv.Conv             [32, 64, 3, 1, 1]             
  6            [-1, -3]  1         0  ultralytics.nn.modules.conv.Add              []                            
  7                  -1  1         0  torch.nn.modules.pooling.MaxPool2d           [3, 2, 1]                     
  8                  -1  1      4224  ultralytics.nn.modules.conv.Conv             [64, 64, 1, 1, 0]             
  9                  -1  1     73984  ultralytics.nn.modules.conv.Conv             [64, 128, 3, 1, 1]            
 10                  -1  1      8320  ultralytics.nn.modules.conv.Conv             [128, 64, 1, 1, 0]            
 11                  -1  1     73984  ultralytics.nn.modules.conv.Conv             [64, 128, 3, 1, 2, 1, 2]      
 12            [-1, -3]  1         0  ultralytics.nn.modules.conv.Add              []                            
 13                  -1  1         0  torch.nn.modules.pooling.MaxPool2d           [3, 2, 0, 1, False, True]     
 14                  -1  1     12480  ultralytics.nn.modules.conv.Conv             [128, 96, 1, 1, 0]            
 15                  -1  1    166272  ultralytics.nn.modules.conv.Conv             [96, 192, 3, 1, 1]            
 16                  -1  1     18624  ultralytics.nn.modules.conv.Conv             [192, 96, 1, 1, 0]            
 17                  -1  1    166272  ultralytics.nn.modules.conv.Conv             [96, 192, 3, 1, 1]            
 18            [-1, -3]  1         0  ultralytics.nn.modules.conv.Add              []                            
 19                  -1  1     12416  ultralytics.nn.modules.conv.Conv             [192, 64, 1, 1, 0]            
 20                  -1  1    147968  ultralytics.nn.modules.conv.Conv             [64, 256, 3, 1, 1]            
 21                  -1  1     16512  ultralytics.nn.modules.conv.Conv             [256, 64, 1, 1, 0]            
 22                  -1  1    147968  ultralytics.nn.modules.conv.Conv             [64, 256, 3, 1, 1]            
 23            [-1, -3]  1         0  ultralytics.nn.modules.conv.Add              []                            
 24                  -1  1         0  torch.nn.modules.pooling.MaxPool2d           [3, 2, 1]                     
 25                  -1  1     24768  ultralytics.nn.modules.conv.Conv             [256, 96, 1, 1, 0]            
 26                  -1  1    332544  ultralytics.nn.modules.conv.Conv             [96, 384, 3, 1, 1]            
 27                  -1  1         0  torch.nn.modules.pooling.MaxPool2d           [3, 2, 0, 1, False, True]     
 28                  -1  1     24704  ultralytics.nn.modules.conv.Conv             [384, 64, 1, 1, 0]            
 29                  -1  1    147968  ultralytics.nn.modules.conv.Conv             [64, 256, 3, 1, 1]            
 30                  -1  1         0  torch.nn.modules.pooling.AvgPool2d           [3, 1, 1, True]               
 31                  23  1     33024  ultralytics.nn.modules.conv.Conv             [256, 128, 1, 1, 0]           
 32                  -1  1         0  torch.nn.modules.upsampling.Upsample         [None, 2, 'bilinear']         
 33                  26  1     49408  ultralytics.nn.modules.conv.Conv             [384, 128, 1, 1, 0]           
 34                  -1  1         0  torch.nn.modules.upsampling.Upsample         [None, 4, 'bilinear']         
 35        [12, 32, 34]  1         0  ultralytics.nn.modules.conv.Add              []                            
 36                  -1  1       256  torch.nn.modules.batchnorm.BatchNorm2d       [128]                         
 37                  -1  1      8320  ultralytics.nn.modules.conv.Conv             [128, 64, 1, 1, 0]            
 38                  -1  1     73984  ultralytics.nn.modules.conv.Conv             [64, 128, 3, 1, 1]            
 39[23, 26, 29, 30, 38]  1   6853481  ultralytics.nn.modules.head.Detect           [5, [256, 384, 256, 256, 128]]
squeezenet summary: 177 layers, 8,440,681 parameters, 8,440,665 gradients, 27.3 GFLOPs
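
A quick check of where the parameters in the summary above sit. The per-layer counts are consistent with a bias-free convolution followed by BatchNorm (weight and bias per output channel); that reading, and the helper name conv_bn_params, are assumptions used only for this arithmetic.

```python
def conv_bn_params(c_in, c_out, kernel):
    """Parameters of a bias-free k x k convolution plus BatchNorm weight and bias."""
    return c_in * c_out * kernel * kernel + 2 * c_out


# Totals copied from the summary table
total_params = 8_440_681
detect_params = 6_853_481

print(conv_bn_params(3, 64, 3), conv_bn_params(64, 32, 1))
print(total_params - detect_params, f"{detect_params / total_params:.1%}")
```

The two sample layers reproduce the 1856 and 2112 entries from the table. The Detect head alone accounts for about 81% of the total, while everything else sums to roughly 1.59M parameters, so the growth comes from the YOLO detection head rather than from the SqueezeNet-style layers.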

I trained this network for 1500 epochs with the Adam optimizer and lr0=0.0005, but it tends to overfit after about 200 epochs. And the results at the best epoch are similar to Yolo11n and not to our original network.

Image

glenn-jocher commented 3 days ago

@BigMuscle85 the cls parameter only affects how the classification head is defined and does not boost the confidence scores—in our training, it’s strictly for setting class output. Also, the input dimensions are automatically adjusted to be multiples of the maximum stride to ensure proper downsampling through the network layers; this is necessary to avoid misalignment in feature maps.
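
The adjustment reported in the warning is consistent with rounding the requested size up to the next multiple of the maximum stride. A minimal sketch of that rounding (the helper name is hypothetical and this is not the library's own code):

```python
import math


def round_up_to_stride(imgsz, stride):
    """Smallest multiple of stride that is greater than or equal to imgsz."""
    return math.ceil(imgsz / stride) * stride


# Max stride 64 (the custom network) versus max stride 32 (standard YOLO11 models)
print(round_up_to_stride(160, 64), round_up_to_stride(160, 32))
```

With a maximum stride of 64, a request for 160 becomes 192, while a maximum stride of 32 leaves 160 unchanged, which is why the stock Yolo11n-160 model trains at 160 but the custom network does not.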