BigMuscle85 opened this issue 6 days ago
👋 Hello @BigMuscle85, thank you for sharing your detailed observations and results regarding YOLOv11 vs SSD performance 🚀!
We appreciate your enthusiastic exploration of the Ultralytics framework and the comprehensive benchmarks you've provided. Diving into on-device performance, especially on platforms like Raspberry Pi and with datasets such as yours, can provide incredibly valuable insights for the community.
If this is a ❓ Question or a request for help with training or custom model implementations, please ensure you've included all relevant details, particularly the full training configuration, logs, and any relevant dataset information.
For 🐛 Bug Reports, please provide a minimum reproducible example to help us replicate and debug the issue — this could include your exact training commands, dataset structure, and any modified configuration files.
Feel free to explore the broader Ultralytics community for more feedback. Ensure that you're working with the latest ultralytics package, as updates often bring fixes and optimizations. Upgrade your package as follows:
pip install -U ultralytics
YOLO requires an environment with Python>=3.8 and PyTorch>=1.8. Make sure your environment is set up correctly and up-to-date.
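For instance, the built-in environment check prints the detected Ultralytics, Python, and PyTorch versions:

import ultralytics
ultralytics.checks()  # reports versions plus CUDA, RAM, and disk info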
You can run and train YOLO models in several verified environments.
If the Ultralytics CI badge is green, all Ultralytics CI tests are passing, ensuring the framework is working as expected.
This is an automated response 🤖, and an Ultralytics engineer will assist you further with your query as soon as possible. Thank you for being part of the Ultralytics community 🌟!
Export formats other than ONNX achieve better latency on Raspberry Pi.
https://docs.ultralytics.com/guides/raspberry-pi/
And is the accuracy and confidence you're reporting from Ultralytics inference, or from your custom pipeline?
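For example, the Raspberry Pi guide benchmarks NCNN as one of the fastest formats there; a minimal export sketch (weights filename assumed):

from ultralytics import YOLO

model = YOLO("best.pt")                  # your trained weights
model.export(format="ncnn", imgsz=160)   # NCNN generally benchmarks well on Raspberry Pi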
When I trained, imgsz was set to 640. Previously, I also had dataset images that were only 100 pixels in size, but keeping the setting at 640 gave better results than other sizes. You can try it and see how the accuracy turns out.
> Export formats other than ONNX achieve better latency on Raspberry Pi.
> https://docs.ultralytics.com/guides/raspberry-pi/
> And is the accuracy and confidence you're reporting from Ultralytics inference, or from your custom pipeline?
Good point! My results come from a custom C++ application that uses OpenCV for inference on the exported ONNX file. However, I like that Ultralytics training generates result images (val_batch0_pred.jpg etc.), and they are very consistent with what I see in the C++ test application. The C++/ONNX results are also practically the same as Python prediction with the "best.pt" model.
Regarding detection speed, I guess the main difference may be that our original network was trained at a resolution of 160x120, while YOLO accepts square inputs only, i.e. 160x160, which is about 30% more pixels.
@geoxpert0001: As you can see, I also tested the models at 320x320 and there was no improvement.
So are you using LetterBox padding when resizing, with a border color of (114,114,114)?
What are the hyperparameters you used to train your original network?
> So are you using LetterBox padding when resizing, with a border color of (114,114,114)?
If Ultralytics does this by default, then probably yes. If I set rect=True and imgsz=[160,120], it makes no difference. I get the following:
WARNING ⚠️ updating to 'imgsz=160'. 'train' and 'val' imgsz must be an integer, while 'predict' and 'export' imgsz may be a [h, w] list or an integer, i.e. 'yolo export imgsz=640,480' or 'yolo export imgsz=640'
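So a rectangular size is only accepted at predict/export time, e.g. (a sketch; [h, w] order per the warning, and sizes still get rounded up to a stride multiple):

from ultralytics import YOLO

model = YOLO("best.pt")
model.export(format="onnx", imgsz=[120, 160])  # h=120, w=160; train/val still need a single int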
Also, when I train the 1:1 rewrite I mentioned, I get this:
WARNING ⚠️ imgsz=[160] must be multiple of max stride 64, updating to [192]
The solver for the original network uses the RMSProp optimizer with polynomial decay. When I set optimizer="rmsprop" in Ultralytics, it does not lead to usable results.
type: "RMSProp"
base_lr: 0.0005
max_iter: 500000
lr_policy: "poly"
power: 0.75
batch_size: 128
iter_size: 4
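For reference, a rough Ultralytics equivalent of that solver might look like the sketch below. As far as I know, Ultralytics has no polynomial LR policy, so lrf only approximates the decay; the dataset path is taken from the thread:

from ultralytics import YOLO

model = YOLO("yolo11n.pt")
model.train(
    data="yolo_dataset/dataset.yaml",
    optimizer="RMSProp",  # one of the supported optimizer names
    lr0=0.0005,           # base_lr from the Caffe solver
    lrf=0.01,             # final LR fraction; not the same as lr_policy "poly", power 0.75
    batch=128,
    imgsz=160,
)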
> If Ultralytics does this by default, then probably yes. If I set rect=True and imgsz=[160,120], it makes no difference. I get the following:
I meant in your C++ pipeline.
Ah, no. Caffe can train directly on non-square images. And for inference on the ONNX model, I resize to 160x160.
Are your images square? Because if they're not, you need to apply LetterBox padding in the preprocessing, like I mentioned.
Ultralytics also has augmentations enabled by default. You can try turning them off, because it is probably distorting your pre-augmented images by augmenting them again.
> Are your images square? Because if they're not, you need to apply LetterBox padding in the preprocessing, like I mentioned.
If I understand correctly, Ultralytics pads non-square images rather than stretching them to the square size? That may really be one of the problems, because I just tested that predicting on a non-resized padded image gives better predictions: confidences are still lower, but the boxes are now consistent with the original network. Thanks!
It's padding with 114 and resizing with bilinear interpolation. You need to use the same preprocessing, which is why I asked whether you're reporting results from the official Ultralytics inference.
The normalization is also different from SSD.
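In other words, something like the following (a Python/OpenCV sketch of LetterBox-style preprocessing, not the exact Ultralytics implementation):

import cv2
import numpy as np

def letterbox(img, size=160, pad=114):
    # scale the longer side to `size`, keeping the aspect ratio
    h, w = img.shape[:2]
    r = size / max(h, w)
    nh, nw = round(h * r), round(w * r)
    img = cv2.resize(img, (nw, nh), interpolation=cv2.INTER_LINEAR)  # bilinear
    # pad the remainder with gray (114,114,114)
    top, left = (size - nh) // 2, (size - nw) // 2
    img = cv2.copyMakeBorder(img, top, size - nh - top, left, size - nw - left,
                             cv2.BORDER_CONSTANT, value=(pad, pad, pad))
    return img.astype(np.float32) / 255.0  # scale only; no mean subtraction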
Yes, we use scale 1.0 and subtract a mean of 115 in our SSD model. Then inference uses:
Mat inputBlob = dnn::blobFromImage(image, 1.0, cv::Size(netInputWidth, netInputHeight), cv::Scalar(115), false, false); // scale 1.0, mean 115, no RB swap, no crop
For Ultralytics, we use:
Mat blob = dnn::blobFromImage(image, 1.0 / 255.0, Size(netInputWidth, netInputHeight), Scalar(), false, false); // scale 1/255, zero mean, no RB swap, no crop
since, as I understand it, no mean is subtracted and only a 1/255 scale is applied.
Correctly padding the images with (114,114,114) before resizing fixed the problem with misaligned boxes. The weighted mean IoU is now higher; however, confidences are still lower.
SEnet-FSSD-160x120 - total detections: 2014, false positives: 6, false negatives: 9, mean wIoU: 0.892
Yolo11n-160 - total detections: 2011, false positives: 30, false negatives: 36, mean wIoU: 0.885
In our SSD model, we completely disabled data augmentation because we have our own augmented images. I tried disabling augmentation for YOLO11n, but the result is much worse.
Yolo11n-160-noaugment - total detections: 1934, false positives: 19, false negatives: 102, mean wIoU: 0.854
results = model.train(
    name=name,
    data="yolo_dataset/dataset.yaml",
    imgsz=160,
    epochs=300,
    cache="ram",
    batch=128,
    hsv_h=0.0,        # hue augmentation
    hsv_s=0.0,        # saturation augmentation
    hsv_v=0.0,        # value (brightness) augmentation
    degrees=0.0,      # rotation
    translate=0.0,    # translation
    scale=0.0,        # scale jitter
    shear=0.0,        # shear
    perspective=0.0,  # perspective
    flipud=0.0,       # flip up-down
    fliplr=0.0,       # flip left-right
    mosaic=0.0,       # mosaic probability (float, not bool)
    mixup=0.0,        # mixup probability (float, not bool)
)
I also tested the "rmsprop" optimizer with the same starting learning rate of 0.0005 as for SSD, but the result is unusable (training loss decreases, but mAP50 stays near zero the whole time).
You misunderstood me. I wasn't telling you to resize the photos; you should keep the original 160-pixel size. I was referring to the training parameter, specifically the yolo task=detect mode=train imgsz=640 setting, not imgsz=160.
> You misunderstood me. I wasn't telling you to resize the photos; you should keep the original 160-pixel size. I was referring to the training parameter, specifically the yolo task=detect mode=train imgsz=640 setting, not imgsz=160.
The 320 networks were trained on exactly the same 160x120 dataset; the only changed parameter was imgsz=320.
@BigMuscle85 thanks for clarifying. For best results with YOLO models, we recommend using the native imgsz parameter that matches your input resolution (160x120) without upscaling, as forcing higher resolutions can degrade performance. The Ultralytics preprocessing pipeline automatically handles letterboxing and normalization (scale=1/255, no mean subtraction), which differs from SSD's 115 mean subtraction. To improve confidence scores, consider retaining YOLO's default augmentations and optimizer settings, as they're optimized for the architecture. For deployment, TensorRT often provides better latency than ONNX on edge devices.
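For instance (a sketch; "engine" is the TensorRT export format in Ultralytics and requires an NVIDIA device, so it would not apply to a plain Raspberry Pi):

from ultralytics import YOLO

model = YOLO("best.pt")
model.export(format="engine", imgsz=160)  # TensorRT engine at the native-ish input size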
Thank you. However, the default settings lead to a worse result than the original SSD network, where confidences of 1.0 are common, while YOLO hardly reaches 0.9. Do I understand correctly that the "cls" parameter should emphasize object classification and thus result in better confidences? Or does it influence the class identification only?
This is the result of YOLO11n training with default parameters for 300 epochs with a batch size of 128 (results plot from the original post omitted).
I tried implementing an Eltwise layer in Ultralytics to have an exact copy of our SSD network:
import torch
from torch import nn

class Add(nn.Module):
    """Element-wise tensor sum (equivalent of Caffe's Eltwise SUM)."""

    def __init__(self):
        super().__init__()

    def forward(self, x):
        return torch.sum(torch.stack(x), dim=0)
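A quick shape check of that module:

a, b = torch.randn(2, 1, 64, 40, 40)  # two feature maps of identical shape
out = Add()([a, b])                   # element-wise sum, like Caffe's Eltwise SUM
assert out.shape == a.shape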
I don't understand a few things. Why does Ultralytics change the input size to be a multiple of the product of all strides? The included MaxPool2d layers (kernel=3, stride=2) should downsample the input from 160 down to 1x1 in the last layer.
WARNING ⚠️ imgsz=[160] must be multiple of max stride 64, updating to [192]
Our network is based on SqueezeNet, which is intended to be a light network with a low number of parameters. Our SSD implementation has 1.92M parameters in total, but the Ultralytics implementation results in a huge number of parameters:
from n params module arguments
0 -1 1 1856 ultralytics.nn.modules.conv.Conv [3, 64, 3, 2, 0]
1 -1 1 0 torch.nn.modules.pooling.MaxPool2d [3, 2, 0, 1, False, True]
2 -1 1 2112 ultralytics.nn.modules.conv.Conv [64, 32, 1, 1, 0]
3 -1 1 18560 ultralytics.nn.modules.conv.Conv [32, 64, 3, 1, 1]
4 -1 1 2112 ultralytics.nn.modules.conv.Conv [64, 32, 1, 1, 0]
5 -1 1 18560 ultralytics.nn.modules.conv.Conv [32, 64, 3, 1, 1]
6 [-1, -3] 1 0 ultralytics.nn.modules.conv.Add []
7 -1 1 0 torch.nn.modules.pooling.MaxPool2d [3, 2, 1]
8 -1 1 4224 ultralytics.nn.modules.conv.Conv [64, 64, 1, 1, 0]
9 -1 1 73984 ultralytics.nn.modules.conv.Conv [64, 128, 3, 1, 1]
10 -1 1 8320 ultralytics.nn.modules.conv.Conv [128, 64, 1, 1, 0]
11 -1 1 73984 ultralytics.nn.modules.conv.Conv [64, 128, 3, 1, 2, 1, 2]
12 [-1, -3] 1 0 ultralytics.nn.modules.conv.Add []
13 -1 1 0 torch.nn.modules.pooling.MaxPool2d [3, 2, 0, 1, False, True]
14 -1 1 12480 ultralytics.nn.modules.conv.Conv [128, 96, 1, 1, 0]
15 -1 1 166272 ultralytics.nn.modules.conv.Conv [96, 192, 3, 1, 1]
16 -1 1 18624 ultralytics.nn.modules.conv.Conv [192, 96, 1, 1, 0]
17 -1 1 166272 ultralytics.nn.modules.conv.Conv [96, 192, 3, 1, 1]
18 [-1, -3] 1 0 ultralytics.nn.modules.conv.Add []
19 -1 1 12416 ultralytics.nn.modules.conv.Conv [192, 64, 1, 1, 0]
20 -1 1 147968 ultralytics.nn.modules.conv.Conv [64, 256, 3, 1, 1]
21 -1 1 16512 ultralytics.nn.modules.conv.Conv [256, 64, 1, 1, 0]
22 -1 1 147968 ultralytics.nn.modules.conv.Conv [64, 256, 3, 1, 1]
23 [-1, -3] 1 0 ultralytics.nn.modules.conv.Add []
24 -1 1 0 torch.nn.modules.pooling.MaxPool2d [3, 2, 1]
25 -1 1 24768 ultralytics.nn.modules.conv.Conv [256, 96, 1, 1, 0]
26 -1 1 332544 ultralytics.nn.modules.conv.Conv [96, 384, 3, 1, 1]
27 -1 1 0 torch.nn.modules.pooling.MaxPool2d [3, 2, 0, 1, False, True]
28 -1 1 24704 ultralytics.nn.modules.conv.Conv [384, 64, 1, 1, 0]
29 -1 1 147968 ultralytics.nn.modules.conv.Conv [64, 256, 3, 1, 1]
30 -1 1 0 torch.nn.modules.pooling.AvgPool2d [3, 1, 1, True]
31 23 1 33024 ultralytics.nn.modules.conv.Conv [256, 128, 1, 1, 0]
32 -1 1 0 torch.nn.modules.upsampling.Upsample [None, 2, 'bilinear']
33 26 1 49408 ultralytics.nn.modules.conv.Conv [384, 128, 1, 1, 0]
34 -1 1 0 torch.nn.modules.upsampling.Upsample [None, 4, 'bilinear']
35 [12, 32, 34] 1 0 ultralytics.nn.modules.conv.Add []
36 -1 1 256 torch.nn.modules.batchnorm.BatchNorm2d [128]
37 -1 1 8320 ultralytics.nn.modules.conv.Conv [128, 64, 1, 1, 0]
38 -1 1 73984 ultralytics.nn.modules.conv.Conv [64, 128, 3, 1, 1]
39 [23, 26, 29, 30, 38] 1 6853481 ultralytics.nn.modules.head.Detect [5, [256, 384, 256, 256, 128]]
squeezenet summary: 177 layers, 8,440,681 parameters, 8,440,665 gradients, 27.3 GFLOPs
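Note that most of that total sits in the Detect head (row 39: ~6.85M of 8.44M parameters), since it attaches to five feature maps with 128-384 channels. A quick way to confirm where the parameters are (a sketch; the YAML filename is hypothetical, and model.model.model is Ultralytics' internal layer list):

from ultralytics import YOLO

model = YOLO("squeezenet-ssd.yaml")  # the custom YAML from this thread
total = sum(p.numel() for p in model.model.parameters())
head = sum(p.numel() for p in model.model.model[-1].parameters())  # Detect is the last module
print(f"total: {total:,}  detect head: {head:,}")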
I trained this network for 1500 epochs with the Adam optimizer and lr0=0.0005, but it tends to overfit after about 200 epochs, and the results at the best epoch are similar to YOLO11n, not to our original network.
@BigMuscle85 the cls parameter is the classification loss gain; it only weights the classification term of the training loss and does not directly boost confidence scores. Also, the input dimensions are automatically adjusted to be multiples of the maximum stride to ensure proper downsampling through the network layers; this is necessary to avoid misalignment between feature maps.
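The rounding in the warning is just ceil-to-multiple (a sketch):

import math

stride_max = 64  # this custom model's maximum stride
imgsz = 160
print(math.ceil(imgsz / stride_max) * stride_max)  # 192, matching the warning above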
Question
Hello, in our previous project, we successfully implemented object detection in images from a 160x120 infrared camera on a Raspberry Pi 4. We used a SqueezeNet-SSD network trained with the CaffeSSD framework. Its performance was very good for our needs (about 60 ms per frame), with excellent accuracy on normal-sized objects (confidence mostly over 90%) but lower accuracy on smaller objects (mostly detected correctly, but with very low confidence, around 30%).
Later, we stripped SqueezeNet's fire modules down to simple squeeze-expand blocks, added feature fusion for the first SSD layer, and modified the prior-box ratios to match our dataset. We reached a detection speed of about 30 ms per frame and excellent accuracy for all objects.
In our upcoming project, we are continuing with a similar task, but we would like to use a more modern approach, because the Caffe framework has not been maintained for years. We're experimenting with the Ultralytics framework, and it looks very modern to us. We're also thinking about switching to a Raspberry Pi 5, maybe with the Hailo-8 kit, which is not supported by CaffeSSD, so Ultralytics seems to be a good way to go.
Our dataset consists of 5000 training grayscale images and 1000 testing images with a resolution of 160x120. Many augmented versions of each training image were added to the training dataset, so it has over 40000 images. We identify 5 types of objects, for example face (about 64x100 px) and eyes (about 45x10 px). It's exactly the same dataset that was used for training our SSD networks. We have now trained several versions of YOLOv11 with a batch size of 128 for 300 epochs. The results are good, but not as good as our original SSD network. I would like to share our benchmarks with others:
Detection speed
(speed benchmark table from the original post omitted)
Even the nano version of YOLOv11 is much slower than the original SqueezeNet-SSD. Although we would prefer better times, it is still usable for our needs, especially since we're considering the Hailo-8.
Detection accuracy
I don't have specific objective statistics here, but visually it is worse. Even the yolo11m-320 version gives worse results: the rectangles are not as exact, confidences are lower, and there is a somewhat higher number of false results. For illustration, on 1000 validation images (mean wIoU is the average IoU over all detections above a 0.50 threshold, weighted by the confidence):
SEnet-FSSD-160x120 - total detections: 2014, false positives: 6, false negatives: 9, mean wIoU: 0.892
SqueezeNet-SSD - total detections: 2029, false positives: 59, false negatives: 47, mean wIoU: 0.855
Yolo11n-160 - total detections: 2027, false positives: 28, false negatives: 18, mean wIoU: 0.851
Yolo11s-160 - total detections: 2023, false positives: 26, false negatives: 20, mean wIoU: 0.859
Yolo11m-320 - total detections: 2078, false positives: 71, false negatives: 12, mean wIoU: 0.845
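For reference, the wIoU metric as defined above could be computed like this (a sketch assuming arrays of per-detection IoUs and confidences):

import numpy as np

def mean_weighted_iou(ious, confs, thr=0.50):
    # mean IoU over detections with IoU >= thr, weighted by confidence
    ious, confs = np.asarray(ious), np.asarray(confs)
    keep = ious >= thr
    return float(np.average(ious[keep], weights=confs[keep]))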
Maybe the problem lies in the training hyperparameters. We just set the batch size to 128 and the number of epochs to 300. I would appreciate any ideas. Thank you!
Meanwhile, I've been trying to simulate our SEnet-FSSD using a YAML model in Ultralytics. I don't know if it is a good idea; I just want to see if it changes anything. I made a pure copy of our network, but it is not possible to train it because of layer size mismatches. The MaxPool2d layers don't seem to downscale the resolution the same way as in the Caffe framework. There is also no Eltwise (element-wise sum) layer, so I had to change it to a Concat layer. Adding padding=1 to all MaxPool2d layers works, but it automatically changes the input resolution to 192. The results are practically very similar to the other YOLO models, not to our original network.
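One likely cause of that size mismatch: Caffe rounds pooled output sizes up (ceil), while PyTorch's MaxPool2d defaults to floor, so ceil_mode=True may reproduce Caffe's downsampling without extra padding (a sketch):

import torch
from torch import nn

pool = nn.MaxPool2d(kernel_size=3, stride=2, ceil_mode=True)  # Caffe-style rounding
x = torch.zeros(1, 64, 160, 160)
print(pool(x).shape)  # torch.Size([1, 64, 80, 80]); floor mode would give 79x79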
Here is the model, a 1:1 rewrite of our SSD network. Maybe someone will be able to fix it:
Additional
EDIT: I forgot to mention that all detections are done using OpenCV in C++.