ultralytics / yolov5

YOLOv5 🚀 in PyTorch > ONNX > CoreML > TFLite
https://docs.ultralytics.com
GNU Affero General Public License v3.0

Different results after inference between custom model through PyTorch Hub loader / CLI #2278

Closed mabergerx closed 3 years ago

mabergerx commented 3 years ago

❔Question

Hi, first thank you for this YOLO implementation, looks great!

For an experiment, I retrained the model from the yolov5l.pt checkpoint by following the Train Custom Data tutorial on a relatively small amount of data and just a few classes.

When I am using the CLI for inference on the example images:

! python detect.py --weights ./runs/train/exp6/weights/best.pt --img 640 --conf 0.25 --source "data/images"

a class is (wrongly) predicted on the bus image with low confidence, which is most likely just a lack-of-data issue.

Now, when predicting on the same two images by loading the custom model through PyTorch Hub:

from PIL import Image
import torch

imgs = [Image.open('data/images/bus.jpg'),  # PIL image
        'data/images/zidane.jpg',           # filename
       ]

model = torch.hub.load('ultralytics/yolov5', 'custom', path_or_model='./runs/train/exp6/weights/best.pt')

results = model(imgs, size=640)

results.print()

>>> image 1/2: 1080x810  
    image 2/2: 720x1280 1 weapon

or through the pypi Yolov5 package:

from PIL import Image
from yolov5 import YOLOv5

# set model params
model_path = './runs/train/exp6/weights/best.pt'  # path to the custom-trained weights
device = "cuda"  # or "cpu"

# init yolov5 model
yolov5 = YOLOv5(model_path, device)

# load images
image1 = Image.open("data/images/bus.jpg")
image2 = Image.open("data/images/zidane.jpg")

results = yolov5.predict([image1, image2], size=640)

results.print()

>>> Image 1/2: 1080x810  
    Image 2/2: 720x1280 1 weapons, 

In both cases no weapon class is predicted on the bus image, which contradicts the inference results from the detect.py CLI. Even more interesting: when I run inference on just the single bus image using the methods above, I do get the weapon class predicted!

So the results differ based on the number of images given to the model in a single call. Why is that happening?

And in general, what is the difference between running inference through detect.py on the command line and through loading the model with PyTorch Hub?

We suspect it might have something to do with batch normalization?

Thank you in advance for looking into it!

github-actions[bot] commented 3 years ago

👋 Hello @mabergerx, thank you for your interest in 🚀 YOLOv5! Please visit our ⭐️ Tutorials to get started, where you can find quickstart guides for simple tasks like Custom Data Training all the way to advanced concepts like Hyperparameter Evolution.

If this is a 🐛 Bug Report, please provide screenshots and minimum viable code to reproduce your issue, otherwise we can not help you.

If this is a custom training ❓ Question, please provide as much information as possible, including dataset images, training logs, screenshots, and a public link to online W&B logging if available.

For business inquiries or professional support requests please visit https://www.ultralytics.com or email Glenn Jocher at glenn.jocher@ultralytics.com.

Requirements

Python 3.8 or later with all requirements.txt dependencies installed, including torch>=1.7. To install run:

$ pip install -r requirements.txt

Environments

YOLOv5 may be run in any of the following up-to-date verified environments (with all dependencies including CUDA/CUDNN, Python and PyTorch preinstalled):

Status

[CI CPU testing badge]

If this badge is green, all YOLOv5 GitHub Actions Continuous Integration (CI) tests are currently passing. CI tests verify correct operation of YOLOv5 training (train.py), testing (test.py), inference (detect.py) and export (export.py) on MacOS, Windows, and Ubuntu every 24 hours and on every commit.

mabergerx commented 3 years ago

Forgot to add: when using CLI inference through detect.py I see the following:

Fusing layers... 
Model Summary: 392 layers, 46611336 parameters, 0 gradients, 114.1 GFLOPS

While using the PyTorch hub loading (with autoshape), I get the following:

Model Summary: 499 layers, 46642120 parameters, 46642120 gradients, 114.3 GFLOPS

What exactly is happening with the layer fusion? Might that be the explanation?

glenn-jocher commented 3 years ago

@mabergerx detect.py and PyTorch Hub are different inference pathways. Inference uses the same model (and the same forward method), though the pre- and post-processing differ. This should produce near-identical (but not mathematically equal) results between the two, provided you use the same settings (confidence threshold, IoU threshold, img-size, etc...)
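For example, the autoShape wrapper exposes these settings as attributes, so they can be matched to the detect.py flags (attribute names as in the YOLOv5 autoShape wrapper of this era; verify against your version):

model.conf = 0.25  # confidence threshold, matching detect.py --conf 0.25
model.iou = 0.45   # NMS IoU threshold, matching the detect.py default --iou 0.45
results = model(imgs, size=640)  # inference size, matching --img 640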

I don't know of any reason or precedent for different layer counts and FLOPS; this would normally indicate two different models. I will do a full comparison here to verify.

glenn-jocher commented 3 years ago

detect.py

python detect.py --source data/images --weights yolov5s.pt --conf 0.25

Namespace(agnostic_nms=False, augment=False, classes=None, conf_thres=0.25, device='', exist_ok=False, img_size=640, iou_thres=0.45, name='exp', project='runs/detect', save_conf=False, save_txt=False, source='data/images/', update=False, view_img=False, weights=['yolov5s.pt'])
YOLOv5 v4.0-96-g83dc1b4 torch 1.7.0+cu101 CUDA:0 (Tesla V100-SXM2-16GB, 16160.5MB)

Fusing layers... 
Model Summary: 224 layers, 7266973 parameters, 0 gradients, 17.0 GFLOPS
image 1/2 /content/yolov5/data/images/bus.jpg: 640x480 4 persons, 1 bus, Done. (0.010s)
image 2/2 /content/yolov5/data/images/zidane.jpg: 384x640 2 persons, 1 tie, Done. (0.011s)
Results saved to runs/detect/exp2
Done. (0.103s)


PyTorch Hub

Input:

import torch

# Model
model = torch.hub.load('ultralytics/yolov5', 'yolov5s', pretrained=True)

# Images
dir = 'https://github.com/ultralytics/yolov5/raw/master/data/images/'
imgs = [dir + f for f in ('zidane.jpg', 'bus.jpg')]  # batched list of images

# Inference
results = model(imgs)
results.print()
results.save()

Output:

Downloading: "https://github.com/ultralytics/yolov5/archive/master.zip" to /root/.cache/torch/hub/master.zip

                 from  n    params  module                                  arguments                     
  0                -1  1      3520  models.common.Focus                     [3, 32, 3]                    
  1                -1  1     18560  models.common.Conv                      [32, 64, 3, 2]                
  2                -1  1     18816  models.common.C3                        [64, 64, 1]                   
  3                -1  1     73984  models.common.Conv                      [64, 128, 3, 2]               
  4                -1  1    156928  models.common.C3                        [128, 128, 3]                 
  5                -1  1    295424  models.common.Conv                      [128, 256, 3, 2]              
  6                -1  1    625152  models.common.C3                        [256, 256, 3]                 
  7                -1  1   1180672  models.common.Conv                      [256, 512, 3, 2]              
  8                -1  1    656896  models.common.SPP                       [512, 512, [5, 9, 13]]        
  9                -1  1   1182720  models.common.C3                        [512, 512, 1, False]          
 10                -1  1    131584  models.common.Conv                      [512, 256, 1, 1]              
 11                -1  1         0  torch.nn.modules.upsampling.Upsample    [None, 2, 'nearest']          
 12           [-1, 6]  1         0  models.common.Concat                    [1]                           
 13                -1  1    361984  models.common.C3                        [512, 256, 1, False]          
 14                -1  1     33024  models.common.Conv                      [256, 128, 1, 1]              
 15                -1  1         0  torch.nn.modules.upsampling.Upsample    [None, 2, 'nearest']          
 16           [-1, 4]  1         0  models.common.Concat                    [1]                           
 17                -1  1     90880  models.common.C3                        [256, 128, 1, False]          
 18                -1  1    147712  models.common.Conv                      [128, 128, 3, 2]              
 19          [-1, 14]  1         0  models.common.Concat                    [1]                           
 20                -1  1    296448  models.common.C3                        [256, 256, 1, False]          
 21                -1  1    590336  models.common.Conv                      [256, 256, 3, 2]              
 22          [-1, 10]  1         0  models.common.Concat                    [1]                           
 23                -1  1   1182720  models.common.C3                        [512, 512, 1, False]          
 24      [17, 20, 23]  1    229245  models.yolo.Detect                      [80, [[10, 13, 16, 30, 33, 23], [30, 61, 62, 45, 59, 119], [116, 90, 156, 198, 373, 326]], [128, 256, 512]]
Model Summary: 283 layers, 7276605 parameters, 7276605 gradients

Downloading https://github.com/ultralytics/yolov5/releases/download/v4.0/yolov5s.pt to yolov5s.pt...
100%
14.1M/14.1M [00:01<00:00, 9.47MB/s]

Adding autoShape... 
image 1/2: 720x1280 2 persons, 1 tie
image 2/2: 1080x810 4 persons, 1 bus
Saving results/zidane.jpg, results/bus.jpg, done.


glenn-jocher commented 3 years ago

@mabergerx everything looks good!

The PyTorch Hub model summary is printed before layer fusing; fusing reduces the layer and parameter count.
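For context, a minimal sketch of what Conv+BatchNorm fusion does, folding the frozen BN statistics into the convolution weights (this illustrates the technique only; the function here is hypothetical, not the actual YOLOv5 utils implementation):

import torch
import torch.nn as nn

def fuse_conv_bn(conv: nn.Conv2d, bn: nn.BatchNorm2d) -> nn.Conv2d:
    # Fold BN(x) = gamma * (x - mean) / sqrt(var + eps) + beta into the conv.
    fused = nn.Conv2d(conv.in_channels, conv.out_channels, conv.kernel_size,
                      stride=conv.stride, padding=conv.padding,
                      dilation=conv.dilation, groups=conv.groups, bias=True)
    scale = bn.weight / torch.sqrt(bn.running_var + bn.eps)  # per-channel gamma / std
    fused.weight.data = conv.weight.data * scale.reshape(-1, 1, 1, 1)
    b = conv.bias.data if conv.bias is not None else torch.zeros(conv.out_channels, device=conv.weight.device)
    fused.bias.data = bn.bias.data + (b - bn.running_mean) * scale
    return fused

Since the BN running statistics are frozen into the weights, fusion is only valid in eval() mode. Note also that in eval mode BatchNorm uses those stored statistics rather than per-batch statistics, so batch size does not affect BN outputs either way.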

mabergerx commented 3 years ago

Hmm, interesting. I will also test this with yolov5s.pt on my machine as a sanity check, but do you have any intuition about the different inference results between predicting on multiple images versus just one using PyTorch Hub?

glenn-jocher commented 3 years ago

@mabergerx the hub model has a smart batch constructor that merges differently sized images into a single uniform batch, using the minimum padding required to satisfy two constraints: every image in the batch shares the same height and width, and both dimensions are multiples of the model's max stride (typically 32).

This means that the padding on individual images may vary as a function of the other images in the batch.

In short, single-image and multi-image batches may differ.
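A minimal sketch of that shape logic (an illustration of the behavior described above under assumed defaults, not the actual autoShape code):

import math

def batch_shape(hw_list, size=640, stride=32):
    # Scale each (height, width) so its long side equals `size`, then take the
    # per-dimension max over the batch and round up to a stride multiple.
    scaled = [(h * size / max(h, w), w * size / max(h, w)) for h, w in hw_list]
    h = math.ceil(max(s[0] for s in scaled) / stride) * stride
    w = math.ceil(max(s[1] for s in scaled) / stride) * stride
    return h, w

print(batch_shape([(1080, 810)]))               # bus.jpg alone -> (640, 480)
print(batch_shape([(1080, 810), (720, 1280)]))  # with zidane.jpg -> (640, 640)

bus.jpg alone is letterboxed to 640x480 (matching the detect.py log above), but batched together with zidane.jpg the common shape grows to 640x640, so the padding around bus.jpg, and therefore the exact network input, changes.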

mabergerx commented 3 years ago

@glenn-jocher I see, thanks for the explanation. It seems like the size parameter in the Hub call doesn't have any effect then?

Input:

results = model(imgs, size=640)

results.print()

Output (note the resolution):

image 1/1: 1088x832 4 persons, 1 bus

For another test, I tried first resizing the picture and saving it:

image_ = Image.open("data/images/bus.jpg").resize((832, 1088))  # PIL resize takes (width, height)
image_.save("data/images/resized_bus.jpg")

The inference results are then the following:

Through detect.py

Input:

! python detect.py --weights yolov5l.pt --conf 0.25 --img 1088 --source "data/images/"

Output (note the same resolution of bus.jpg and resized_bus.jpg, but a different prediction?):

Namespace(agnostic_nms=False, augment=False, classes=None, conf_thres=0.25, device='', exist_ok=False, img_size=1088, iou_thres=0.45, name='exp', project='runs/detect', save_conf=False, save_txt=False, source='data/images/', update=False, view_img=False, weights=['yolov5l.pt'])
YOLOv5 v4.0-93-g95aefea torch 1.7.1 CUDA:0 (Tesla P100-PCIE-16GB, 16280.875MB)

Fusing layers... 
Model Summary: 392 layers, 47025981 parameters, 0 gradients, 115.4 GFLOPS
image 1/3 /home/jupyter/yolov5/data/images/bus.jpg: 1088x832 4 persons, 1 bicycle, 1 bus, 1 tie, Done. (0.048s)
image 2/3 /home/jupyter/yolov5/data/images/resized_bus.jpg: 1088x832 4 persons, 1 bicycle, 1 bus, Done. (0.047s)
image 3/3 /home/jupyter/yolov5/data/images/zidane.jpg: 640x1088 2 persons, 2 ties, Done. (0.042s)
Results saved to runs/detect/exp39
Done. (0.248s)

Through PyTorch Hub loader

Input:

model = torch.hub.load('ultralytics/yolov5', 'yolov5l', pretrained=True)

imgs = [
        'data/images/resized_bus.jpg',  # filename
       ]  

results = model(imgs)

results.print()

Output (note the different prediction result given the same image size and a single-image batch):

image 1/1: 1088x832 4 persons, 1 bus

Any intuition on why this would be happening, beyond the pre- and post-processing differences?

Thank you very much!

EDIT: We found out that the print statement from PyTorch Hub reported a confusing image resolution; after testing with an explicit size setting, we found that the inference results are consistent.

glenn-jocher commented 3 years ago

@mabergerx the size argument defines the inference size for the long side of the batch.
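So, for example, to reproduce the detect.py run above from Hub (same model and images assumed), pass the same long-side size:

results = model(imgs, size=1088)  # long side letterboxed to 1088, like detect.py --img 1088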

github-actions[bot] commented 3 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.