ultralytics / yolov5

YOLOv5 🚀 in PyTorch > ONNX > CoreML > TFLite
https://docs.ultralytics.com
GNU Affero General Public License v3.0

How to see tensor shape? #1277

Closed: a-esp-1 closed this issue 3 years ago

a-esp-1 commented 4 years ago

❔Question

Hi,

I would like to see the tensor shape as mentioned in #1015. I tried printing the pred variable at the end of detect.py, but I get lots of values, not just 3 as in the issue mentioned.

In blogs I can see the format [13, 13, 255], so I would like to see something similar (with different numbers).

Thank you in advance.

Additional context

glenn-jocher commented 4 years ago

@a-esp-1 print inference output shape:

print(model(img)[0].shape)
torch.Size([1, 18900, 85])
a-esp-1 commented 4 years ago

@glenn-jocher thank you so much!

I have another question. In blogs I can see that the output tensor is [13, 13, 255]. This means a 13x13 grid is used and 3*(4+1+80) = 255 (3 anchors, (x,y,w,h) plus objectness, and 80 classes), so how can my output, [1, 6172047, 11], be interpreted?

I use 6 classes, so 11 (4+1+6 = 11), but I don't understand the first two numbers.

My img-size is 10016.

Thank you in advance.

a-esp-1 commented 4 years ago

Ok I think I already found the solution to my question.

If I use a 416 img-size I get: torch.Size([1, 10647, 11]).

11 = 6 classes + 4 (x,y,w,h) + 1 (objectness score). With an image size of 416 the grids used are 13x13, 26x26 and 52x52, so 10647 = (13x13 + 26x26 + 52x52) x 3 (3 anchors per grid point).

glenn-jocher commented 4 years ago

@a-esp-1 yes that's correct. YOLOv5 default --img-size is 640, not 416. You divide 640 by the P3-P5 output strides of 8, 16, 32 to arrive at grid sizes.
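
A minimal sketch of that arithmetic (assuming square inputs, the default P3-P5 strides of 8/16/32, and 3 anchors per grid cell):

# Sketch: prediction count for a square input at the P3-P5 strides,
# assuming the default 3 anchors per grid cell
def num_predictions(imgsz, strides=(8, 16, 32), anchors=3):
    return sum(anchors * (imgsz // s) ** 2 for s in strides)

print(num_predictions(416))  # 10647 = 3 * (52*52 + 26*26 + 13*13)
print(num_predictions(640))  # 25200 = 3 * (80*80 + 40*40 + 20*20)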

a-esp-1 commented 4 years ago

@glenn-jocher thank you so much!

heeda88 commented 4 years ago

Thank you, this Q&A was helpful to me.

github-actions[bot] commented 3 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

Svetlana-art commented 3 years ago

Hi, my input image shape is (16, 3, 512, 512) and my output shape is torch.Size([16, 16128, 85]). I use only one class for detection. Please help me understand how to interpret my output shape.

glenn-jocher commented 3 years ago

@Svetlana-art the default models have 3 outputs, P3, P4, P5. You divide your image size, i.e. 512, by the P3-P5 output strides of 8, 16, 32 to arrive at grid sizes of 64x64, 32x32, 16x16. Each grid point has 3 anchors by default, and each anchor contains a vector 5+nc long, where nc is the number of classes the model has. Therefore the default model with 80 classes has 1x3x64x64x85 + 1x3x32x32x85 + 1x3x16x16x85 output points. These are flattened and reshaped into the output shape you see (16128x85), and you have 16 images in your first dimension.
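
The same arithmetic reproduces the 16128 above; a quick sketch, assuming the default strides and anchors:

# Sketch: reproduce the 16128 predictions for a 512x512 input
# (assumes the default strides 8/16/32 and 3 anchors per grid cell)
n = sum(3 * (512 // s) ** 2 for s in (8, 16, 32))
print(n)  # 16128 = 3 * (64*64 + 32*32 + 16*16)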

ithmz commented 3 years ago

Hi @glenn-jocher,

print(model(img)[0].shape)
torch.Size([1, 18900, 85])

Should it be 25200 instead of 18900? Can you elaborate? Thanks

glenn-jocher commented 3 years ago

@tsangz189 Output shape is a function of input shape.
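
For instance, 18900 corresponds to a non-square input; a sketch assuming a 480x640 letterboxed input with the default strides and anchors:

# Sketch: 18900 arises from a non-square input, e.g. 480x640 after letterboxing
# (assumes the default strides 8/16/32 and 3 anchors per grid cell)
h, w = 480, 640
n = sum(3 * (h // s) * (w // s) for s in (8, 16, 32))
print(n)  # 18900; a 640x640 input gives 25200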

iaverypadberg commented 2 years ago

@glenn-jocher is it a bad idea to reshape the flattened output so that it plays nicely with other libraries that expect shapes like 1x3x6300x85? Or is there a way to modify the input model to achieve this?

glenn-jocher commented 2 years ago

@iaverypadberg 👋 Hello! Thanks for asking about handling inference results. You can use detect.py and PyTorch Hub for inference of trained YOLOv5 models. Code customization is outside the scope of our support.

YOLOv5 🚀 PyTorch Hub models allow for simple model loading and inference in a pure python environment without using detect.py.

Simple Inference Example

This example loads a pretrained YOLOv5s model from PyTorch Hub as model and passes an image for inference. 'yolov5s' is the YOLOv5 'small' model. For details on all available models please see the README. Custom models can also be loaded, including custom trained PyTorch models and their exported variants, i.e. ONNX, TensorRT, TensorFlow, OpenVINO YOLOv5 models.

import torch

# Model
model = torch.hub.load('ultralytics/yolov5', 'yolov5s')  # or yolov5m, yolov5l, yolov5x, etc.
# model = torch.hub.load('ultralytics/yolov5', 'custom', 'path/to/best.pt')  # custom trained model

# Images
im = 'https://ultralytics.com/images/zidane.jpg'  # or file, Path, URL, PIL, OpenCV, numpy, list

# Inference
results = model(im)

# Results
results.print()  # or .show(), .save(), .crop(), .pandas(), etc.

results.xyxy[0]  # im predictions (tensor)
results.pandas().xyxy[0]  # im predictions (pandas)
#      xmin    ymin    xmax   ymax  confidence  class    name
# 0  749.50   43.50  1148.0  704.5    0.874023      0  person
# 1  433.50  433.50   517.5  714.5    0.687988     27     tie
# 2  114.75  195.75  1095.0  708.0    0.624512      0  person
# 3  986.00  304.00  1028.0  420.0    0.286865     27     tie

See YOLOv5 PyTorch Hub Tutorial for details.

Good luck 🍀 and let us know if you have any other questions!
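
Regarding the reshape question above: the flattened output simply concatenates the three scales, so it can be split back apart. A minimal sketch (assuming a 640x640 input, the default strides 8/16/32, 3 anchors per grid cell, and P3-P5 concatenation order):

import torch

pred = torch.zeros(1, 25200, 85)  # placeholder for model(img)[0] at 640x640
grids = [(640 // s, 640 // s) for s in (8, 16, 32)]  # (80,80), (40,40), (20,20)
counts = [3 * h * w for h, w in grids]  # 19200, 4800, 1200 predictions per scale
per_scale = [p.reshape(1, 3, h, w, 85) for p, (h, w) in zip(pred.split(counts, dim=1), grids)]
print([tuple(t.shape) for t in per_scale])  # [(1,3,80,80,85), (1,3,40,40,85), (1,3,20,20,85)]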

spacewalk01 commented 2 years ago

@glenn-jocher I wonder, is the order of the 85 values the same as in this figure?

[figure: output vector layout from the referenced thesis]

reference: https://www.researchgate.net/publication/349929458_Bachelor_Thesis_Development_and_Deployment_of_a_Perception_Stack_for_the_Formula_Student_Driverless_Competition/figures?lo=1

glenn-jocher commented 2 years ago

@spacewalk01 no. The first 5 values are x, y, w, h, objectness; the class probabilities follow.
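
In other words, for a single 85-long prediction vector (a sketch; pred here is an assumed placeholder for one row of the output):

import torch

pred = torch.zeros(85)  # placeholder: one prediction row, e.g. model(img)[0][0, i]
x, y, w, h, obj = pred[:5]  # box center, box size, objectness
cls = pred[5:]              # the 80 class scores follow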

marcpfuller commented 1 year ago

@glenn-jocher, Hi, I think this may have been answered, but I am very new to the ML/CV space, so any help would be appreciated. My input dim is [1,3,416,416], and I am using OpenVINO model server + gocv + gRPC. The output is sent back as [][]byte and I am trying to understand it. OpenVINO model server says my output shapes are [1, 255, 52, 52], [1, 255, 26, 26], [1, 255, 13, 13] respectively. Is this correct? I am trying to convert the [][]byte to a matrix and am lost as to what the values mean... Help :)

glenn-jocher commented 1 year ago

@marcpfuller Yes, the returned YOLOv5 output tensor shapes correspond to the size of the feature maps (i.e., grid sizes) at detection scales P3, P4, and P5.

As a general rule, the spatial resolution decreases and the feature depth increases at each successive detection scale. For your 416 input the grid sizes are [52, 26, 13] for the P3, P4, and P5 detection scales, respectively, and there are 3 anchors for each grid point (see https://arxiv.org/abs/2004.10934 for background).

You can think of each anchor as containing 85 values, which are the net outputs for that anchor in the form [4 bbox coords (center_x, center_y, width, height), 1 objectness score, 80 class probabilities].

In the flattened tensor of shape (1, 255, 52, 52), the 255 channels pack the bounding box, objectness score, and class probabilities for each of the 3 anchors. To manipulate and make sense of the data you will have to reshape it into a higher-dimension tensor with the dimensions your application needs.
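
As a concrete illustration, a minimal numpy sketch (assuming the 255 channels pack 3 anchors x 85 values, anchor-major, as in the YOLOv5 head):

import numpy as np

out = np.zeros((1, 255, 52, 52), dtype=np.float32)  # placeholder for one raw output
out = out.reshape(1, 3, 85, 52, 52).transpose(0, 1, 3, 4, 2)
print(out.shape)  # (1, 3, 52, 52, 85): anchor, grid y, grid x, 85 values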

I hope this helps you to understand the output of the YOLOv5 model better. Let us know if you have any further questions or doubts.

marcpfuller commented 1 year ago

@glenn-jocher, Hi, thank you for the help. I was able to parse the [][]byte output from the OpenVINO model server via the KServe API + golang + gocv.

Let me explain what I did for those who are looking to do the same thing. The model server puts the 3 tensor outputs into a [][]byte, that is, each []byte is one tensor's output data. Use gocv.NewMatWithSizesFromBytes() to reshape the outputs to [1,3,52,52,85], [1,3,26,26,85] and [1,3,13,13,85] respectively. Now you only need to loop through the matrix! I hope this helps someone. I spent too many sleepless nights trying to figure this out.

for i := 0; i < tensorHeight; i++ {
        for j := 0; j < tensorWidth; j++ {
            for k := 0; k < 3; k++ {
                // Offset of the 85-value vector for anchor k at grid cell (i, j)
                index := (i*tensorWidth+j)*85*3 + k*85
                _ = index // read the 85 values at [index, index+85) here
            }
        }
}


glenn-jocher commented 1 year ago

@marcpfuller thank you for sharing your solution! I'm glad to hear that you were able to parse the YOLOv5 output using the k-serve API, golang and gocv. Your solution will be helpful for others who are also working with OpenVINO model server and Go. Thank you for sharing your code snippet, and feel free to reach out with any further questions or concerns.

marcpfuller commented 1 year ago

@glenn-jocher, one more question, if you don't mind. The values I see in the matrix are supposed to be [(center_x, center_y, width, height), 1 objectness score, and 80 class probabilities], right? The center_x, center_y, width, height values seem to be really small or big (1.1234), and some are even negative (-0.1234). Is there some normalization I need to do to fit the boxes onto the image? If so, could you point me in the direction of how to do these calculations? I would like to get the x1y1, x2y2 coordinates.

glenn-jocher commented 1 year ago

@marcpfuller The first 4 values from each output position ([center_x, center_y, width, height]) are normalized relative to the image width and height. Typically they are scaled to the range [0, 1], representing the fraction of the full image width or height.

To convert them to pixel coordinates, you would unnormalize the values by multiplying the center_x and width by the image width, and the center_y and height by the image height. Then you would subtract or add half the width or height respectively to obtain the x1y1, x2y2 coordinates of the bounding box.

Here's how you can do it:

// Assuming your output tensor is `output` and image width and height are `imgWidth` and `imgHeight` respectively
center_x := output[0] * imgWidth // Unnormalize center_x
center_y := output[1] * imgHeight // Unnormalize center_y
width := output[2] * imgWidth  // Unnormalize width
height := output[3] * imgHeight // Unnormalize height

x1 := center_x - width/2  // Calculate x1
y1 := center_y - height/2  // Calculate y1
x2 := center_x + width/2  // Calculate x2
y2 := center_y + height/2  // Calculate y2

This will provide you with the unnormalized coordinates of the bounding box. You can then use these coordinates to visualize or use the bounding boxes as needed. Let me know if you have any other questions or need further assistance!