ultralytics / yolov5

YOLOv5 🚀 in PyTorch > ONNX > CoreML > TFLite
https://docs.ultralytics.com
GNU Affero General Public License v3.0

Understanding SPP and SPPF implementation #8785

Closed UnglvKitDe closed 2 years ago

UnglvKitDe commented 2 years ago

Search before asking

Question

Hey, I am having trouble understanding the implementation of SPP/SPPF. In the original SPPNet paper, the idea was that you get the same output size for different input image sizes. But this is not the case with the YOLOv5 implementation: the output scales with the image size, which contradicts the original idea. In the original paper the point was that you could attach an FC layer afterwards. YOLOv5 uses a conv layer instead, but still. So why is it called SPP? Or am I misunderstanding something? Thank you :)

Additional

Small Example:

import torch
from models.common import SPPF

A = torch.ones((1, 64, 16, 16))
B = torch.ones((1, 64, 8, 8))
model = SPPF(64, 8)

out_A = model(A)
out_B = model(B)
out_A.shape, out_B.shape
>> (torch.Size([1, 8, 16, 16]), torch.Size([1, 8, 8, 8]))

glenn-jocher commented 2 years ago

@UnglvKitDe SPP is implemented in YOLOv3 originally by Joseph Redmon (adapted from the Spatial Pyramid Pooling paper). Our SPP module is exactly the same. SPPF is an optimized version of SPP I created myself that is mathematically identical with fewer FLOPs.
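For a concrete picture, here is a minimal sketch (illustrative only; the real SPP/SPPF modules in models/common.py also wrap the pooling in 1x1 convs) of why one k=5 max-pool applied repeatedly reproduces SPP's parallel k=5, 9, 13 pools: two stacked stride-1 5x5 pools cover a 9x9 window, and three cover a 13x13 window.

import torch
import torch.nn as nn

x = torch.randn(1, 64, 16, 16)

# SPP-style: identity plus three parallel stride-1 max-pools, concatenated
pools = [nn.MaxPool2d(k, stride=1, padding=k // 2) for k in (5, 9, 13)]
spp_out = torch.cat([x] + [p(x) for p in pools], dim=1)

# SPPF-style: one k=5 pool applied three times, reusing intermediate results
m = nn.MaxPool2d(5, stride=1, padding=2)
y1 = m(x)
y2 = m(y1)  # equivalent to a single 9x9 pool
y3 = m(y2)  # equivalent to a single 13x13 pool
sppf_out = torch.cat([x, y1, y2, y3], dim=1)

print(torch.allclose(spp_out, sppf_out))  # True

Reusing y1 and y2 instead of pooling the full 9x9 and 13x13 windows from scratch is where the FLOP savings come from.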

github-actions[bot] commented 2 years ago

👋 Hello, this issue has been automatically marked as stale because it has not had recent activity. Please note it will be closed if no further activity occurs.

Feel free to inform us of any other issues you discover or feature requests that come to mind in the future. Pull Requests (PRs) are also always welcomed!

Thank you for your contributions to YOLOv5 🚀 and Vision AI ⭐!

harshdhamecha commented 1 year ago

I am also having a hard time understanding SPP(F). @glenn-jocher, can you please shed more light on this? For example, what is the exact purpose and intuition behind using SPP(F)?

cusacrt1 commented 1 year ago

From my understanding, SPP is not used in the traditional sense in YOLOv5. It is rather a concatenation of repeatedly pooled layers: each pooled layer is a more coarse-grained representation of the previous one. Maybe this allows the model to focus in on certain features?

With regard to arbitrary input sizes and a fixed output, it looks like YOLOv5 builds the target labels based on the output of the detector, so there is no dimension mismatch when training models with different image sizes. A rough sketch of that idea follows below.
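Here is a rough sketch of that idea (a hypothetical helper, not YOLOv5's actual build_targets): because the label grid is derived from the feature-map size, it scales with the image automatically.

import torch

def make_target_grid(img_size, stride=32, na=3, no=85):
    # Hypothetical: grid dims follow the image size, so the target tensor
    # always matches the head output shape (batch, anchors, gy, gx, outputs)
    g = img_size // stride
    return torch.zeros(1, na, g, g, no)

print(make_target_grid(640).shape)  # torch.Size([1, 3, 20, 20, 85])
print(make_target_grid(512).shape)  # torch.Size([1, 3, 16, 16, 85])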

glenn-jocher commented 1 year ago

@cusacrt1 although SPP (Spatial Pyramid Pooling) was originally used in YOLOv3, its implementation in YOLOv5 is slightly different. In YOLOv5, SPP is replaced by a module called SPPF (Spatial Pyramid Pooling - Fast), which is an optimized version with the same mathematical functionality but fewer floating-point operations (FLOPs).

The purpose of SPPF is to provide a multi-scale representation of the input feature maps. By pooling at different scales, SPPF allows the model to capture features at various levels of abstraction. This can be particularly useful in object detection, where objects of different sizes may need to be detected.

In terms of arbitrary input sizes, YOLOv5 builds the target labels based on the output of the detector. This means that the model can handle different image sizes without causing dimension mismatch during training.

I hope this clarifies the purpose and usage of SPPF in YOLOv5. Feel free to ask further questions if anything is still unclear!

belafdil-chakib commented 8 months ago

@glenn-jocher: As the authors note in the original SPPNet paper, the output of an SPP block remains constant regardless of the size of the input image. In the paper, the pooling stride is directly proportional to the input resolution, enabling the extraction of a fixed-size feature vector (4096 in the paper). However, your implementation lacks this property: with image sizes of 640x640 or 512x512, the output feature maps are 80x80 and 64x64 respectively.

Could there be a discrepancy in your explanation, or does your SPP implementation deviate from the original?

glenn-jocher commented 8 months ago

@belafdil-chakib apologies for any confusion. The SPP layer in YOLOv5 does indeed differ from the original SPPNet paper's implementation. In YOLOv5, the SPP layer does not output a fixed-size vector but rather a fixed number of channels with spatial dimensions that depend on the input size. The purpose of SPP in YOLOv5 is to aggregate context at different scales and enhance the receptive field, which is beneficial for detecting objects of various sizes.

The SPPF (Spatial Pyramid Pooling - Fast) layer in YOLOv5 is an optimized version of SPP that uses fewer resources while maintaining the benefits of multi-scale feature aggregation. It does not produce a fixed-length output vector as in the original SPPNet paper but instead concatenates features from max-pooling layers with different kernel sizes to preserve spatial dimensions.
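To make the contrast concrete, here is a minimal sketch (illustrative only, not code from either project) of the two behaviors: SPPNet-style adaptive binning gives a fixed-length vector for any input size, while YOLOv5-style stride-1 pooling keeps the spatial grid and only grows the channel count.

import torch
import torch.nn.functional as F

def sppnet_style(x, bins=(1, 2, 4)):
    # SPPNet-paper behavior: adaptive bins -> (1 + 4 + 16) * C values,
    # no matter the input H x W
    return torch.cat([F.adaptive_max_pool2d(x, b).flatten(1) for b in bins], dim=1)

def yolov5_style(x, k=5):
    # YOLOv5 SPPF behavior: stride-1 pooling with padding k//2 keeps H x W;
    # only the channel count grows (4x, before the following 1x1 conv)
    y1 = F.max_pool2d(x, k, stride=1, padding=k // 2)
    y2 = F.max_pool2d(y1, k, stride=1, padding=k // 2)
    y3 = F.max_pool2d(y2, k, stride=1, padding=k // 2)
    return torch.cat([x, y1, y2, y3], dim=1)

for s in (16, 8):
    x = torch.randn(1, 64, s, s)
    print(sppnet_style(x).shape, yolov5_style(x).shape)
# torch.Size([1, 1344]) torch.Size([1, 256, 16, 16])
# torch.Size([1, 1344]) torch.Size([1, 256, 8, 8])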

The YOLOv5 architecture is fully convolutional, which allows it to handle varying input sizes, and the network is designed to work with these variable-sized feature maps throughout the model. The output size will change with the input size, but the detection heads are designed to work with these variable-sized outputs, which is why there is no dimension mismatch during training or inference.

I hope this clears up the purpose and behavior of the SPP and SPPF layers in YOLOv5. If you have more questions, feel free to ask!