glenn-jocher commented 4 years ago

🚀 Feature

Andrej Karpathy lists overfitting as one of the main steps in a training pipeline, and a prerequisite to regularization and obtaining a best final result:

http://karpathy.github.io/2019/04/25/recipe/

The approach I like to take to finding a good model has two stages: first get a model large enough that it can overfit (i.e. focus on training loss) and then regularize it appropriately (give up some training loss to improve the validation loss)

On that front, the larger models are overfitting well, while the smallest just about completes training with no overfitting. Here you can observe overfitting in the validation losses (objectness in particular) across the 4 models. The progression of greater overfitting for larger models is observed, as expected. results

I've noticed interestingly in the switch from nn.LeakyReLU(0.1) to nn.Hardswish() that the overfitting has increased. For example here is YOLOv5l v2.0 vs v3.0. Very interestingly, val Objectness actually performs worse with the Hardswish() activations, the mAP gain must originate from better box regressions and better classifications. I do not know why this is, but I see a similar pattern in v5x, with worse val Objectness loss there also. results

Since we have a multi-component loss function, made up of 3 individual losses (box, obj, cls), we would ideally like to reduce overfitting on a loss component basis (starting with Objectness), but this is not common practice that I'm aware of. L1 and L2 regularization techniques that might help in this situation can be targeted to individual parameters by assigning parameter groupings to the optimizer, but I don't think there's a precedent for targeting loss components with different regularization techniques (anyone correct me if I'm wrong).

It may be possible to create separate optimizers for each loss component, and assign then different weight_decays, though I don't know if this would involve a hit to training speed or memory consumption.

In any case, these are my thoughts on the matter. Other things I've tried are increased augmentation, including rotation, mixup, scale and perspective, nn.Dropout2d(0.1), and increased weight_decay using the existing structure, but in all of these cases the training results (on COCO from scratch) become worse. If anyone has any ideas for Objectness regularization, or other techniques to reduce overfitting, please us know!

glenn-jocher commented 4 years ago

After re-reading Karpathy's recipe, he points out that dropout does not play nicely with batchnorm, which makes perfect sense, but had not occurred to me to consider. My previous dropout experiments were not on the final output but 1 or 2 layers prior, with the usual batchnorm and activations following, and later output layers inheriting dropouts of earlier layers. I think I will try an experiment with direct dropout on each of the 3 outputs, which do not pass through batchnorm layers, and do not accumulate. I'll try this with nn.Dropout2d(0.1) on YOLOv5l initially.

EDIT: This would be on the inputs to Detect(), prior to the output nn.Conv2d()

hal-314 commented 4 years ago

Hi @glenn-jocher . You are doing a great job! I'll try to help. Firstly, I think you should pin this question on the top of the issues. So, it has more visibility.

TL:TR Add DropPath and DropBlock except in the last layer + add more augmentations with albumentations + ¿try to reduce Objectness loss contribution? + use SWA (not strictly for overfitting but helps generalization).

I'm not an expert in object detection but dropout doesn't play nicely with ConvNets when is applied in between, at least, for classification problems. Dropout should be only use on the last linear layer. You use DropPath or DropBlock) instead of dropout to add regularization in the middle of the network. Rwightman repository has both of them implemented. He uses to train all EfficientNet net styles.

On the other hand, as data augmentation, you could as change image lightning and contrast as it's done by fastai and Rwightman image model repository. Albumentations has a lot of them an more. You could try CutMix instead of MixUP. I though MixUP wasn't the best option for object detectors, at least, to train the backbone.

One last option it is trying to synchronize overfitting between losses. If you reduce the contribution of Objectness loss to the total loss, you may delay it's overfitting.

Finally, non related to regularization, you could try SWA (stochastic gradient averaging) introduced by Pytorch in 1.6. They just released a great post about it. I don't know how to applies into object detection but it could be worthy to try.

EDIT: you may take a look for Kaggle Global Wheat Competition. YoloV5 was popular there before was banned from the winning competition due to the GPL license.

glenn-jocher commented 4 years ago

@hal-314 thank you for the comments! I've implemented dropout like this, on the Detect() layer inputs:

        self.drop = nn.Dropout2d(0.1)

    def forward(self, x):
        # x = x.copy()  # for profiling
        z = []  # inference output
        self.training |= self.export
        for i in range(self.nl):
            x[i] = self.m[i](self.drop(x[i]))  # conv

I did try cutmix as well, but saw a small mAP drop, though it's infinitely tuneable so its possible I adopted poor settings for it.

Finetuning is very different than training from scratch though, and does seem to benefit from nearly all of these changes (mixup, scale jitter, dropout etc).

I've also observed the same in regards to loss components, that reductions in a loss component gain, such as hyp['obj'] helps to reduce overfitting on that particular component.

SWA (now officially supported in torch 1.6) looks very interesting. We already use EMA (officially supported in TF), which performs a nearly identical function, though in a different manner. I think it's definitely worth trying to swap it for SWA to see the effect.

Yes, the Kaggle competition really brought YOLOv5 into the limelight. A lot of people were surprised that it outperformed EfficientDet. A primary factor is probably image size. Here we train and test at 640, which is far, far below what the larger EfficientDet models train at, 1200-1500 pixels. So I think in trying to finetune a D6, D7 etc model you will only get optimal results at these inflated image sizes, which a lot of the competitors were probably reticent to try due to speed/memory constraints. In the end our main goal is not to make the most accurate architecture though, our main goal is to strike the best compromise that would make this the most usable (and user friendly :) architecture for most tasks.

There is a great ablation study on YOLOv5 wheat detection here: https://www.kaggle.com/c/global-wheat-detection/discussion/172436

hal-314 commented 4 years ago

@glenn-jocher I think that using this dropout like this is fine in the Detect. However, you may be using too little for the bigger networks. For reference, mobilenetV2 uses a dropout of 0.2 in the head. EfficientNet and EfficientNet-Lite family uses 0.2 in the smaller nets + DropPath in all the residual blocks. Here are the training commands for EfficientNet family that achieve SOTA results (DropPath is called drop-connect). As you can see in the this rwightman implementation, it's very easy to implement DropPath. Be careful, I think that dropPath probability in EfficientNets isn't constant.

Finally, other new nets like TResNet doesn't use any type of dropout but the employ a strong data augmentation + imagenet has more images than coco.

glenn-jocher commented 4 years ago

@hal-314 that's really interesting, I did not know about drop-path. Does this just zero some of the residual channels, i.e. is it the same as nn.Dropout2d() on the bottleneck residuals?

I just realized yesteray also (after considering the dropout impact on BN), that our default training regime is very different than the default validation regime:

training: 640x640 mosaics
validation: rectangular batch with 640 on the long side

I tried to modify the mosaic shape to also apply to random rectangle batches with 640 on their long side. The code to do this is here, it acts once per epoch: https://github.com/ultralytics/yolov5/blob/5e0b90de8f7782b3803fa2886bb824c2336358d0/train.py#L231-L234

In general though, it may make sense to do a BN pass on the validation set, or on the training set in --rect mode for construction of BN statistics after training is complete. I believe SWA includes this final step as well. This may help BN better match the validation space rather than the training space.

hal-314 commented 4 years ago

@glenn-jocher DropPath removes the bottleneck (it becomes an identity) with probability of p for each sample in the batch. So, you are reducing the network depth for the sample i. For example, for a batch size of N, a bottleneck layer will be a identity for p*N samples in the batch.

About mosaic transform, I don't know very much about the technique. I saw some images but I didn't pay attention. BN mismatch is a problem as said in Fixing the train-test resolution discrepancy. The problem was the different image sizes, mainly after the global pooling. So, I see several options to solve the domain shift:

Train as usual. Then, retrain again 5-10 epochs more with --rect option with lower LR to finetune the network weights to the real bn stats. As you have overfitting during the first train, you may stop when it starts. Then, train again. -> ideal for a quick test. If mAP improves, we are doing right :)
Train as usual but the lasts epochs (¿5-10, 0.1 of the total?) use --rect option to match bn statistics with real statistics and finetune the network weights. -> ideal for a quick test.
Do not make mosaic image size square and constant. Instead, try to match with image aspect ratio from one of the 4 random images that it's inside the mosaic. So, bn will learn about it. However, make this configurable by passing a target aspect ratio. If it's None, it's selected from mosaic images. <- I think it solves the problem from the beginning.

I think that only updating the BN with the training images isn't as optimal as finetuning. If with that simple trick, mAP increase, it's a good sign. I wouldn't use validation images as you should never use it for anything more than validation.

glenn-jocher commented 4 years ago

@hal-314 I just finished a run of 7 YOLOv5l trainings. 1 was a baseline, and 6 others tried various overfitting suppressions:

obj from 1.0 to 0.5
scale from 0.5 to 0.8
Hardswish() in CSP()
momentum from 0.937 to 0.90
CSP cv2 Leaky(0.1), cv3 Hardswish()
nn.Dropout2d(0.1)

Figure_1

Unfortunately none of my experiments produced higher best.pt mAPs, but two of them provided extremely interesting results (below).

First, changing the scale hyp from 0.5 to 0.8 resulted in much less overfitting across all losses (orange below). This seems like a huge win, except that mAP did not benefit as a result. I don't know the cause, as typically lower val losses correlate with higher mAPs. This is very confusing.

The second great conclusion is that reducing momentum from 0.937 to 0.90 (green line) helps in early training but causes earlier overfitting and lower final mAP. This is a negative result, but we can visualize the implied momentum gradient here in our heads and see that increasing the momentum hyp may result in the opposite effect. I've started two new runs at 0.95 and 0.97 momentum to test this new theory.
results

hal-314 commented 4 years ago

@glenn-jocher I agree with you with momentum but you may need to train longer with scale hyp set to 0.8 to be able to overfit.

I'm also confused with scale results. It may be to lower precision or failing in some sort of objects size (¿small or large?) while improving in general. Reading utils.datasets.py, I don't understand how you are doing zoom out (scale < 1) and assign pixels and bounding boxes outside the original image pixels*. You could easily do this by creating a mosaic image bigger than the target size and then resize it to the target size.

Seeing the improvement with scale, I would suggest to try more data augmentation. Try to use some of the Albumentation transform as I said before.

Finally, it's interesting that the dropout at the end doesn't matters with the overfitting and it's only penalizing the net. I don't understand this result. If you have time, it'd be interesting to see if using dropPath in the backbone makes any difference.

Reading closely the random_perspective function, I saw that you scale here R[:2] = cv2.getRotationMatrix2D(angle=a, center=(0, 0), scale=s) and add a border padding here:
```
     if perspective:
        img = cv2.warpPerspective(img, M, dsize=(width, height), borderValue=(114, 114, 114))
    else:  # affine
        img = cv2.warpAffine(img, M[:2], dsize=(width, height), borderValue=(114, 114, 114))
```
Doing so much zoom out could hurt as you are introducing a lot of constant padding. I think that it could hurt network performance. Usually, it's better using padding="reflection" in torchvision transforms instead of "padding="border". You could easy fake a "padding=reflection" when using mosaic. I would make a mosaic bigger than the target size and then resize to the wanted size. So, you avoid to add so much padding.

drop -> interesante que solo afecte en objectness y que reducza el error de classificacion y GioU. Probar más data augmentation. Claramente, con el zoom mejora.

hal-314 commented 4 years ago

I just read PP Yolo paper. It based on YOLOv3 but with ResNet backbone + multiple easy tricks not related to data augmentation. For example, DropBlock to avoid overfitting. It's faster than YoloV4 (and yoloV5) at the same accuracy. So, you may want to try DropBlock or DropPath to avoid overfitting.

I think that YoloV5 applies some of the tricks but not all of them. YoloV5 could greatly benefit from the others :)

WongKinYiu commented 4 years ago

Maybe larger parameter of translation?

Figure from EfficientDet.

glenn-jocher commented 4 years ago

Yes, higher scale jitter seems to help, though it helps when finetuning more than when training from scratch.

WongKinYiu commented 3 years ago

larger scale jittering works well.

github-actions[bot] commented 3 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

glenn-jocher commented 3 years ago

@hal-314 see PR #3882 for a proposed automatic Albumentations integration.

glenn-jocher commented 3 years ago

@hal-314 good news 😃! Your original issue may now be fixed ✅ in PR #3882. This PR implements a YOLOv5 🚀 + Albumentations integration. The integration will automatically apply Albumentations transforms during YOLOv5 training if albumentations>=1.0.0 is installed in your environment.

Get Started

To use albumentations simply pip install -U albumentations and then update the augmentation pipeline as you see fit in the Albumentations class in yolov5/utils/augmentations.py. Note these Albumentations operations run in addition to the YOLOv5 hyperparameter augmentations, i.e. defined in hyp.scratch.yaml.

class Albumentations:
    # YOLOv5 Albumentations class (optional, used if package is installed)
    def __init__(self):
        self.transform = None
        try:
            import albumentations as A
            check_version(A.__version__, '1.0.0')  # version requirement

            self.transform = A.Compose([
                A.Blur(p=0.1),
                A.MedianBlur(p=0.1),
                A.ToGray(p=0.01)],
                bbox_params=A.BboxParams(format='yolo', label_fields=['class_labels']))

            logging.info(colorstr('albumentations: ') + ', '.join(f'{x}' for x in self.transform.transforms))
        except ImportError:  # package not installed, skip
            pass
        except Exception as e:
            logging.info(colorstr('albumentations: ') + f'{e}')

    def __call__(self, im, labels, p=1.0):
        if self.transform and random.random() < p:
            new = self.transform(image=im, bboxes=labels[:, 1:], class_labels=labels[:, 0])  # transformed
            im, labels = new['image'], np.array([[c, *b] for c, b in zip(new['class_labels'], new['bboxes'])])
        return im, labels

Example Result

Example train_batch0.jpg on COCO128 dataset with Blur, MedianBlur and ToGray. See the YOLOv5 Notebooks to reproduce:

train_batch0

Update

To receive this YOLOv5 update:

Git – git pull from within your yolov5/ directory or git clone https://github.com/ultralytics/yolov5 again
PyTorch Hub – Force-reload with model = torch.hub.load('ultralytics/yolov5', 'yolov5s', force_reload=True)
Notebooks – View updated notebooks
Docker – sudo docker pull ultralytics/yolov5:latest to update your image

Thank you for spotting this issue and informing us of the problem. Please let us know if this update resolves the issue for you, and feel free to inform us of any other issues you discover or feature requests that come to mind. Happy trainings with YOLOv5 🚀!

fcakyon commented 2 years ago

@glenn-jocher is there any dropblock or droppath regularization utilized in the current yolov5 implementation?

glenn-jocher commented 2 years ago

@fcakyon no

GMN23362 commented 2 years ago

@fcakyon no

Have you tried to use dropblock before? Does it perform well? Hope to see your reply!

saifmassoudsaif commented 1 year ago

@glenn-jocher Please give me how to plot all models in the same figures.

glenn-jocher commented 1 year ago

@saifmassoudsaif to plot all models in the same figure, you can use the matplotlib library in Python. Here's a code snippet that demonstrates how to achieve this:

import matplotlib.pyplot as plt

# Define your models and their corresponding data
# For example:
models = ['YOLOv5s', 'YOLOv5m', 'YOLOv5l', 'YOLOv5x']
losses = [0.1, 0.3, 0.2, 0.15]

# Create a figure and axis
fig, ax = plt.subplots()

# Plot the models and losses
ax.plot(models, losses)

# Set labels and title
ax.set_xlabel('Models')
ax.set_ylabel('Loss')
ax.set_title('Losses for YOLOv5 models')

# Show the plot
plt.show()

This code will create a figure with the YOLOv5 models on the x-axis and the corresponding losses on the y-axis. You can customize the plot as per your requirements. Hope this helps!

ultralytics / yolov5

Reducing YOLOv5 Overfitting on COCO from scratch #746

🚀 Feature

Get Started

Example Result

Update