ultralytics / yolov5

YOLOv5 🚀 in PyTorch > ONNX > CoreML > TFLite
https://docs.ultralytics.com
GNU Affero General Public License v3.0

custom anchors get flushed when loading pretrain weights #459

Closed AlexWang1900 closed 4 years ago

AlexWang1900 commented 4 years ago

🐛 Bug

In train.py, the anchors set by the user in the yaml file are overwritten by the pretrained weights.


    if weights.endswith('.pt'):  # pytorch format
        ckpt = torch.load(weights, map_location=device)  # load checkpoint

        # load model
        try:
            ckpt['model'] = {k: v for k, v in ckpt['model'].float().state_dict().items()
                             if model.state_dict()[k].shape == v.shape}  # to FP32, filter
            #print(ckpt['model'].keys())
            #ckpt['model'].pop('model.27.anchors')
            #ckpt['model'].pop('model.27.anchor_grid')

            model.load_state_dict(ckpt['model'], strict=False)
        except KeyError as e:
            s = "%s is not compatible with %s. This may be due to model differences or %s may be out of date. " \
                "Please delete or update %s and try again, or use --weights '' to train from scratch." \
                % (opt.weights, opt.cfg, opt.weights, opt.weights)
            raise KeyError(s) from e

To Reproduce (REQUIRED)

Input: in ./models/yolov5x.yaml, change the anchors to any values other than the defaults.

Output: the anchors set in the yaml file are not applied.

Additional context

If the user sets more than 9 anchors in the yaml file, the bug is not triggered, because the anchors' shape no longer matches the pretrained weights' anchors.
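
A minimal sketch of why this happens (illustrative shapes, not actual YOLOv5 code): the dict comprehension in train.py only keeps checkpoint entries whose shape matches the freshly built model, so a mismatched anchors buffer is dropped, while a matching one is transferred and silently overwrites the yaml values.

    import torch

    ckpt_sd = {'model.27.anchors': torch.rand(3, 3, 2)}   # pretrained: 9 anchors, 3 per layer
    model_sd = {'model.27.anchors': torch.rand(3, 6, 2)}  # yaml with 18 anchors

    filtered = {k: v for k, v in ckpt_sd.items()
                if k in model_sd and model_sd[k].shape == v.shape}
    print(filtered)  # {} -> key dropped, yaml anchors survive; equal shapes would keep it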

github-actions[bot] commented 4 years ago

Hello @AlexWang1900, thank you for your interest in our work! Please visit our Custom Training Tutorial to get started, and see our Jupyter Notebook, Docker Image, and Google Cloud Quickstart Guide for example environments.

If this is a bug report, please provide screenshots and minimum viable code to reproduce your issue, otherwise we can not help you.

If this is a custom model or data training question, please note that Ultralytics does not provide free personal support. As a leader in vision ML and AI, we do offer professional consulting, from simple expert advice up to delivery of fully customized, end-to-end production solutions for our clients.

For more information please visit https://www.ultralytics.com.

glenn-jocher commented 4 years ago

@AlexWang1900 thanks for looking into this. AutoAnchor always runs after this region of code, but you are saying that prior to AutoAnchor running, a model.yaml's anchors may be replaced by the pretrained weights' anchors?

The anchors are attributes of the Detect() layer; they are buffers rather than parameters, so they are not assigned a gradient, but they are still transferred across devices, which is useful for loss computation and inference. I will try to reproduce your experiment to see if the yaml anchors are being overwritten.
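
A minimal sketch (not YOLOv5 code) of the buffer behavior described above: load_state_dict() restores registered buffers just like parameters, even with strict=False.

    import torch
    import torch.nn as nn

    class TinyDetect(nn.Module):  # hypothetical stand-in for Detect()
        def __init__(self, anchors):
            super().__init__()
            self.register_buffer('anchors', torch.tensor(anchors, dtype=torch.float32))

    pretrained = TinyDetect([[10, 13], [16, 30]])  # anchors stored in the checkpoint
    custom = TinyDetect([[5, 6], [7, 8]])          # anchors from the user's yaml

    custom.load_state_dict(pretrained.state_dict(), strict=False)
    print(custom.anchors)  # prints the pretrained anchors; the yaml anchors are gone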

glenn-jocher commented 4 years ago

@AlexWang1900 yes, I am able to reproduce your bug! Ok, so it seems that buffers are being transferred from the pretrained weights as well as parameters. I suppose we can simply reject any transfer with 'anchor' in the key. This raises a bigger question though: if I look at the entire list of info transferred, I also see a lot of batchnorm info, such as batches tracked:

This is the total list of modules transferred from the pretrained weights to the randomly initialized weights: 364 of 370 parameters transferred when the number of classes is different. The 3 output layers' .weight and .bias are omitted.

model.0.conv.conv.weight
model.0.conv.bn.weight
model.0.conv.bn.bias
model.0.conv.bn.running_mean
model.0.conv.bn.running_var
model.0.conv.bn.num_batches_tracked
model.1.conv.weight
model.1.bn.weight
model.1.bn.bias
model.1.bn.running_mean
model.1.bn.running_var
model.1.bn.num_batches_tracked
model.2.cv1.conv.weight
model.2.cv1.bn.weight
model.2.cv1.bn.bias
model.2.cv1.bn.running_mean
model.2.cv1.bn.running_var
model.2.cv1.bn.num_batches_tracked
model.2.cv2.weight
model.2.cv3.weight
model.2.cv4.conv.weight
model.2.cv4.bn.weight
model.2.cv4.bn.bias
model.2.cv4.bn.running_mean
model.2.cv4.bn.running_var
model.2.cv4.bn.num_batches_tracked
model.2.bn.weight
model.2.bn.bias
model.2.bn.running_mean
model.2.bn.running_var
model.2.bn.num_batches_tracked
model.2.m.0.cv1.conv.weight
model.2.m.0.cv1.bn.weight
model.2.m.0.cv1.bn.bias
model.2.m.0.cv1.bn.running_mean
model.2.m.0.cv1.bn.running_var
model.2.m.0.cv1.bn.num_batches_tracked
model.2.m.0.cv2.conv.weight
model.2.m.0.cv2.bn.weight
model.2.m.0.cv2.bn.bias
model.2.m.0.cv2.bn.running_mean
model.2.m.0.cv2.bn.running_var
model.2.m.0.cv2.bn.num_batches_tracked
model.3.conv.weight
model.3.bn.weight
model.3.bn.bias
model.3.bn.running_mean
model.3.bn.running_var
model.3.bn.num_batches_tracked
model.4.cv1.conv.weight
model.4.cv1.bn.weight
model.4.cv1.bn.bias
model.4.cv1.bn.running_mean
model.4.cv1.bn.running_var
model.4.cv1.bn.num_batches_tracked
model.4.cv2.weight
model.4.cv3.weight
model.4.cv4.conv.weight
model.4.cv4.bn.weight
model.4.cv4.bn.bias
model.4.cv4.bn.running_mean
model.4.cv4.bn.running_var
model.4.cv4.bn.num_batches_tracked
model.4.bn.weight
model.4.bn.bias
model.4.bn.running_mean
model.4.bn.running_var
model.4.bn.num_batches_tracked
model.4.m.0.cv1.conv.weight
model.4.m.0.cv1.bn.weight
model.4.m.0.cv1.bn.bias
model.4.m.0.cv1.bn.running_mean
model.4.m.0.cv1.bn.running_var
model.4.m.0.cv1.bn.num_batches_tracked
model.4.m.0.cv2.conv.weight
model.4.m.0.cv2.bn.weight
model.4.m.0.cv2.bn.bias
model.4.m.0.cv2.bn.running_mean
model.4.m.0.cv2.bn.running_var
model.4.m.0.cv2.bn.num_batches_tracked
model.4.m.1.cv1.conv.weight
model.4.m.1.cv1.bn.weight
model.4.m.1.cv1.bn.bias
model.4.m.1.cv1.bn.running_mean
model.4.m.1.cv1.bn.running_var
model.4.m.1.cv1.bn.num_batches_tracked
model.4.m.1.cv2.conv.weight
model.4.m.1.cv2.bn.weight
model.4.m.1.cv2.bn.bias
model.4.m.1.cv2.bn.running_mean
model.4.m.1.cv2.bn.running_var
model.4.m.1.cv2.bn.num_batches_tracked
model.4.m.2.cv1.conv.weight
model.4.m.2.cv1.bn.weight
model.4.m.2.cv1.bn.bias
model.4.m.2.cv1.bn.running_mean
model.4.m.2.cv1.bn.running_var
model.4.m.2.cv1.bn.num_batches_tracked
model.4.m.2.cv2.conv.weight
model.4.m.2.cv2.bn.weight
model.4.m.2.cv2.bn.bias
model.4.m.2.cv2.bn.running_mean
model.4.m.2.cv2.bn.running_var
model.4.m.2.cv2.bn.num_batches_tracked
model.5.conv.weight
model.5.bn.weight
model.5.bn.bias
model.5.bn.running_mean
model.5.bn.running_var
model.5.bn.num_batches_tracked
model.6.cv1.conv.weight
model.6.cv1.bn.weight
model.6.cv1.bn.bias
model.6.cv1.bn.running_mean
model.6.cv1.bn.running_var
model.6.cv1.bn.num_batches_tracked
model.6.cv2.weight
model.6.cv3.weight
model.6.cv4.conv.weight
model.6.cv4.bn.weight
model.6.cv4.bn.bias
model.6.cv4.bn.running_mean
model.6.cv4.bn.running_var
model.6.cv4.bn.num_batches_tracked
model.6.bn.weight
model.6.bn.bias
model.6.bn.running_mean
model.6.bn.running_var
model.6.bn.num_batches_tracked
model.6.m.0.cv1.conv.weight
model.6.m.0.cv1.bn.weight
model.6.m.0.cv1.bn.bias
model.6.m.0.cv1.bn.running_mean
model.6.m.0.cv1.bn.running_var
model.6.m.0.cv1.bn.num_batches_tracked
model.6.m.0.cv2.conv.weight
model.6.m.0.cv2.bn.weight
model.6.m.0.cv2.bn.bias
model.6.m.0.cv2.bn.running_mean
model.6.m.0.cv2.bn.running_var
model.6.m.0.cv2.bn.num_batches_tracked
model.6.m.1.cv1.conv.weight
model.6.m.1.cv1.bn.weight
model.6.m.1.cv1.bn.bias
model.6.m.1.cv1.bn.running_mean
model.6.m.1.cv1.bn.running_var
model.6.m.1.cv1.bn.num_batches_tracked
model.6.m.1.cv2.conv.weight
model.6.m.1.cv2.bn.weight
model.6.m.1.cv2.bn.bias
model.6.m.1.cv2.bn.running_mean
model.6.m.1.cv2.bn.running_var
model.6.m.1.cv2.bn.num_batches_tracked
model.6.m.2.cv1.conv.weight
model.6.m.2.cv1.bn.weight
model.6.m.2.cv1.bn.bias
model.6.m.2.cv1.bn.running_mean
model.6.m.2.cv1.bn.running_var
model.6.m.2.cv1.bn.num_batches_tracked
model.6.m.2.cv2.conv.weight
model.6.m.2.cv2.bn.weight
model.6.m.2.cv2.bn.bias
model.6.m.2.cv2.bn.running_mean
model.6.m.2.cv2.bn.running_var
model.6.m.2.cv2.bn.num_batches_tracked
model.7.conv.weight
model.7.bn.weight
model.7.bn.bias
model.7.bn.running_mean
model.7.bn.running_var
model.7.bn.num_batches_tracked
model.8.cv1.conv.weight
model.8.cv1.bn.weight
model.8.cv1.bn.bias
model.8.cv1.bn.running_mean
model.8.cv1.bn.running_var
model.8.cv1.bn.num_batches_tracked
model.8.cv2.conv.weight
model.8.cv2.bn.weight
model.8.cv2.bn.bias
model.8.cv2.bn.running_mean
model.8.cv2.bn.running_var
model.8.cv2.bn.num_batches_tracked
model.9.cv1.conv.weight
model.9.cv1.bn.weight
model.9.cv1.bn.bias
model.9.cv1.bn.running_mean
model.9.cv1.bn.running_var
model.9.cv1.bn.num_batches_tracked
model.9.cv2.weight
model.9.cv3.weight
model.9.cv4.conv.weight
model.9.cv4.bn.weight
model.9.cv4.bn.bias
model.9.cv4.bn.running_mean
model.9.cv4.bn.running_var
model.9.cv4.bn.num_batches_tracked
model.9.bn.weight
model.9.bn.bias
model.9.bn.running_mean
model.9.bn.running_var
model.9.bn.num_batches_tracked
model.9.m.0.cv1.conv.weight
model.9.m.0.cv1.bn.weight
model.9.m.0.cv1.bn.bias
model.9.m.0.cv1.bn.running_mean
model.9.m.0.cv1.bn.running_var
model.9.m.0.cv1.bn.num_batches_tracked
model.9.m.0.cv2.conv.weight
model.9.m.0.cv2.bn.weight
model.9.m.0.cv2.bn.bias
model.9.m.0.cv2.bn.running_mean
model.9.m.0.cv2.bn.running_var
model.9.m.0.cv2.bn.num_batches_tracked
model.10.conv.weight
model.10.bn.weight
model.10.bn.bias
model.10.bn.running_mean
model.10.bn.running_var
model.10.bn.num_batches_tracked
model.13.cv1.conv.weight
model.13.cv1.bn.weight
model.13.cv1.bn.bias
model.13.cv1.bn.running_mean
model.13.cv1.bn.running_var
model.13.cv1.bn.num_batches_tracked
model.13.cv2.weight
model.13.cv3.weight
model.13.cv4.conv.weight
model.13.cv4.bn.weight
model.13.cv4.bn.bias
model.13.cv4.bn.running_mean
model.13.cv4.bn.running_var
model.13.cv4.bn.num_batches_tracked
model.13.bn.weight
model.13.bn.bias
model.13.bn.running_mean
model.13.bn.running_var
model.13.bn.num_batches_tracked
model.13.m.0.cv1.conv.weight
model.13.m.0.cv1.bn.weight
model.13.m.0.cv1.bn.bias
model.13.m.0.cv1.bn.running_mean
model.13.m.0.cv1.bn.running_var
model.13.m.0.cv1.bn.num_batches_tracked
model.13.m.0.cv2.conv.weight
model.13.m.0.cv2.bn.weight
model.13.m.0.cv2.bn.bias
model.13.m.0.cv2.bn.running_mean
model.13.m.0.cv2.bn.running_var
model.13.m.0.cv2.bn.num_batches_tracked
model.14.conv.weight
model.14.bn.weight
model.14.bn.bias
model.14.bn.running_mean
model.14.bn.running_var
model.14.bn.num_batches_tracked
model.17.cv1.conv.weight
model.17.cv1.bn.weight
model.17.cv1.bn.bias
model.17.cv1.bn.running_mean
model.17.cv1.bn.running_var
model.17.cv1.bn.num_batches_tracked
model.17.cv2.weight
model.17.cv3.weight
model.17.cv4.conv.weight
model.17.cv4.bn.weight
model.17.cv4.bn.bias
model.17.cv4.bn.running_mean
model.17.cv4.bn.running_var
model.17.cv4.bn.num_batches_tracked
model.17.bn.weight
model.17.bn.bias
model.17.bn.running_mean
model.17.bn.running_var
model.17.bn.num_batches_tracked
model.17.m.0.cv1.conv.weight
model.17.m.0.cv1.bn.weight
model.17.m.0.cv1.bn.bias
model.17.m.0.cv1.bn.running_mean
model.17.m.0.cv1.bn.running_var
model.17.m.0.cv1.bn.num_batches_tracked
model.17.m.0.cv2.conv.weight
model.17.m.0.cv2.bn.weight
model.17.m.0.cv2.bn.bias
model.17.m.0.cv2.bn.running_mean
model.17.m.0.cv2.bn.running_var
model.17.m.0.cv2.bn.num_batches_tracked
model.19.conv.weight
model.19.bn.weight
model.19.bn.bias
model.19.bn.running_mean
model.19.bn.running_var
model.19.bn.num_batches_tracked
model.21.cv1.conv.weight
model.21.cv1.bn.weight
model.21.cv1.bn.bias
model.21.cv1.bn.running_mean
model.21.cv1.bn.running_var
model.21.cv1.bn.num_batches_tracked
model.21.cv2.weight
model.21.cv3.weight
model.21.cv4.conv.weight
model.21.cv4.bn.weight
model.21.cv4.bn.bias
model.21.cv4.bn.running_mean
model.21.cv4.bn.running_var
model.21.cv4.bn.num_batches_tracked
model.21.bn.weight
model.21.bn.bias
model.21.bn.running_mean
model.21.bn.running_var
model.21.bn.num_batches_tracked
model.21.m.0.cv1.conv.weight
model.21.m.0.cv1.bn.weight
model.21.m.0.cv1.bn.bias
model.21.m.0.cv1.bn.running_mean
model.21.m.0.cv1.bn.running_var
model.21.m.0.cv1.bn.num_batches_tracked
model.21.m.0.cv2.conv.weight
model.21.m.0.cv2.bn.weight
model.21.m.0.cv2.bn.bias
model.21.m.0.cv2.bn.running_mean
model.21.m.0.cv2.bn.running_var
model.21.m.0.cv2.bn.num_batches_tracked
model.23.conv.weight
model.23.bn.weight
model.23.bn.bias
model.23.bn.running_mean
model.23.bn.running_var
model.23.bn.num_batches_tracked
model.25.cv1.conv.weight
model.25.cv1.bn.weight
model.25.cv1.bn.bias
model.25.cv1.bn.running_mean
model.25.cv1.bn.running_var
model.25.cv1.bn.num_batches_tracked
model.25.cv2.weight
model.25.cv3.weight
model.25.cv4.conv.weight
model.25.cv4.bn.weight
model.25.cv4.bn.bias
model.25.cv4.bn.running_mean
model.25.cv4.bn.running_var
model.25.cv4.bn.num_batches_tracked
model.25.bn.weight
model.25.bn.bias
model.25.bn.running_mean
model.25.bn.running_var
model.25.bn.num_batches_tracked
model.25.m.0.cv1.conv.weight
model.25.m.0.cv1.bn.weight
model.25.m.0.cv1.bn.bias
model.25.m.0.cv1.bn.running_mean
model.25.m.0.cv1.bn.running_var
model.25.m.0.cv1.bn.num_batches_tracked
model.25.m.0.cv2.conv.weight
model.25.m.0.cv2.bn.weight
model.25.m.0.cv2.bn.bias
model.25.m.0.cv2.bn.running_mean
model.25.m.0.cv2.bn.running_var
model.25.m.0.cv2.bn.num_batches_tracked
model.27.anchors
model.27.anchor_grid
AlexWang1900 commented 4 years ago

@glenn-jocher you suggest omitting the bn weights? I have to test it on custom datasets. I will test it on Kaggle Global Wheat Detection to see whether it is worth omitting the pretrained bn weights.

glenn-jocher commented 4 years ago

Current implementation: https://github.com/ultralytics/yolov5/blob/1e95337f3aec4c12244802bb6e493b07b27aa795/train.py#L131-L135

Proposed fix:

        # load model
        try:
            exclude = ['anchor', 'tracked']  # exclude list
            ckpt['model'] = {k: v for k, v in ckpt['model'].float().state_dict().items()
                             if k in model.state_dict() and not any(x in k for x in exclude)
                             and model.state_dict()[k].shape == v.shape}

            model.load_state_dict(ckpt['model'], strict=False)
            print('Transferred %g/%g items from %s' % (len(ckpt['model']), len(model.state_dict()), weights))

Using current master: Transferred 364/370 items from yolov5s.pt
Adding 'anchor' to the exclude list: Transferred 362/370 items from yolov5s.pt
Adding 'anchor' and 'tracked' to the exclude list: Transferred 303/370 items from yolov5s.pt

Ok, so this fix appears to be solid. Philosophically speaking though, you might want to be careful modifying anchors significantly on pretrained weights. The AutoAnchor threshold for action is 0.99 best possible recall (BPR); anything above this and it leaves the anchors alone. If the modified anchors are significantly different from the pretrained weights' anchors, the new model may take significant training time to adjust all of the regression-related neurons to fully compensate (i.e. hundreds of epochs for smaller datasets).

glenn-jocher commented 4 years ago

Also, lastly, we need some empirical (experimental) results to guide us on whether to exclude all of the num_batches_tracked values from being transferred. I think the other 4 are beneficial to transfer:

model.25.m.0.cv1.bn.weight
model.25.m.0.cv1.bn.bias
model.25.m.0.cv1.bn.running_mean
model.25.m.0.cv1.bn.running_var
model.25.m.0.cv1.bn.num_batches_tracked  # should we start this from zero ??
AlexWang1900 commented 4 years ago

Okay!!! I will start testing from there!! I will test it on the Kaggle Global Wheat Detection dataset.

glenn-jocher commented 4 years ago

@AlexWang1900 ok sounds good. BTW, when you train on wheat detection what's the BPR you get initially (is it above 0.99?). I opened up a PR https://github.com/ultralytics/yolov5/pull/462 with the proposed fix. Not sure if I should add 'tracked' to the list, I'll wait for your experiment results.

I think if the batchnorm tracking starts from 0 it will be more noisy initially, but also more quickly adapt to the new dataset (possibly converge faster, and/or possibly overfit faster). But I really have no idea. Suggest maybe train with exclude=['anchor'] and exclude=['anchor', 'tracked'] and plot both results together to try to spot any differences.

You can overlay multiple training results together by copying both to your main yolov5/ directory and then running from utils.utils import *; plot_results(); it will plot any results*.txt files it finds in the main directory.

AlexWang1900 commented 4 years ago

For the BPR: it is above 0.99 with the default anchor_t = 4.0. If I set anchor_t = 2.0, BPR falls to about 0.9847 and AutoAnchor starts using k-means to calculate new anchors. The new anchors give lower results for mAP@0.5 and mAP@0.5-0.75 (drops of 3% and 2% on the validation set). I think this is because the new anchors are much smaller, and anchor_t = 2.0 makes far fewer proposals for positive targets. It also seems the k-means anchor calculation has an issue: the k *= s rescaling makes the anchors smaller:

    # Kmeans calculation
    from scipy.cluster.vq import kmeans
    print('Running kmeans for %g anchors on %g points...' % (n, len(wh)))
    s = wh.std(0)  # sigmas for whitening
    natural_k, dist = kmeans(wh, 5, iter=300)  # extra run on raw wh, for comparison
    print("natural:", natural_k)
    k, dist = kmeans(wh, n, iter=30)  # raw wh; was: kmeans(wh / s, n, iter=30)
    # k *= s  # un-whitening disabled along with the whitening above
    wh = torch.tensor(wh, dtype=torch.float32)  # filtered
    wh0 = torch.tensor(wh0, dtype=torch.float32)  # unfiltered
    k = print_results(k)

It is better without the whitening rescale on the validation set, but I forgot the exact numbers.
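
For reference, a self-contained sketch of the two clustering variants discussed here (synthetic box sizes, illustrative only): the upstream whiten-then-rescale path versus clustering the raw width-heights.

    import numpy as np
    from scipy.cluster.vq import kmeans

    rng = np.random.default_rng(0)
    wh = rng.lognormal(3.0, 0.6, size=(1000, 2))  # fake label width-heights
    n = 9  # number of anchors

    s = wh.std(0)                      # sigmas for whitening
    k_white, _ = kmeans(wh / s, n, iter=30)
    k_white *= s                       # upstream default: whiten, cluster, un-whiten
    k_raw, _ = kmeans(wh, n, iter=30)  # variant tested in this thread: raw wh

    print(np.sort(k_white.prod(1)))    # anchor areas, whitened variant
    print(np.sort(k_raw.prod(1)))      # anchor areas, raw variant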

So finally I calculated 9 anchors and added them alongside the original anchors, 18 anchors in total, and set anchor_t = 4.0, 3.0, 2.0; all gave a 0.1%-0.2% rise in validation mAP@0.5-0.75.

But unfortunately they lower the score on the Kaggle test set by 1%, maybe because, according to the dataset paper, the dataset was labelled using YOLOv3 with default settings and obj_score >= 0.5, then hand-corrected for mistakes.

AlexWang1900 commented 4 years ago

@glenn-jocher Hi~~ here are the astonishing results of the tests ([results plot]):

test_1: original baseline without tricks, best mAP@0.5-0.75: 0.7981 at epoch 23
test_2: omit 'tracked', best: 0.798 at epoch 20
test_3: omit 'tracked', 'running_mean', 'running_var', best: 0.7995 at epoch 22
test_4: omit 'bn', best: 0.8 at epoch 35

From the PyTorch BatchNorm code:


                if self.momentum is None:  # use cumulative moving average
                    exponential_average_factor = 1.0 / float(self.num_batches_tracked)
                else:  # use exponential moving average
                    exponential_average_factor = self.momentum

With 'tracked batches' transferred, the pretrained mean and var dominate the moving-average calculation. Take an example with tracked = 1000: (1 - 1/1000)^100 ≈ 0.905 after 100 batches, and (1 - 1/1000)^1000 ≈ 0.368 after 1000 batches, so the old bn mean and var still retain roughly 0.37 of their weight. Omitting 'tracked', exponential_average_factor = momentum = 0.1 by default, and it takes about 40 batches to wipe out the old bn means and vars.
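
A quick check of that arithmetic (using the fixed-factor approximation above):

    # weight remaining on the old bn statistics after N updates
    print((1 - 1/1000) ** 100)   # ~0.905: cumulative average, tracked=1000, 100 batches
    print((1 - 1/1000) ** 1000)  # ~0.368: after 1000 batches
    print((1 - 0.1) ** 40)       # ~0.015: momentum=0.1, old stats mostly gone in 40 batches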

But the score is higher with the whole bn removed, which I can't quite understand; I was predicting test_3 would be the best, because the bn weights are freshly initialized in test_4.

And the confidence level here is high: I have run 50+ experiments with different tricks and hyperparameters, and the 0.8 is the highest score of all.

I suggest from a user's perspective:

1) The user wants incremental transfer learning, e.g. they have some pictures at night and want to improve night performance while keeping the original YOLO performance on general scenes: it is better to keep the whole bn.
2) The user wants transfer learning for another domain and doesn't care about performance on general scenes, just like wheat detection: it is better to omit the whole bn.
3) When training is interrupted (e.g. the system reboots) and I have to load last.pt and continue: it has to keep bn.

At the end of the day, it's up to you ~~~ more tests are needed, comments welcome ~~~

glenn-jocher commented 4 years ago

@AlexWang1900 wow, thanks for running the tests! Surprisingly removing the tracked stats has almost no effect, though if you remove bn completely there is some noticeably higher instability in the early epochs. I think the highest mAPs are all essentially identical. This is wheat detection?

I'll go ahead and push the PR with just 'anchor' in the exclude list.

AlexWang1900 commented 4 years ago

Yes, this is wheat detection. From experience the highest mAPs are all within a range of 0.02, no big difference, but after running it 50 times this is the first time one reached 0.8.

glenn-jocher commented 4 years ago

@AlexWang1900 ah, so you are saying that the very best result was after removing 100% of the bn stats (all 5 values for each bn layer)?

glenn-jocher commented 4 years ago

One item in your plots: we want all 3 (here 2) val losses to bottom out around the same time. If one bottoms before the other, the peak mAP may not be as high as it could be. GIoU has a broad base (it generalizes well), obj has a much sharper base, and I've always struggled to prevent it from overtraining. On COCO, wheat, and most others, obj and cls overfit before GIoU.

AlexWang1900 commented 4 years ago

Yes, among all the tricks and hyperparameter changes, changing bn alone gets the highest validation score. Now combined with mixup it gets even higher.

glenn-jocher commented 4 years ago

@AlexWang1900 I have an idea: you could try modifying the L24 activation function in the Conv() layer from LeakyReLU(0.1) to Swish() or Mish() to see if this helps wheat training. This will change almost all activations across the entire model. I've never tried this, but it may be possible to still start from pretrained weights when you do this: https://github.com/ultralytics/yolov5/blob/5e970d45c44fff11d1eb29bfc21bed9553abf986/models/common.py#L18-L31

You'll have to reduce your batch size as these will consume much greater GPU RAM when training, and initial results may be poorer, but final mAP may be higher...
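
A hedged sketch of the two drop-in activations mentioned (standard formulations, not necessarily the exact upstream classes):

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class Swish(nn.Module):
        def forward(self, x):
            return x * torch.sigmoid(x)  # swish/SiLU: x * sigmoid(x)

    class Mish(nn.Module):
        def forward(self, x):
            return x * torch.tanh(F.softplus(x))  # mish: x * tanh(softplus(x))

    # e.g. in models/common.py Conv(), swap nn.LeakyReLU(0.1) for Swish() or Mish()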

glenn-jocher commented 4 years ago

@AlexWang1900 also, since pretrained weights are helping so much, it may make sense to freeze 100% of the transferred weights for the first few epochs before unfreezing; otherwise the pretrained layer gradients are being affected by the randomly initialized layers. An example training pipeline might be:

  1. Load pretrained weights, save keys which are successfully transferred into transferred list.
  2. Freeze parameters in transferred layers by setting x.requires_grad=False
  3. Train 1-10 epochs.
  4. if epoch == 10: Unfreeze all layers
  5. continue training
AlexWang1900 commented 4 years ago

Thanks a lot!!!! I will test it!!!!

glenn-jocher commented 4 years ago

@AlexWang1900 I should probably add an optional train.py argument for this. Maybe something like --freeze-for or --freeze-epochs, i.e. to do my above tasks:

python train.py --weights yolov5x.pt --freeze 10

TODO: Add train.py argument to freeze transferred layers for a certain number of epochs.
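
A sketch of what such an argument might look like (flag name and wiring are hypothetical; no such option existed at the time):

    # in train.py argument parsing
    parser.add_argument('--freeze-epochs', type=int, default=0,
                        help='freeze transferred layers for the first N epochs')
    # then inside the epoch loop: unfreeze once epoch == opt.freeze_epochs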

AlexWang1900 commented 4 years ago

@glenn-jocher

I have implemented a simple test version and it is now running; waiting for results. It looks like this:

    if weights.endswith('.pt'):  # pytorch format
        ckpt = torch.load(weights, map_location=device)  # load checkpoint

        # load model
        try:
            exclude = ['anchor', 'tracked', 'running_mean', 'running_var', 'bn']
            ckpt['model'] = {k: v for k, v in ckpt['model'].float().state_dict().items()
                             if k in model.state_dict() and not any(x in k for x in exclude)
                             and model.state_dict()[k].shape == v.shape}
            model.load_state_dict(ckpt['model'], strict=False)
            print('Transferred %g/%g items from %s' % (len(ckpt['model']), len(model.state_dict()), weights))
# add
            from copy import deepcopy  # import needed for the copy below
            freeze_layers = []
            for key in ckpt['model'].keys():
                freeze_layers.append(key)
            frozen_layers = deepcopy(freeze_layers)  # saved for unfreezing later
            for name, param in model.named_parameters():
                for element in freeze_layers:
                    if element in name:
                        param.requires_grad = False  # freeze transferred parameter
                        freeze_layers.remove(element)  # each key matches one parameter
                        break
# add end
        except KeyError as e:
            s = "%s is not compatible with %s. This may be due to model differences or %s may be out of date. " \
                "Please delete or update %s and try again, or use --weights '' to train from scratch." \
                % (weights, opt.cfg, weights, weights)
            raise KeyError(s) from e
    for epoch in range(start_epoch, epochs):  # epoch ------------------------------------------------------------------
        model.train()
#add 
        if epoch >= 2 and len(frozen_layers) > 0:  # unfreeze after 2 epochs
            for name, param in model.named_parameters():
                for element in frozen_layers:
                    if element in name:
                        param.requires_grad = True  # unfreeze transferred parameter
                        frozen_layers.remove(element)
                        break
#add end
glenn-jocher commented 4 years ago

@AlexWang1900 looks about right. The warmup is 3 epochs, so you might want to experiment with freezing < 3 epochs and then freezing > 3 epochs. I think you can refactor the first for loop also: freeze_layers = ckpt['model'].keys()
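
That refactor, sketched (wrapped in list() since the loop later calls .remove(), which a dict key view doesn't support; assumes the ckpt dict from the snippet above):

    freeze_layers = list(ckpt['model'].keys())  # keys that were transferred
    frozen_layers = freeze_layers.copy()        # saved for unfreezing later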

AlexWang1900 commented 4 years ago

Here are the results for freezing < 4 epochs; it seems to overfit at 4.

[results plot]

result_1: without freeze, best mAP@0.5-0.75 0.8025 at epoch 54/60.
result_2: with freeze, best mAP@0.5-0.75 0.7991 at epoch 60/60.
Both with lr cosine decay (end epochs = 80).

glenn-jocher commented 4 years ago

Wow that did not work well. My warmup strategy must be very important then. Maybe just try freezing 1 epoch and extend the warmup to 4 epochs up from 3 now. Then you get 1 epoch full frozen while still enjoying the full 3 epoch warmup effects on all layers.

BTW you can increase TTA augmentations, I’ve only used a minimum of 3 simple augmentations in the default TTA code. You can add vertical flip and extra sizes for better final mAP.

AlexWang1900 commented 4 years ago

For Swish and Mish, I am looking into some efficient implementations, like here: https://github.com/rwightman/pytorch-image-models/blob/master/timm/models/layers/activations_me.py

I have used Mish and Swish for image classification; they converge faster, but tend to overfit faster.

AlexWang1900 commented 4 years ago

I have added mixup; it seems okay now. I also found this: "RIFLE: Backpropagation in Depth for Deep Transfer Learning through Re-Initializing the Fully-connected LayEr" (https://arxiv.org/pdf/2007.03349.pdf). I may try it.

xevolesi commented 4 years ago

@AlexWang1900 , Hi! How do you get mAP@.5:.75 of 0.8 with 60 epochs on the wheat dataset? O.o I can't get such a mAP even with 210 epochs.

AlexWang1900 commented 4 years ago

You may have used mAP@0.5-0.95? Another validation set? What is your score?

xevolesi commented 4 years ago

@AlexWang1900 ,

You may have used mAP@0.5-0.95?

I modified this line https://github.com/ultralytics/yolov5/blob/4b5f4806bcd513b18171034c06364432ef2c19c2/test.py#L59 to iouv = torch.linspace(0.5, 0.75, 6).to(device). Is this okay?
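
That line yields six IoU thresholds from 0.50 to 0.75 in steps of 0.05:

    import torch
    print(torch.linspace(0.5, 0.75, 6))
    # tensor([0.5000, 0.5500, 0.6000, 0.6500, 0.7000, 0.7500])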

Another validation set?

I used a standard 5-fold stratified split from a public notebook, so I think yes, our validation sets are different.

What is your score?

My current mAP@.5:.75 for a single model is 0.73. My current LB with 5 yolov5 models and TTA: 0.7489. I don't want to use pseudo-labeling because it looks like a cheat.

I understand that I will be excluded from the competition for using yolov5, but I'm just wondering how others get such a big score. It's great to learn from others, I think.

AlexWang1900 commented 4 years ago

I think you are good at 0.7489; my best single model is 0.7488.

With some threshold tweaking and pseudo-labeling you can easily go beyond 0.77.

For the validation score I didn't change iouv; I used the first 6 results from ap[]. I also printed them all. There is a thread in this repository, issue 339; I just copied it from Glenn Jocher.

I can't see what the difference is between changing iouv and reading from ap[]. From the LB results we are close, so I think there is a bias between our validation scores.

AlexWang1900 commented 4 years ago

Pseudo-labeling has a paper, "Self-training with Noisy Student improves ImageNet classification"; it should be legit and nice.

xevolesi commented 4 years ago

Dear @AlexWang1900 , thank you!

I'll try to read the pseudo-labeling paper and use issue 339 to check the differences between the mAP calculations.

glenn-jocher commented 4 years ago

I believe this issue is resolved, removing the TODO label.