@vivian-wong first of all, congratulations on your analysis; you've done a great job of investigating the performance here in different settings.
I think you've inadvertently discovered the same result I've seen earlier: that transfer learning produces worse results than normal training. What you are seeing in your second set of plots is that all the layers that were frozen before become unfrozen, and are optimized just as in regular training, when you manually --resume at epoch 273.
There are no LR issues, the LR is behaving as expected in both cases, reducing by 10x at epochs 218 and 245, which roughly correspond to 400k and 500k batches in darknet. This LR scheduler is set for COCO, you probably want to tune it to your custom dataset.
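For reference, that COCO schedule corresponds to a standard step-decay scheduler. Below is a minimal PyTorch sketch assuming a MultiStepLR; the actual scheduler construction in train.py may differ in its base LR and other hyperparameters:

import torch.nn as nn
from torch import optim

model = nn.Linear(10, 10)  # placeholder for the YOLOv3 model
optimizer = optim.SGD(model.parameters(), lr=1e-3, momentum=0.9)
# LR drops by 10x at epochs 218 and 245 (~400k and 500k darknet batches)
scheduler = optim.lr_scheduler.MultiStepLR(optimizer, milestones=[218, 245], gamma=0.1)

for epoch in range(273):
    # ... train one epoch ...
    scheduler.step()  # lr: 1e-3, then 1e-4 after epoch 218, then 1e-5 after epoch 245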
If past experience is correct, you will obtain the best results by simply training normally, and not messing around with transfer learning. By definition, when you freeze layers, they can not adapt to new data, so transfer learning will never produce results as accurate as normal learning.
@vivian-wong I investigated further, with fascinating results. I ran our coco_100img.data tutorial with and without transfer learning. When running with transfer learning, the wh losses diverged, so I reduced the wh loss multiplier from 4 to 1 and clamped the prediction outputs to a range of -4 to 4:
lwh += (k * 1) * MSE(pi[..., 2:4].clamp(min=-4, max=4), twh[i]) # wh yolo loss
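To make the effect of the clamp concrete, here is a small self-contained sketch (hypothetical tensors, not the actual compute_loss code) showing how bounding the raw wh predictions keeps the MSE term from being dominated by extreme outputs:

import torch

MSE = torch.nn.MSELoss()
pred_wh = torch.tensor([[0.5, -12.0], [9.0, 1.2]])   # raw wh predictions, some far out of range
target_wh = torch.tensor([[0.4, -0.3], [0.7, 1.0]])

print(MSE(pred_wh, target_wh))                        # large loss dominated by the outliers
print(MSE(pred_wh.clamp(min=-4, max=4), target_wh))   # bounded loss after clamping to [-4, 4]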
The run now converged (transfer learning from yolov3-spp.pt), and I saved the results as results2_100img_tl. Of course, since I modified the loss function, I reran the original tutorial using the modified loss (as results2_100img) and plotted the two results against our original tutorial result, results_100img. The two conclusions are:
The same plot zoomed in to the first 50 epochs (you can see here that transfer learning, in orange, does initially perform better).
To recreate these results simply modify the loss as above and run:
git pull # Update to latest
rm results*.txt # WARNING: removes old results
python3 train.py --nosave --data data/coco_100img.data --transfer && mv results.txt results2_100img_tl.txt
python3 train.py --nosave --data data/coco_100img.data && mv results.txt results2_100img.txt
python3 -c "from utils import utils; utils.plot_results()"
@glenn-jocher Thank you for this clear explanation, but may I ask what the advantage of using transfer learning over training from scratch is? Previously, I thought transfer learning also helped to improve the mAP.
@rlgalvez transfer learning should get you decent results quickly, as it only needs to develop gradients for the unfrozen layers, so training requires fewer resources (i.e. it may be feasible on an edge device).
For training, you will always get the best results by training all layers normally. See the custom training tutorials.
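As a rough illustration of what freezing means in PyTorch (a generic sketch, not the actual --transfer code in train.py, which selects the YOLO output layers by name):

import torch.nn as nn
from torch import optim

model = nn.Sequential(nn.Conv2d(3, 32, 3), nn.ReLU(), nn.Conv2d(32, 255, 1))  # stand-in for YOLOv3

for p in model.parameters():        # freeze the pretrained layers
    p.requires_grad = False
for p in model[-1].parameters():    # leave only the output layer trainable
    p.requires_grad = True

# only the unfrozen parameters receive gradients, so backprop and optimizer state are much cheaper
optimizer = optim.SGD((p for p in model.parameters() if p.requires_grad), lr=1e-3)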
This shows the coco_64img.data tutorial starting from a few different options, including transfer learning. Transfer learning as shown below typically freezes the main pretrained weights, which constrains its performance. You can replicate these results by running the code below and looking at the resulting results.png file.
python3 train.py --data data/coco_64img.data --batch-size 16 --accumulate 1 --nosave --weights weights/yolov3-spp.weights --transfer --name yolov3-spp_transfer # TRANSFER LEARNING COMPARISON
python3 train.py --data data/coco_64img.data --batch-size 16 --accumulate 1 --nosave --weights '' --name from_scratch
python3 train.py --data data/coco_64img.data --batch-size 16 --accumulate 1 --nosave --weights weights/darknet53.conv.74 --name darknet53_backbone
python3 train.py --data data/coco_64img.data --batch-size 16 --accumulate 1 --nosave --weights weights/yolov3-spp.weights --name yolov3-spp_backbone
Describe the bug
When I train my custom dataset (via transfer learning) for more than 273 epochs (for example, 840 epochs), I get different results if I stop at 273 epochs and then do "--resume --epochs 840", versus if I do "--epochs 840" from the start.
This is what happens if I set epochs = 840 from the beginning of training. This is what happens if I resume after epoch 273.
To Reproduce
Steps to reproduce the behavior:
python train.py --data-cfg data/*mydata* --transfer
python train.py --data-cfg data/*mydata.data* --resume --epochs 840
python train.py --data-cfg data/*mydata* --transfer --epochs 840
Expected behavior
Theoretically both should return the same results; the second scenario should not have converged that early. My conjecture is that it has something to do with the change in learning rate.
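If that conjecture is right, a resume needs to restore the optimizer and LR scheduler state along with the weights, so the schedule continues rather than restarting. A minimal sketch, assuming a MultiStepLR like the COCO schedule above (hypothetical, not the actual train.py resume code):

from torch import nn, optim

model = nn.Linear(10, 10)  # placeholder model
optimizer = optim.SGD(model.parameters(), lr=1e-3)
scheduler = optim.lr_scheduler.MultiStepLR(optimizer, milestones=[218, 245], gamma=0.1)

# ... after training to epoch 273, save a checkpoint ...
ckpt = {'epoch': 272, 'model': model.state_dict(),
        'optimizer': optimizer.state_dict(), 'scheduler': scheduler.state_dict()}

# ... on --resume, rebuild the objects and restore their state ...
model.load_state_dict(ckpt['model'])
optimizer.load_state_dict(ckpt['optimizer'])   # momentum buffers come back too
scheduler.load_state_dict(ckpt['scheduler'])   # LR continues where it left off instead of restarting at 1e-3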