ultralytics / yolov3

YOLOv3 in PyTorch > ONNX > CoreML > TFLite
https://docs.ultralytics.com
GNU Affero General Public License v3.0
10.16k stars 3.44k forks source link

Transfer learning cannot be resumed? #451

Closed JiahongXue closed 4 years ago

JiahongXue commented 5 years ago

Describe the bug Wasn't entirely sure if this is a bug or just yolo feature. together with transfer flag, the resume flag will not actually resume the process. I did take look at the code and I described the code in the last section.

To Reproduce Steps to reproduce the behavior:

  1. have some training on transfer learning with --transfer flag
  2. Ctrl+C
  3. resume the training by the exactly same command plus --resume flag
  4. starts from epoch number 1.

Expected behavior To resume training from latest.pt

Additional context I looked at the train.py (line 98 to 108), the code specifically did treat if the transfer flag is on, the model will retrain from epoch 1. otherwise will retrain from latest.py. Any chance we can get resume on transfer learning as well?

What I did make it seems working So I copied line 107 chkpt = torch.load(latest, map_location=device) # load checkpoint and replaced into line 98 chkpt = torch.load(weights + 'yolov3-spp.pt', map_location=device) It looks like it is doing transfer learning and start from latest.pt. However, I am not sure if this is doing actually what I expected it to do. Any idea?

glenn-jocher commented 5 years ago

@JiahongXue if you are resuming from a previous checkpoint then there is no transfer learning involved, you are merely resuming training on a previous checkpoint. So to get the expected behavior you would use --resume without --transfer.

glenn-jocher commented 5 years ago

@JiahongXue ah but wait, then this will unfreeze all the layers. Hmm, well if you want to keep the layers frozen while resuming then yes there is no off the shelf option for that.

On the other hand, transfer learning is a but of a buzzword that produces suboptimal models quickly. If you want good results you should simply train as normal from scratch, or from a backbone like darknet53.conv.74, which is the default action when starting training.

JiahongXue commented 5 years ago

Got it, Thanks for the response!

glenn-jocher commented 4 years ago

@JiahongXue transfer learning can be resumed now. For example just specify 'last.pt' as your weights and keep using the --transfer argument, then it will load up your last weights, freeze all the transferred layers, and continue training as usual.

glenn-jocher commented 4 years ago

@JiahongXue to clarify, to start transfer learning: python3 train.py --weights weights/yolov3-spp.pt --transfer

To resume: python3 train.py --weights weights/last.pt --transfer