ultralytics / yolov5

YOLOv5 🚀 in PyTorch > ONNX > CoreML > TFLite
https://docs.ultralytics.com
GNU Affero General Public License v3.0

How to pause training and restart training #911

Closed HIAI2019 closed 4 years ago

HIAI2019 commented 4 years ago

❔Question

How to pause training and restart training

Additional context

github-actions[bot] commented 4 years ago

Hello @HIAI2019, thank you for your interest in our work! Please visit our Custom Training Tutorial to get started, and see our Jupyter Notebook (Open In Colab), Docker Image, and Google Cloud Quickstart Guide for example environments.

If this is a bug report, please provide screenshots and minimum viable code to reproduce your issue, otherwise we cannot help you.

If this is a custom model or data training question, please note Ultralytics does not provide free personal support. As a leader in vision ML and AI, we do offer professional consulting, from simple expert advice up to delivery of fully customized, end-to-end production solutions for our clients.

For more information please visit https://www.ultralytics.com.

glenn-jocher commented 4 years ago

python train.py --resume

hmoravec commented 4 years ago

@glenn-jocher And is the --resume option already working as expected? In this two-month-old comment you strongly advise not to use it.

glenn-jocher commented 4 years ago

@hmoravec --resume is fully functional now; it has graduated into an officially supported feature. You can use it by itself with no arguments, or point it to a last.pt to resume from:

python train.py --resume  # resume from most recent last.pt
python train.py --resume runs/exp0/weights/last.pt  # resume from specific weights

hmoravec commented 4 years ago

@glenn-jocher Great, thanks.

My training crashed during model saving after epoch 33 because of an HDD issue. I had a model saved after epoch 30, so I resumed the training from that checkpoint (in fact I replaced last.pt with it because I was not aware it is possible to pass a path to the model, but I suppose it does not matter).

But I would expect that the metrics for epochs 31 and 32 would match their values from the first run before the restart, which did not happen. Is this expected? [image]

glenn-jocher commented 4 years ago

@hmoravec not sure what route you used, but the intended workflow is:

  1. You train any model with any arguments
  2. Your training stops prematurely for any reason
  3. python train.py --resume resumes from the most recent last.pt, automatically including all arguments associated with step 1.

No arguments should be passed other than --resume or --resume path/to/last.pt, and no moving or renaming of checkpoints is required.
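
For context, each training run also saves its arguments (to an opt.yaml alongside the run's weights in recent versions), which is how resuming can restore them automatically. A minimal sketch of that bookkeeping, not the actual train.py code, with the file layout assumed:

import argparse
from pathlib import Path

import yaml  # PyYAML, already a YOLOv5 dependency

def load_resume_opt(ckpt_path):
    # Rebuild the original training arguments for a given last.pt checkpoint,
    # e.g. ckpt_path = 'runs/exp0/weights/last.pt' -> reads 'runs/exp0/opt.yaml'
    ckpt = Path(ckpt_path)
    with open(ckpt.parent.parent / 'opt.yaml') as f:
        opt = argparse.Namespace(**yaml.safe_load(f))
    opt.weights, opt.resume = str(ckpt), True  # point training back at the checkpoint
    return opt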

hmoravec commented 4 years ago

I see, the problem was that I passed also all arguments from 1. step with --resume.

github-actions[bot] commented 4 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

imabhijit commented 4 years ago

Hello, Is it better to train a model for a small number of epochs at a time (~100), and for every new run use the best weight from the previous experiment? Or to set a really high number of epochs (~1000's) and pause/resume the training?

Would the two options come to the same thing?

Thanks!

glenn-jocher commented 4 years ago

@imabhijit start with default settings and proceed from there.

wwdok commented 4 years ago

@hmoravec not sure what route you used, but the intended workflow is:

  1. You train any model with any arguments
  2. Your training stops prematurely for any reason
  3. python train.py --resume resumes from the most recent last.pt, automatically including all arguments associated with step 1.

No arguments should be passed other than --resume or --resume path/to/last.pt, and no moving or renaming of checkpoints is required.

Hi @glenn-jocher, my training command is python train.py --data data/smoke.yaml --cfg models/yolov5x.yaml --weights weights/yolov5x.pt --batch-size 10 --epochs 100 --hyp hyp.custom1.yaml. After training 100 epochs, I got the latest result in the runs/exp11 folder. If I want to resume the training, can I type python train.py --epochs 20 --resume instead of just python train.py --resume? I think if I don't specify a new epoch number, it will train 100 epochs again (which I think is too much). Am I right?

glenn-jocher commented 4 years ago

@wwdok if your training completed successfully then there is nothing to resume.

You can point to any of your trained weights to use as initial weights in a new run.
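
For example, with the paths from your run it would look something like this (a sketch):

python train.py --data data/smoke.yaml --cfg models/yolov5x.yaml --weights runs/exp11/weights/best.pt --batch-size 10 --epochs 30 --hyp hyp.custom1.yaml  # new run initialized from best.pt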

wwdok commented 4 years ago

@glenn-jocher Yes, I just realized what you said. Now I run python train.py --data data/smoke.yaml --cfg models/yolov5x.yaml --weights runs/exp11/weights/best.pt --batch-size 10 --epochs 30 --hyp hyp.custom1.yaml, and it is indeed continuing training from the results of the 100th epoch.

faldisulistiawan commented 3 years ago

@glenn-jocher Hello, I have a quick question. Can I stop my training prematurely (by pressing CTRL+C in the terminal) and then continue the training using python train.py --resume?

glenn-jocher commented 3 years ago

@faldisulistiawan resuming an interrupted run is simple. You have two options available:

python train.py --resume  # resume latest training
python train.py --resume path/to/last.pt  # specify resume checkpoint

If you started the training multi-GPU then you must continue with the same exact configuration (and vice versa). The equivalent commands are here, assuming you are using 8 GPUs:

python -m torch.distributed.launch --nproc_per_node 8 train.py --resume  # resume latest training
python -m torch.distributed.launch --nproc_per_node 8 train.py --resume path/to/last.pt  # specify resume checkpoint

Note that you may not change settings when resuming, you can only resume with the same exact settings (--epochs, --batch etc).

faldisulistiawan commented 3 years ago

@glenn-jocher I see. Thank you so much!

GeorgeVJose commented 2 years ago

Hi, I have trained a model on a dataset and had to interrupt the training because the model was not getting any better. Now I need to use this trained model (last.pt) as weights for another training with a completely new dataset and configuration. But when I try to do this, it resumes the training from the previously interrupted epoch, even without specifying --resume. How should I proceed with this?

glenn-jocher commented 2 years ago

@GeorgeVJose 👋 Hello! Thanks for asking about resuming training. YOLOv5 🚀 Learning Rate (LR) schedulers follow predefined LR curves for the fixed number of --epochs defined at training start (default=300), and are designed to fall to a minimum LR on the final epoch for best training results. For this reason you cannot modify the number of epochs once training has started.

[LR Curves figure]
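
As a rough illustration (a sketch, not the actual scheduler code; YOLOv5 builds a LambdaLR from hyperparameters such as lrf), the cosine LR factor only reaches its minimum on the final planned epoch:

import math

def one_cycle_lf(lrf=0.1, epochs=300):
    # LR multiplier that decays from 1.0 at epoch 0 to lrf at the last planned epoch
    return lambda e: ((1 - math.cos(e * math.pi / epochs)) / 2) * (lrf - 1) + 1

lf = one_cycle_lf(lrf=0.1, epochs=300)
print(lf(0), lf(150), lf(300))  # 1.0 -> 0.55 -> ~0.1; changing --epochs mid-run would shift this curve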

If your training was interrupted for any reason you may continue where you left off using the --resume argument. If your training fully completed, you can start a new training from any model using the --weights argument. Examples:

Resume Single-GPU

You may not change settings when resuming, and no additional arguments other than --resume should be passed, with an optional path to the checkpoint you'd like to resume from. If no checkpoint is passed the most recently updated last.pt in your yolov5/ directory is automatically found and used:

python train.py --resume  # automatically find latest checkpoint (searches yolov5/ directory)
python train.py --resume path/to/last.pt  # specify resume checkpoint
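
The automatic lookup is essentially a recursive search for the newest last*.pt under the working directory. A minimal sketch of the idea (YOLOv5 has a small helper that does something very similar):

import glob
import os

def find_latest_last_pt(search_dir='.'):
    # Return the most recently created last*.pt under search_dir, or '' if none exist
    candidates = glob.glob(f'{search_dir}/**/last*.pt', recursive=True)
    return max(candidates, key=os.path.getctime) if candidates else ''

print(find_latest_last_pt())  # e.g. ./runs/exp0/weights/last.pt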

Resume Multi-GPU

Multi-GPU DDP trainings must be resumed with the same GPUs and DDP command, i.e. assuming 8 GPUs:

python -m torch.distributed.run --nproc_per_node 8 train.py --resume  # resume latest checkpoint
python -m torch.distributed.run --nproc_per_node 8 train.py --resume path/to/last.pt  # specify resume checkpoint

Start from Pretrained

If you would like to start training from a fully trained model, use the --weights argument, not the --resume argument:

python train.py --weights path/to/best.pt  # start from pretrained model

Good luck 🍀 and let us know if you have any other questions!

glenn-jocher commented 2 years ago

@GeorgeVJose 👋 Hello! Thanks for asking about training checkpoints. YOLOv5 🚀 checkpoints should be about 4X the size of final trained checkpoints, as they carry not just an FP16 model, but an FP16 EMA and an FP32 optimizer of the same size as the model (each model parameter has its own FP32 gradient saved within the optimizer). Final checkpoints contain only an FP16 model, with the EMA and optimizer both stripped after the final epoch of training: https://github.com/ultralytics/yolov5/blob/b4a29b5a8d63a8c2d4a8929942b44e8969c5dddd/train.py#L423-L425
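
You can see this composition for yourself by loading a checkpoint and inspecting its keys (a quick sketch; the path is hypothetical and loading requires the yolov5 repo on your PYTHONPATH):

import torch

ckpt = torch.load('runs/exp0/weights/last.pt', map_location='cpu')
print(list(ckpt.keys()))  # a training checkpoint typically carries 'model', 'ema', 'optimizer', 'epoch', ...
# The optimizer state holds FP32 buffers for every model parameter, which is most of the extra file size.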

The strip_optimizer() function updates the checkpoint dictionary value for model, replacing it with ema, and sets ema and optimizer keys to None, which will reduce the checkpoint size by 3/4. https://github.com/ultralytics/yolov5/blob/b4a29b5a8d63a8c2d4a8929942b44e8969c5dddd/utils/general.py#L741-L755
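
In simplified form, the stripping step works roughly like this (a sketch of the linked function, not a drop-in replacement):

import torch

def strip_optimizer_sketch(path='last.pt'):
    # Replace the model with its EMA, drop training state, and save FP16 weights in place
    ckpt = torch.load(path, map_location='cpu')
    if ckpt.get('ema'):
        ckpt['model'] = ckpt['ema']  # keep the EMA weights as the final model
    for k in ('optimizer', 'ema', 'updates', 'best_fitness'):
        ckpt[k] = None  # remove the training state that accounts for ~3/4 of the size
    ckpt['epoch'] = -1
    ckpt['model'].half()  # store weights in FP16
    for p in ckpt['model'].parameters():
        p.requires_grad = False
    torch.save(ckpt, path)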

Final trained checkpoints after strip_optimizer() should match the README table sizes below.

[YOLOv5 models table]

You can also run strip_optimizer() manually on any checkpoint to convert it into finalized weights ready to train a new model:

from utils.general import strip_optimizer

strip_optimizer('path/to/best.pt')
!python train.py --weights path/to/best.pt  # use best.pt to train a new model

Good luck 🍀 and let us know if you have any other questions!

GeorgeVJose commented 2 years ago

Hi @glenn-jocher !! I went through the code and thought the same thing. The training works perfectly now. Thank you for the reply.

quitmeyer commented 1 month ago

Question about resuming (EDIT UPDATE: I'm using Yolo11 and realized I'm in a Yolo5 place)

I trained a model for like 60 epochs, then had to quit it.

My code looked like this for training

yamlPath = r"C:\Users\andre\Documents\GitHub\Mothbox\AI\mothbox_training.yaml"
results = model.train(data=yamlPath, epochs=100, imgsz=1400, batch=2, device='cuda')  # lowering batch size because GPU ran out of memory, default 16

So then I went to resume the training, using either of these options:

python train.py --resume  # resume latest training
python train.py --resume path/to/last.pt  # specify resume checkpoint

But when it started with either option, it started from epoch 1 again and went on to do 100.

Is that normal? I had thought it would start at 60 and continue to the original 100, but maybe it will just run however many epochs it says in the Python script? So if I wanted to only train 100 total, I should have just changed my script before resuming to do 40?

pderrenger commented 1 week ago

@quitmeyer it seems like you're using YOLOv5 commands in a YOLOv11 context, which might be causing the issue. In YOLOv5, using --resume should continue from the last saved epoch. Ensure you're using the correct version and commands for your specific YOLO model. If you're using YOLOv5, the --resume command should work as expected, resuming from the last epoch. Adjust the total epochs in your script to account for the epochs already completed.

quitmeyer commented 1 week ago

If you're using YOLOv5, the --resume command should work as expected, resuming from the last epoch. Adjust the total epochs in your script to account for the epochs already completed

For this second part, did you mean YOLOv11? And if so, can you confirm that for YOLOv11 the --resume command works and I just need to adjust the total epochs? (i.e. if I want 100 epochs, already ran 20, stopped it, now run with 80 and --resume)

pderrenger commented 1 week ago

I can only provide guidance on YOLOv5, where the --resume command continues from the last saved epoch. For YOLOv11, please refer to the specific documentation or support channels for that version.

quitmeyer commented 1 week ago

Thanks!

pderrenger commented 1 week ago

@quitmeyer you're welcome! If you have any more questions or need further assistance, feel free to ask.