Closed: HIAI2019 closed this issue 4 years ago.
Hello @HIAI2019, thank you for your interest in our work! Please visit our Custom Training Tutorial to get started, and see our Jupyter Notebook, Docker Image, and Google Cloud Quickstart Guide for example environments.
If this is a bug report, please provide screenshots and minimum viable code to reproduce your issue, otherwise we cannot help you.
If this is a custom model or data training question, please note Ultralytics does not provide free personal support. As a leader in vision ML and AI, we do offer professional consulting, from simple expert advice up to delivery of fully customized, end-to-end production solutions for our clients.
For more information please visit https://www.ultralytics.com.
python train.py --resume
@glenn-jocher And is the --resume option already working as expected? In this two-month-old comment you strongly advised not to use it.
@hmoravec --resume is fully functional now; it has graduated into an officially supported feature. You can use it by itself with no arguments, or by pointing to a last.pt to resume from:
python train.py --resume # resume from most recent last.pt
python train.py --resume runs/exp0/weights/last.pt # resume from specific weights
@glenn-jocher Great, thanks.
My training crashed during model saving after epoch 33 because of an HDD issue. I had a model saved after epoch 30, so I resumed the training from that (in fact I replaced last.pt with it, because I was not aware it is possible to pass the path to the model, but I suppose that does not matter).
I would have expected the metrics for epochs 31 and 32 to match their values from the first run before the restart, which did not happen. Is this expected?
@hmoravec not sure what route you used, but the intended workflow is:
1. You train any model with any arguments.
2. Your training stops prematurely for any reason.
3. You run python train.py --resume to resume from the most recent last.pt, automatically including all associated arguments from step 1.
No arguments should be passed other than --resume or --resume path/to/last.pt, and no moving or renaming of checkpoints is required; see the sketch below for one way to verify which epoch a checkpoint will resume from.
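A minimal sketch for inspecting a checkpoint, assuming the standard YOLOv5 checkpoint layout with an 'epoch' key (run from inside the yolov5/ repo so the pickled model class can be found; the path is an example):
import torch
ckpt = torch.load('runs/exp0/weights/last.pt', map_location='cpu')  # example path
print(ckpt['epoch'])  # last completed epoch; resume continues from epoch + 1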
I see, the problem was that I also passed all the arguments from step 1 along with --resume.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
Hello, is it better to train a model for a small number of epochs at a time (~100), and for every new run use the best weights from the previous experiment? Or to set a really high number of epochs (~1000s) and pause/resume the training?
Would the two options come to the same thing?
Thanks!
@imabhijit start with default settings and proceed from there.
Hi @glenn-jocher, my training command is python train.py --data data/smoke.yaml --cfg models/yolov5x.yaml --weights weights/yolov5x.pt --batch-size 10 --epochs 100 --hyp hyp.custom1.yaml. After training 100 epochs, I got the latest result in the runs/exp11 folder. If I want to resume the training, can I type python train.py --epochs 20 --resume instead of just python train.py --resume? I think that if I don't specify a new epoch number, it will train 100 epochs again (which is too much). Am I right?
@wwdok if your training completed successfully then there is nothing to resume.
You can point to any of your trained weights to use as initial weights in a new run.
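For example, a hypothetical new run initialized from the previous run's best weights (paths taken from the command above):
python train.py --data data/smoke.yaml --weights runs/exp11/weights/best.pt --epochs 30  # new run with a fresh LR schedule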
@glenn-jocher Yes, I just realized what you said. Now I'm running python train.py --data data/smoke.yaml --cfg models/yolov5x.yaml --weights runs/exp11/weights/best.pt --batch-size 10 --epochs 30 --hyp hyp.custom1.yaml, and it is indeed continuing training from the results of the 100th epoch.
@glenn-jocher Hello, I have a quick question. Can I stop my training prematurely (by pressing CTRL+C in the terminal) and then continue the training using python train.py --resume?
@faldisulistiawan resuming an interrupted run is simple. You have two options available:
python train.py --resume # resume latest training
python train.py --resume path/to/last.pt # specify resume checkpoint
If you started the training with multiple GPUs, you must continue with the same exact configuration (and vice versa). The equivalent commands are shown here, assuming you are using 8 GPUs:
python -m torch.distributed.launch --nproc_per_node 8 train.py --resume # resume latest training
python -m torch.distributed.launch --nproc_per_node 8 train.py --resume path/to/last.pt # specify resume checkpoint
Note that you may not change settings when resuming; you can only resume with the same exact settings (--epochs, --batch, etc.).
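The reason settings are locked is that --resume restores the original arguments from the run directory instead of from the new command line. A rough sketch of the mechanism, based on the opt.yaml that YOLOv5 writes into each run folder (illustrative, not the exact train.py source):
import argparse, yaml
from pathlib import Path
ckpt = Path('runs/exp0/weights/last.pt')           # checkpoint being resumed (example path)
with open(ckpt.parent.parent / 'opt.yaml') as f:   # arguments saved at training start
    opt = argparse.Namespace(**yaml.safe_load(f))  # original --epochs, --batch, etc. take effect
opt.resume = True                                  # only the resume request itself survives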
@glenn-jocher I see. Thank you so much!
Hi, I trained a model on a dataset and had to interrupt the training because the model was not getting any better. Now I need to use this trained model (last.pt) as the weights for another training run with a completely new dataset and configuration. But when I try to do this, it resumes the training from the previously interrupted epoch, even without specifying --resume. How should I proceed?
@GeorgeVJose 👋 Hello! Thanks for asking about resuming training. YOLOv5 🚀 Learning Rate (LR) schedulers follow predefined LR curves for the fixed number of --epochs defined at training start (default=300), and are designed to fall to a minimum LR on the final epoch for best training results. For this reason you cannot modify the number of epochs once training has started.
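To make the scheduler point concrete, here is a rough sketch of the cosine ("one-cycle") LR curve YOLOv5 uses; its shape depends directly on the total --epochs fixed at the start, so changing the epoch count mid-run would change the entire curve (hyperparameter values are illustrative defaults):
import math
lr0, lrf, epochs = 0.01, 0.01, 300                                         # illustrative hyps
lf = lambda x: ((1 - math.cos(x * math.pi / epochs)) / 2) * (lrf - 1) + 1  # one-cycle cosine
for e in (0, 150, 299):
    print(e, lr0 * lf(e))  # falls from 0.01 to ~0.0001, hitting the minimum on the final epoch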
If your training was interrupted for any reason, you may continue where you left off using the --resume argument. If your training fully completed, you can start a new training from any model using the --weights argument. Examples:
You may not change settings when resuming, and no additional arguments other than --resume should be passed, with an optional path to the checkpoint you'd like to resume from. If no checkpoint is passed, the most recently updated last.pt in your yolov5/ directory is automatically found and used:
python train.py --resume # automatically find latest checkpoint (searches yolov5/ directory)
python train.py --resume path/to/last.pt # specify resume checkpoint
Multi-GPU DDP trainings must be resumed with the same GPUs and DDP command, i.e. assuming 8 GPUs:
python -m torch.distributed.run --nproc_per_node 8 train.py --resume # resume latest checkpoint
python -m torch.distributed.run --nproc_per_node 8 train.py --resume path/to/last.pt # specify resume checkpoint
If you would like to start training from a fully trained model, use the --weights argument, not the --resume argument:
python train.py --weights path/to/best.pt # start from pretrained model
Good luck 🍀 and let us know if you have any other questions!
@GeorgeVJose 👋 Hello! Thanks for asking about training checkpoints. YOLOv5 🚀 checkpoints should be about 4X the size of final trained checkpoints, as they carry not just an FP16 model, but also an FP16 EMA and an FP32 optimizer of the same size as the model (each model parameter has its own FP32 gradient saved within the optimizer). Final checkpoints contain only an FP16 model, with the EMA and optimizer both stripped after the final epoch of training: https://github.com/ultralytics/yolov5/blob/b4a29b5a8d63a8c2d4a8929942b44e8969c5dddd/train.py#L423-L425
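As a rough worked example of the ~4X figure (illustrative numbers, assuming one FP32 state entry per parameter in the optimizer):
params = 7.2e6                          # roughly YOLOv5s-sized model
fp16, fp32 = 2, 4                       # bytes per parameter
during = params * (fp16 + fp16 + fp32)  # FP16 model + FP16 EMA + FP32 optimizer state ≈ 58 MB
final = params * fp16                   # stripped FP16 model only ≈ 14 MB, i.e. ~1/4 the size
print(during / 1e6, final / 1e6)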
The strip_optimizer() function updates the checkpoint dictionary value for model, replacing it with ema, and sets the ema and optimizer keys to None, which reduces the checkpoint size by 3/4.
https://github.com/ultralytics/yolov5/blob/b4a29b5a8d63a8c2d4a8929942b44e8969c5dddd/utils/general.py#L741-L755
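In outline, strip_optimizer() does something like the following (a paraphrase of the linked source for illustration, not an exact copy):
import torch
f = 'path/to/best.pt'
ckpt = torch.load(f, map_location='cpu')
if ckpt.get('ema'):
    ckpt['model'] = ckpt['ema']            # promote the EMA weights to be the model
for k in ('optimizer', 'ema', 'updates'):  # drop training-only state
    ckpt[k] = None
ckpt['model'].half()                       # FP16 final weights
for p in ckpt['model'].parameters():
    p.requires_grad = False
torch.save(ckpt, f)                        # roughly 1/4 the original size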
Final trained checkpoints after strip_optimizer() should match the README table sizes below.
You can also run strip_optimizer() manually on any checkpoint to convert it into finalized weights ready to train a new model:
from utils.general import strip_optimizer
strip_optimizer('path/to/best.pt')
!python train.py --weights path/to/best.pt # use best.pt to train a new model
Good luck 🍀 and let us know if you have any other questions!
Hi @glenn-jocher !! I went through the code and thought the same thing. The training works perfectly now. Thank you for the reply.
Question about resuming (EDIT UPDATE: I'm using YOLO11 and realized I'm in a YOLOv5 place)
I trained a model for about 60 epochs, then had to quit it. My training code looked like this:
# model was created earlier with the ultralytics package, e.g. model = YOLO("yolo11n.pt") -- exact weights unknown
yamlPath = r"C:\Users\andre\Documents\GitHub\Mothbox\AI\mothbox_training.yaml"
results = model.train(data=yamlPath, epochs=100, imgsz=1400, batch=2, device='cuda')  # lowered batch size because GPU ran out of memory; default is 16
So then I went to resume the training, using either of these options:
python train.py --resume # resume latest training
python train.py --resume path/to/last.pt # specify resume checkpoint
but when it started with either option, it started from epoch 1 again and went on to do 100.
Is that normal? I had thought it would start at epoch 60 and continue to the original 100, but maybe it just runs however many epochs the python script says? So if I wanted to train only 100 total, should I have changed my script to do 40 before resuming?
@quitmeyer it seems like you're using YOLOv5 commands in a YOLOv11 context, which might be causing the issue. In YOLOv5, using --resume should continue from the last saved epoch. Ensure you're using the correct version and commands for your specific YOLO model. If you're using YOLOv5, the --resume command should work as expected, resuming from the last epoch. Adjust the total epochs in your script to account for the epochs already completed.
"If you're using YOLOv5, the --resume command should work as expected, resuming from the last epoch. Adjust the total epochs in your script to account for the epochs already completed."
For this second part, did you mean YOLOv11? And if so, can you confirm that in YOLOv11 the resume command works and I just need to adjust the total epochs? (i.e. if I want 100 epochs, already ran 20, and stopped it, I now run with 80 and --resume?)
I can only provide guidance on YOLOv5, where the --resume command continues from the last saved epoch. For YOLOv11, please refer to the specific documentation or support channels for that version.
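For readers on the newer ultralytics package (YOLOv8/YOLO11), the documented pattern is to load the run's last.pt and pass resume=True to train(); a minimal sketch (hypothetical path, and behavior worth confirming against the current Ultralytics docs):
from ultralytics import YOLO
model = YOLO('runs/detect/train/weights/last.pt')  # hypothetical path to the interrupted run
results = model.train(resume=True)                 # documented to restore the run's original settings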
Thanks!
@quitmeyer you're welcome! If you have any more questions or need further assistance, feel free to ask.
❔Question
How do I pause training and then restart it?