ultralytics / yolov5

YOLOv5 🚀 in PyTorch > ONNX > CoreML > TFLite
https://docs.ultralytics.com
GNU Affero General Public License v3.0
50.82k stars 16.37k forks

saving and loading yolov5 model #2922

Closed care55 closed 3 years ago

care55 commented 3 years ago

❔Question

Hi, I'm trying to save my trained YOLOv5 model so that I can load it in another session and continue training from the epoch where it stopped. How can I get the result of this command into a model:

!python train.py --img 416 --batch 16 --epochs 1 --data '../data.yaml' --cfg ./models/custom_yolov5s.yaml --weights '' --name yolov5s_results --cache

so that I can call torch.save(model.state_dict(), path)? And after saving it, how can I load it?

Additional context

Note: I want to save my work to Drive and load it from there too. I used this notebook to train on my data: https://colab.research.google.com/drive/1gDZ2xcTOgR39tGGs-EZ6i3RTs16wmzZQ#scrollTo=wbvMlHd_QwMG Thanks.

github-actions[bot] commented 3 years ago

👋 Hello @care55, thank you for your interest in 🚀 YOLOv5! Please visit our ⭐️ Tutorials to get started, where you can find quickstart guides for simple tasks like Custom Data Training all the way to advanced concepts like Hyperparameter Evolution.

If this is a 🐛 Bug Report, please provide screenshots and minimum viable code to reproduce your issue, otherwise we can not help you.

If this is a custom training ❓ Question, please provide as much information as possible, including dataset images, training logs, screenshots, and a public link to online W&B logging if available.

For business inquiries or professional support requests please visit https://www.ultralytics.com or email Glenn Jocher at glenn.jocher@ultralytics.com.

Requirements

Python 3.8 or later with all requirements.txt dependencies installed, including torch>=1.7. To install run:

$ pip install -r requirements.txt

Environments

YOLOv5 may be run in any of the following up-to-date verified environments (with all dependencies including CUDA/CUDNN, Python and PyTorch preinstalled):

Status

CI CPU testing

If this badge is green, all YOLOv5 GitHub Actions Continuous Integration (CI) tests are currently passing. CI tests verify correct operation of YOLOv5 training (train.py), testing (test.py), inference (detect.py) and export (export.py) on MacOS, Windows, and Ubuntu every 24 hours and on every commit.

care55 commented 3 years ago

@glenn-jocher can you help me please?

wudashuo commented 3 years ago

Models are automatically saved in runs/expXX/weights/. If you want to use one somewhere else, you can load it with model = torch.hub.load('ultralytics/yolov5', 'custom', path_or_model='your.pt')

pravastacaraka commented 3 years ago

@care55 maybe you can try !python train.py --resume

care55 commented 3 years ago

Thanks @wudashuo, I tried it but I got this error: Screenshot (3935) Screenshot (3936)

care55 commented 3 years ago

@pravastacaraka I tried !python train.py --resume runs/train/yolov5s_results/weights/last.pt, but when I train for a lot of epochs (maybe 2000 or more), all cells shut down after a while for no apparent reason. I then have to rerun all cells to get the runs folder back, and all of my work is lost.

How can I fix this issue? Can I fix it by using !python train.py --resume without specifying the path?

glenn-jocher commented 3 years ago

@care55 if your training was interrupted for any reason you may continue where you left off using the --resume argument. If your training fully completed then you can start a new training from a fully trained model using the --weights argument. Examples:

Resume Single-GPU

You may not change settings when resuming, and no additional arguments other than --resume should be passed:

python train.py --resume  # automatically find latest checkpoint (searches yolov5/ directory)
python train.py --resume path/to/last.pt  # specify resume checkpoint
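The automatic search is described here only as finding the latest checkpoint under yolov5/. As a rough illustration of that kind of lookup, here is a minimal sketch that assumes "latest" means the most recently modified last*.pt file. The helper name find_latest_checkpoint is hypothetical and is not part of train.py.

```python
from pathlib import Path
from typing import Optional


def find_latest_checkpoint(search_dir: str = ".") -> Optional[Path]:
    """Return the most recently modified last*.pt under search_dir, or None.

    Assumption: "latest" is decided by file modification time. This is an
    illustrative helper, not the code train.py actually runs.
    """
    # Search every subdirectory, e.g. runs/train/<name>/weights/last.pt.
    candidates = list(Path(search_dir).rglob("last*.pt"))
    if not candidates:
        return None
    # The file with the largest modification timestamp is the newest one.
    return max(candidates, key=lambda p: p.stat().st_mtime)
```

Calling find_latest_checkpoint("yolov5") before resuming is one way to check which file a bare --resume would be expected to pick up, and whether any checkpoint survived at all.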

Resume Multi-GPU

Multi-GPU DDP trainings must be resumed with the same GPUs and DDP command, i.e. assuming 8 GPUs:

python -m torch.distributed.launch --nproc_per_node 8 train.py --resume  # resume latest checkpoint
python -m torch.distributed.launch --nproc_per_node 8 train.py --resume path/to/last.pt  # specify resume checkpoint

Start from Pretrained

If you would like to start training from a fully trained model, use the --weights argument, not the --resume argument:

python train.py --weights path/to/best.pt  # start from pretrained model
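For the Colab case raised earlier in the thread, where the runtime shuts down and the runs folder is lost, one option is to copy the run directory to storage that outlives the session. The sketch below assumes Drive is already mounted in the notebook and that the run directory contains a weights folder. backup_run and the paths in the usage comment are hypothetical names chosen for illustration.

```python
import shutil
from pathlib import Path


def backup_run(run_dir: str, backup_root: str) -> Path:
    """Copy a whole training run directory into backup_root and return the copy.

    The entire run directory is copied rather than a single .pt file, so that
    any files saved next to the weights travel with them.
    """
    src = Path(run_dir)
    if not (src / "weights").is_dir():
        raise FileNotFoundError(f"no weights directory under {src}")
    dst = Path(backup_root) / src.name
    # dirs_exist_ok=True lets a later backup overwrite an earlier one (Python 3.8+).
    shutil.copytree(src, dst, dirs_exist_ok=True)
    return dst


# Hypothetical usage in a notebook cell, assuming Drive is mounted at /content/drive:
# backup_run("runs/train/yolov5s_results", "/content/drive/MyDrive/yolov5_backups")
```

A copy like this only helps if it runs before the runtime disappears, so it fits best when training is done in shorter segments with a backup after each one. In a new session the directory can be copied back and its last.pt or best.pt passed to --resume or --weights as described above.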

Good luck and let us know if you have any other questions!

care55 commented 3 years ago

@glenn-jocher Thanks, so now I have to replace the empty quotes in --weights '' with the path to my trained weights, like this: !python train.py --img 416 --batch 16 --epochs 1 --data '../data.yaml' --cfg ./models/custom_yolov5s.yaml --weights path/to/best.pt --name yolov5s_results --cache Right?

glenn-jocher commented 3 years ago

@care55 yes, though note that if you specify --weights then --cfg is not needed.

care55 commented 3 years ago

I appreciate your help, that works. Thanks @glenn-jocher

care55 commented 3 years ago

@glenn-jocher I have more questions, if you can help please. Can I train a YOLOv5 model starting from COCO weights, and if so, how do I do this? Can you please suggest the best weights to start from when training a YOLOv5 model for object detection?

glenn-jocher commented 3 years ago

@care55 yes you can start training from any pretrained YOLOv5 weights using the --weights argument. I would recommend you start from the Train Custom Data tutorial which answers many of these questions:

YOLOv5 Tutorials

john8822 commented 3 years ago

So @glenn-jocher, can I train yolov5x for 100 epochs the first time, save it to Drive, then pass those weights to --weights and train for another 100 epochs, so that it adds up to 200 epochs? Every 50 epochs of yolov5x takes almost two hours to complete, I can't keep my laptop open all day, and the GPU will be full. Is that possible, or must the training continue without interruption?

My data is about 4000 images and I tried to do this with yolov5x. I have reached 400 epochs with this technique, but when I look at my images I don't see much improvement over 50 epochs. How many epochs do I have to train to reach good accuracy? And is there any technique to see during training whether my model is improving, or can I only see the results from the images? Thanks.

glenn-jocher commented 3 years ago

@john8822 if your training was interrupted for any reason you may continue where you left off using the --resume argument. If your training fully completed, you can start a new training from any model using the --weights argument. Examples:

Resume Single-GPU

You may not change settings when resuming, and no additional arguments other than --resume should be passed, with an optional path to the checkpoint you'd like to resume from. If no checkpoint is passed the most recently updated last.pt in your yolov5/ directory is automatically found and used:

python train.py --resume  # automatically find latest checkpoint (searches yolov5/ directory)
python train.py --resume path/to/last.pt  # specify resume checkpoint

Resume Multi-GPU

Multi-GPU DDP trainings must be resumed with the same GPUs and DDP command, i.e. assuming 8 GPUs:

python -m torch.distributed.launch --nproc_per_node 8 train.py --resume  # resume latest checkpoint
python -m torch.distributed.launch --nproc_per_node 8 train.py --resume path/to/last.pt  # specify resume checkpoint

Start from Pretrained

If you would like to start training from a fully trained model, use the --weights argument, not the --resume argument:

python train.py --weights path/to/best.pt  # start from pretrained model

Good luck and let us know if you have any other questions!

john8822 commented 3 years ago

@glenn-jocher Is there any technique to see during training whether my model is improving, or can I only see the result from the images once training is completed?

glenn-jocher commented 3 years ago

@john8822 see W&B Logging tutorial:

YOLOv5 Tutorials
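Besides an online logger, progress can also be checked offline from a per-epoch log file. The sketch below assumes your YOLOv5 version writes a CSV into the run directory with an epoch column and one column per metric. The file name results.csv and the column name metrics/mAP_0.5 used in the usage comment are assumptions, so check the header of the file your version actually produces. metric_history, best_epoch and epochs_since_best are hypothetical helpers.

```python
import csv
from pathlib import Path
from typing import List, Tuple

History = List[Tuple[int, float]]


def metric_history(csv_path: str, metric: str) -> History:
    """Return (epoch, value) pairs for one metric column of a per-epoch CSV log."""
    history: History = []
    with Path(csv_path).open(newline="") as f:
        for row in csv.DictReader(f):
            # Header names and values may be padded with spaces, so strip both.
            clean = {k.strip(): v.strip() for k, v in row.items() if k and v is not None}
            history.append((int(clean["epoch"]), float(clean[metric])))
    return history


def best_epoch(history: History) -> Tuple[int, float]:
    """Return the (epoch, value) pair with the highest metric value."""
    return max(history, key=lambda pair: pair[1])


def epochs_since_best(history: History) -> int:
    """Number of epochs logged after the best one, a rough sign of a plateau."""
    return history[-1][0] - best_epoch(history)[0]


# Hypothetical usage, with an assumed file location and column name:
# hist = metric_history("runs/train/yolov5s_results/results.csv", "metrics/mAP_0.5")
# print(best_epoch(hist), epochs_since_best(hist))
```

If the best value was reached long before the last epoch, further epochs at the same settings are unlikely to help much, which is the situation described above where 400 epochs looked no better than 50.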