YOLOv8 custom dataset object detection: restart training in another session

NILSHOP commented 1 month ago

Search before asking

[X] I have searched the YOLOv8 issues and discussions and found no similar questions.

Question

Hi, I am trying to train a YOLOv8 object detection model on my custom dataset. I am doing this in a Kaggle Notebook with a P100 GPU. Kaggle's current runtime limit is 12 hours, but my training needs to run longer than that. Suppose 200 epochs have completed when the runtime limit is reached, but I need to train for 300 epochs. How can I resume training from that point? The code I am using is in the attached screenshot (main_code).

Which files should I save so that I can resume in a new session, and what code should I use? Do I need to create another Kaggle notebook for that? Thanks in advance.

github-actions[bot] commented 1 month ago

👋 Hello @NILSHOP, thank you for your interest in Ultralytics YOLOv8 🚀! We recommend a visit to the Docs for new users where you can find many Python and CLI usage examples and where many of the most common questions may already be answered.

If this is a 🐛 Bug Report, please provide a minimum reproducible example to help us debug it.

If this is a custom training ❓ Question, please provide as much information as possible, including dataset image examples and training logs, and verify you are following our Tips for Best Training Results.

Join the vibrant Ultralytics Discord 🎧 community for real-time conversations and collaborations. This platform offers a perfect space to inquire, showcase your work, and connect with fellow Ultralytics users.

Install

Install the ultralytics package, including all requirements, with pip in a Python>=3.8 environment with PyTorch>=1.8.

pip install ultralytics

Environments

YOLOv8 may be run in any of the following up-to-date verified environments (with all dependencies including CUDA/CUDNN, Python and PyTorch preinstalled):

Google Cloud Deep Learning VM. See GCP Quickstart Guide
Amazon Deep Learning AMI. See AWS Quickstart Guide
Docker Image. See Docker Quickstart Guide

pderrenger commented 1 month ago

@NILSHOP to resume training after the runtime limit is reached, you need the most recent checkpoint from the interrupted run. During training, Ultralytics saves last.pt in the run's weights folder (by default runs/detect/train/weights/last.pt for a first detection run) at the end of every epoch, and this file contains the model weights, the optimizer state, and the epoch counter. Keep a copy of it, then load it in the new session to continue training.
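
A minimal sketch for locating the checkpoint to resume from, assuming the default Ultralytics output layout <runs_dir>/<run name>/weights/last.pt (on Kaggle, runs_dir would typically sit under /kaggle/working). The helper name find_latest_last_pt is hypothetical and not part of the Ultralytics API; it simply picks the most recently modified last.pt.

```python
from pathlib import Path
from typing import Optional


def find_latest_last_pt(runs_dir: str = "runs/detect") -> Optional[Path]:
    """Return the most recently modified weights/last.pt under runs_dir, or None.

    Hypothetical helper: assumes the default Ultralytics layout
    <runs_dir>/<run name>/weights/last.pt (e.g. runs/detect/train2/...).
    """
    # glob on a missing directory just yields nothing, so no existence check is needed
    candidates = list(Path(runs_dir).glob("*/weights/last.pt"))
    if not candidates:
        return None
    # The newest file belongs to the run that was written to most recently.
    return max(candidates, key=lambda p: p.stat().st_mtime)
```

For example, find_latest_last_pt("/kaggle/working/runs/detect") returns the newest last.pt, or None if no run has saved one yet.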

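You do not need to save and reload the model and optimizer state dictionaries or set the starting epoch yourself: loading last.pt with YOLO() and calling train(resume=True) restores the weights, optimizer state, and epoch counter, then continues the run. Resuming reuses the training arguments stored in the checkpoint, so the run continues only up to its original epochs value; if your target is 300 epochs, pass epochs=300 when you first start training. A run that has already completed all of its epochs cannot be resumed, only used as starting weights for a new training.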
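
Ultralytics documents resuming an interrupted run by loading last.pt into YOLO and calling train(resume=True). The sketch below wraps that call; the wrapper name resume_training and the example path are illustrative assumptions. It checks that the checkpoint exists before importing Ultralytics, so a missing file fails with a clear error.

```python
from pathlib import Path


def resume_training(last_pt: str):
    """Resume an interrupted Ultralytics run from its last.pt checkpoint.

    Hypothetical wrapper around the documented pattern
    YOLO("path/to/last.pt").train(resume=True).
    """
    ckpt = Path(last_pt)
    if not ckpt.is_file():
        raise FileNotFoundError(f"checkpoint not found: {ckpt}")
    # Imported lazily so this function can be defined without ultralytics installed.
    from ultralytics import YOLO

    model = YOLO(str(ckpt))
    # resume=True restores weights, optimizer state and the epoch counter, and
    # reuses the training arguments stored in the checkpoint (epochs, data, ...).
    return model.train(resume=True)
```

For example, resume_training("/kaggle/working/runs/detect/train/weights/last.pt"). Because the checkpoint stores the original data path, attach the dataset at the same location in the new session.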

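You don't need to create a new Kaggle notebook; you can continue in the same one, provided last.pt has been kept somewhere that survives the end of the session (for example, downloaded and re-uploaded as a Kaggle dataset), since the working directory of an interactive session is not guaranteed to persist after it ends. If you encounter any issues, please check whether they persist with the latest version of the Ultralytics package.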

For more details, see the section on resuming interrupted training on the Train mode page of the Ultralytics documentation.
