Saving Early Stopping Patience Value in last.pt Checkpoint

mabubakarsaleem commented 2 months ago

Search before asking

[X] I have searched the YOLOv5 issues and discussions and found no similar questions.

Question

Hello,

I have a question regarding the checkpointing mechanism in YOLOv5, specifically related to saving and resuming the training process. When training a YOLOv5 model, the last.pt checkpoint saves the model's weights and optimizer state. However, it appears that training process parameters, such as the early stopping patience value, are not included in this checkpoint. If my training is interrupted and I restart from the last.pt checkpoint, does the patience value reset to zero, or does it continue from the previously recorded value?

glenn-jocher commented 2 months ago

@mabubakarsaleem hello,

Thank you for your question and for thoroughly searching the issues and discussions beforehand!

Currently, the last.pt checkpoint in YOLOv5 saves the model's weights and optimizer state but does not include training process parameters such as the early stopping patience value. Therefore, if your training is interrupted and you restart from the last.pt checkpoint, the early stopping counter (the number of epochs elapsed without improvement) starts over from its initial state rather than continuing from the previously recorded value.
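
To illustrate why the count starts over, here is a minimal sketch of a patience-based stopper. `SimpleStopper` and `epochs_without_improvement` are hypothetical names written for this example, not the YOLOv5 implementation. The sketch assumes the stopper remembers the best fitness and the epoch where it occurred, and stops once `patience` epochs pass without improvement. A freshly created stopper has no memory of earlier epochs, so the first epoch after a resume is treated as a new best.

```python
class SimpleStopper:
    """Hypothetical patience-based early stopper (illustration only)."""

    def __init__(self, patience=3):
        self.best_fitness = 0.0   # best metric seen so far
        self.best_epoch = 0       # epoch at which the best metric occurred
        self.patience = patience  # epochs to wait without improvement

    def __call__(self, epoch, fitness):
        if fitness >= self.best_fitness:
            self.best_epoch = epoch
            self.best_fitness = fitness
        return epoch - self.best_epoch >= self.patience  # True means stop


def epochs_without_improvement(stopper, epoch):
    """Number of epochs elapsed since the stopper last saw an improvement."""
    return epoch - stopper.best_epoch


# First session: best fitness at epoch 5, then two epochs with no improvement.
stopper = SimpleStopper(patience=3)
for epoch, fitness in [(5, 0.60), (6, 0.55), (7, 0.58)]:
    stopper(epoch, fitness)
waited_before = epochs_without_improvement(stopper, 7)

# Resume: a new stopper is created, so the earlier best is forgotten.
resumed = SimpleStopper(patience=3)
resumed(8, 0.57)  # worse than 0.60, yet recorded as a new best
waited_after = epochs_without_improvement(resumed, 8)

print(waited_before, waited_after)
```

In this sketch the first session has already waited two epochs, while the resumed stopper reports zero, so the full patience window is available again after the restart.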

To maintain the early stopping patience value across training sessions, you can manually track this parameter and adjust it when resuming training. Here's a simple way to do this:

Save the Patience Value: Before interrupting the training, save the current patience value to a file.
Load the Patience Value: When resuming training, read the saved patience value and set it accordingly.

Here's a code snippet to illustrate this:

# Save the patience value before interrupting training.
# `early_stopping` is assumed to be the early stopping object used by your training loop.
patience_value = early_stopping.patience
with open('patience_value.txt', 'w') as f:
    f.write(str(patience_value))

# Load the patience value when resuming training
with open('patience_value.txt', 'r') as f:
    patience_value = int(f.read().strip())
early_stopping.patience = patience_value
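
Note that the configured `patience` alone does not record how many epochs have already passed without improvement. Below is a minimal sketch of persisting that progress as well. It assumes the stopper object exposes `best_epoch` and `best_fitness` attributes next to `patience`, which you should check against your copy of the code. The helper names and the `stopper_state.json` file are hypothetical.

```python
import json
import os
import tempfile
from types import SimpleNamespace

# Attributes assumed to exist on the stopper object (an assumption, verify locally).
STATE_KEYS = ("best_epoch", "best_fitness", "patience")


def save_stopper_state(stopper, path):
    """Write the stopper's progress to a JSON file."""
    state = {key: getattr(stopper, key) for key in STATE_KEYS}
    with open(path, "w") as f:
        json.dump(state, f)


def load_stopper_state(stopper, path):
    """Copy saved progress onto a freshly created stopper."""
    with open(path) as f:
        state = json.load(f)
    for key in STATE_KEYS:
        setattr(stopper, key, state[key])
    return stopper


# Stand-in objects; in practice these would be the real early stopping object.
old = SimpleNamespace(best_epoch=5, best_fitness=0.6, patience=30)
new = SimpleNamespace(best_epoch=0, best_fitness=0.0, patience=30)

path = os.path.join(tempfile.mkdtemp(), "stopper_state.json")
save_stopper_state(old, path)
load_stopper_state(new, path)
print(new.best_epoch, new.best_fitness, new.patience)
```

Since an interruption rarely leaves time to save anything beforehand, a helper like `save_stopper_state` would probably need to run at the end of every epoch, for example alongside the write of last.pt.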

Additionally, I encourage you to verify that you are using the latest versions of torch and the YOLOv5 repository to ensure you have the most up-to-date features and bug fixes. You can update YOLOv5 with the following commands:

git pull  # update YOLOv5
pip install -U torch  # update PyTorch

If you have any further questions or need additional assistance, feel free to ask. The YOLO community and the Ultralytics team are here to help!

ultralytics / yolov5