ultralytics / yolov5

YOLOv5 🚀 in PyTorch > ONNX > CoreML > TFLite
https://docs.ultralytics.com
GNU Affero General Public License v3.0
50.56k stars 16.31k forks

Unable to save best.pt file in run/train/exp/weights folder #3351

Closed karndeepsingh closed 3 years ago

karndeepsingh commented 3 years ago

❔Question

I have been training YOLOv5 on my custom dataset, but it does not save a best.pt checkpoint. I have trained it almost 3 times, thinking that it was an issue with the notebook. Please help me save the best-trained weights. Only the last.pt file is saved after every training run.

Also, please explain what the best.pt file is. Is it the best-trained weights file or something else?

Thank you, Karndeep Singh

github-actions[bot] commented 3 years ago

👋 Hello @karndeepsingh, thank you for your interest in 🚀 YOLOv5! Please visit our ⭐️ Tutorials to get started, where you can find quickstart guides for simple tasks like Custom Data Training all the way to advanced concepts like Hyperparameter Evolution.

If this is a 🐛 Bug Report, please provide screenshots and minimum viable code to reproduce your issue, otherwise we cannot help you.

If this is a custom training ❓ Question, please provide as much information as possible, including dataset images, training logs, screenshots, and a public link to online W&B logging if available.

For business inquiries or professional support requests please visit https://www.ultralytics.com or email Glenn Jocher at glenn.jocher@ultralytics.com.

Requirements

Python 3.8 or later with all requirements.txt dependencies installed, including torch>=1.7. To install run:

$ pip install -r requirements.txt

Environments

YOLOv5 may be run in any of the following up-to-date verified environments (with all dependencies including CUDA/CUDNN, Python and PyTorch preinstalled):

Status

CI CPU testing

If this badge is green, all YOLOv5 GitHub Actions Continuous Integration (CI) tests are currently passing. CI tests verify correct operation of YOLOv5 training (train.py), testing (test.py), inference (detect.py) and export (export.py) on MacOS, Windows, and Ubuntu every 24 hours and on every commit.

glenn-jocher commented 3 years ago

@karndeepsingh if you use the --nosave flag or the --notest flag then yes only last.pt will be saved, this is the intended behavior.
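To check a command before starting a long run, a small helper can flag the options mentioned in this reply. This is only a sketch: the function name is hypothetical, and the flag set is taken from this thread (--nosave, --notest, and the --noval name used later on), not read from train.py itself.

```python
# Flags that, according to this thread, leave only last.pt in the weights folder.
SUPPRESSING_FLAGS = {"--nosave", "--notest", "--noval"}


def flags_blocking_best(command):
    """Return the flags in a train.py command line that would prevent best.pt."""
    tokens = command.split()
    return [token for token in tokens if token in SUPPRESSING_FLAGS]


cmd = "python train.py --data coco128.yaml --epochs 5 --nosave"
print(flags_blocking_best(cmd))
```

An empty list means none of these flags is present, so a missing best.pt would have some other cause.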

karndeepsingh commented 3 years ago

Oh! I am using that flag! Just help me with one more thing: this best.pt file stores the best-trained weights, right?

karndeepsingh commented 3 years ago

Can you share a link or resource so that I can deploy this trained model? Which files from the yolov5 folder would I need for production?

glenn-jocher commented 3 years ago

@karndeepsingh see Export and other tutorials below:

YOLOv5 Tutorials

karndeepsingh commented 3 years ago

Thank you so much! We can also load our custom-trained model using the torch.hub.load() function, right? So this can be used directly in production, I guess. Correct me if I am wrong.

karndeepsingh commented 3 years ago

One more thing to add: I am training on multiple GPUs using the command: !python -m torch.distributed.launch --nproc_per_node 2 train.py --data coco128.yaml --batch_size 4 --weights yolo5x6.pt

Training starts and the script runs, but it gets stuck after printing the details of the 1st epoch. The script keeps running with no status output after the 1st epoch. Can you help with this?

glenn-jocher commented 3 years ago

@karndeepsingh yes, see PyTorch Hub tutorial for details: https://docs.ultralytics.com/yolov5/tutorials/pytorch_hub_model_loading
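A minimal sketch of loading custom-trained weights through PyTorch Hub, assuming the 'custom' entry point with a path= argument as described in current versions of the tutorial linked above. The helper name and the checkpoint path in the usage comment are hypothetical.

```python
from pathlib import Path


def load_custom_model(weights):
    """Load custom-trained YOLOv5 weights through PyTorch Hub."""
    weights = Path(weights)
    if not weights.is_file():
        # Fail early with a clear message if the checkpoint is missing.
        raise FileNotFoundError(f"checkpoint not found: {weights}")
    import torch  # imported lazily so the path check works without PyTorch

    # Assumption: the 'custom' entry point and path= keyword from the Hub tutorial.
    return torch.hub.load("ultralytics/yolov5", "custom", path=str(weights))


# Usage (needs PyTorch, internet access on the first call, and a trained checkpoint):
# model = load_custom_model("runs/train/exp/weights/best.pt")
# results = model("image.jpg")
# results.print()
```

Checking the path first means that a run that only produced last.pt is reported as a missing file rather than as a harder-to-read loading error.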

Regarding your bug question, we've created a few short guidelines below to help users provide what we need in order to get started investigating a possible problem.

How to create a Minimal, Reproducible Example

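When asking a question, people will be better able to provide help if you provide code that they can easily understand and use to reproduce the problem. This is referred to by community members as creating a minimum reproducible example. Your code that reproduces the problem should be:

  • ✅ Minimal – Use as little code as possible that still produces the same problem
  • ✅ Complete – Provide all parts someone else needs to reproduce your problem in the question itself
  • ✅ Reproducible – Test the code you're about to provide to make sure it reproduces the problem

In addition to the above requirements, for Ultralytics to provide assistance your code should be:

  • ✅ Current – Verify that your code is up-to-date with current GitHub master, and if necessary git pull or git clone a new copy to ensure your problem has not already been resolved by previous commits.
  • ✅ Unmodified – Your problem must be reproducible without any modifications to the codebase in this repository. Ultralytics does not provide support for custom code ⚠️.

If you believe your problem meets all of the above criteria, please close this issue and raise a new one using the 🐛 Bug Report template and providing a minimum reproducible example to help us better understand and diagnose your problem.

Thank you! 😃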

karndeepsingh commented 3 years ago

@karndeepsingh yes, see PyTorch Hub tutorial for details: #36

Regarding your bug question, we've created a few short guidelines below to help users provide what we need in order to get started investigating a possible problem.

How to create a Minimal, Reproducible Example

When asking a question, people will be better able to provide help if you provide code that they can easily understand and use to reproduce the problem. This is referred to by community members as creating a minimum reproducible example. Your code that reproduces the problem should be:

  • ✅ Minimal – Use as little code as possible that still produces the same problem
  • ✅ Complete – Provide all parts someone else needs to reproduce your problem in the question itself
  • ✅ Reproducible – Test the code you're about to provide to make sure it reproduces the problem

In addition to the above requirements, for Ultralytics to provide assistance your code should be:

  • ✅ Current – Verify that your code is up-to-date with current GitHub master, and if necessary git pull or git clone a new copy to ensure your problem has not already been resolved by previous commits.
  • ✅ Unmodified – Your problem must be reproducible without any modifications to the codebase in this repository. Ultralytics does not provide support for custom code ⚠️.

If you believe your problem meets all of the above criteria, please close this issue and raise a new one using the 🐛 Bug Report template and providing a minimum reproducible example to help us better understand and diagnose your problem.

Thank you! 😃

Sure, I will take care of these!

I went through these PyTorch tutorials for inference. How can we crop the detected objects from the image this way, like we generally do by passing the --save-crop flag to detect.py?

glenn-jocher commented 3 years ago

@karndeepsingh

# model: a YOLOv5 model loaded with torch.hub.load(); imgs: the input image(s)
results = model(imgs)
results.crop()

karndeepsingh commented 3 years ago

Awesome!! Thank you so much for this help! Highly appreciated!

glenn-jocher commented 3 years ago

@karndeepsingh logging location is indicated before and after training. ALL training results are logged to this directory.

karndeepsingh commented 3 years ago

I have two questions:

  1. How can I find the number of instances of a particular class present in the training data?
  2. How can we train this YOLO model on Azure cloud?

Your suggestion would be helpful!!

glenn-jocher commented 3 years ago

See labels.png generated on training start.

For cloud environments see https://pytorch.org/get-started/cloud-partners/
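Besides reading labels.png, the per-class counts can be computed directly from the label files. The sketch below assumes the usual YOLO text format, where each line of a *.txt label file starts with the integer class id followed by the box coordinates. The function name and the directory in the usage comment are hypothetical.

```python
from collections import Counter
from pathlib import Path


def count_instances(labels_dir):
    """Count labeled objects per class id across YOLO-format *.txt files."""
    counts = Counter()
    for label_file in Path(labels_dir).glob("*.txt"):
        for line in label_file.read_text().splitlines():
            parts = line.split()
            if parts:  # skip blank lines
                counts[int(parts[0])] += 1  # first field is the class id
    return counts


# Usage (hypothetical path to the training labels of your dataset):
# print(count_instances("datasets/custom/labels/train"))
```

The returned Counter maps class ids to instance counts; the ids can be matched to class names through the names list in the dataset YAML.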

karndeepsingh commented 3 years ago

Thanks!

karndeepsingh commented 3 years ago

Hello! How is data augmentation handled in YOLOv5? Just curious to understand.

glenn-jocher commented 3 years ago

@karndeepsingh 👋 Hello! Thanks for asking about image augmentation. YOLOv5 🚀 applies online imagespace and colorspace augmentations in the trainloader (but not the testloader) to present a new and unique augmented Mosaic (original image + 3 random images) each time an image is loaded for training. Images are never presented twice in the same way.

The hyperparameters used to define these augmentations are in your hyperparameter file (default data/hyp.scratch.yaml) defined when training:

python train.py --hyp hyp.scratch.yaml

https://github.com/ultralytics/yolov5/blob/90b7895d652c3bd3d361b2d6e9aee900fd67f5f7/data/hyp.scratch.yaml#L1-L33
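As a rough illustration of how such hyperparameters can drive online augmentation, the sketch below draws a fresh set of random decisions for each image load. It is a simplification under stated assumptions, not the repository's implementation: the key names follow the hyperparameter file, the values are illustrative only, and the function name is hypothetical.

```python
import random

# Illustrative values only; read the real ones from your own hyp file.
hyp = {"hsv_h": 0.015, "hsv_s": 0.7, "hsv_v": 0.4, "fliplr": 0.5}


def sample_augmentation(hyp, rng=random):
    """Draw one random set of augmentation decisions from the hyperparameters."""
    # Each HSV gain is a multiplier drawn uniformly from [1 - g, 1 + g].
    gains = {k: 1 + rng.uniform(-1, 1) * hyp[k] for k in ("hsv_h", "hsv_s", "hsv_v")}
    # A probability-style entry decides whether the image is mirrored.
    flip = rng.random() < hyp["fliplr"]
    return gains, flip


print(sample_augmentation(hyp))
```

Because the draw is repeated every time an image is loaded, the same source image gets different gains and flips across epochs.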

You can view the effect of your augmentation policy in your train_batch*.jpg images once training starts. These images will be in your train logging directory, typically yolov5/runs/train/exp:

train_batch0.jpg shows train batch 0 mosaics and labels:

Good luck and let us know if you have any other questions!

karndeepsingh commented 3 years ago

@glenn-jocher Thanks for the detailed information. So, are augmentations applied automatically, or do we need to specify this hyperparameter file explicitly when training?

glenn-jocher commented 3 years ago

@karndeepsingh see train.py argparser for hyp.yaml argument: https://github.com/ultralytics/yolov5/blob/7d3686a686478c78beb2b32cf8a35c1a5dbe81b8/train.py#L452-L489

karndeepsingh commented 3 years ago

Hello, I have trained a model and set a specific threshold, such as 0.6, and I am able to show predictions on the images with bounding boxes whose confidence is above the threshold. But I want to save the images that the model has predicted with a low confidence level, i.e. below that threshold. Any suggestions on how I can achieve this?

The reason for asking is that I want to use active learning to annotate my large dataset. Any help on active learning with YOLOv5 would be good!

Thank you

github-actions[bot] commented 3 years ago

👋 Hello, this issue has been automatically marked as stale because it has not had recent activity. Please note it will be closed if no further activity occurs.

Access additional YOLOv5 🚀 resources:

Access additional Ultralytics ⚡ resources:

Feel free to inform us of any other issues you discover or feature requests that come to mind in the future. Pull Requests (PRs) are also always welcomed!

Thank you for your contributions to YOLOv5 🚀 and Vision AI ⭐!

priyabratknoldus commented 3 years ago

I am getting the same issue when running for more than 5 epochs: the best.pt file is not generated. Could you tell me what I should change so that I get a best.pt file when training for more than 40 epochs?

glenn-jocher commented 3 years ago

@priyabratknoldus best.pt is saved every best epoch automatically.

priyabratknoldus commented 3 years ago

@karndeepsingh if you use the --nosave flag or the --notest flag then yes only last.pt will be saved, this is the intended behavior.

But where can we change this in the code? In which file, may I know?

priyabratknoldus commented 3 years ago

@priyabratknoldus best.pt is saved every best epoch automatically.

But when I set 5 epochs it is saved, while for more than 10 epochs best.pt is not saved.

priyabratknoldus commented 3 years ago

One more question: if best.pt is not saved, then I believe we would not be able to predict on a new image.

glenn-jocher commented 3 years ago

@priyabratknoldus best.pt is saved on every new best epoch. If you use --nosave or --noval then best.pt will not be saved naturally.
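The "new best epoch" rule can be pictured with a small simulation. This is a simplified sketch, assuming a single fitness number per epoch computed from validation; the function name and the fitness values are made up for illustration, and the real training script has more detail.

```python
def checkpoints_written(fitness_per_epoch):
    """Return, per epoch, which checkpoint files a best-so-far rule would write."""
    best_fitness = float("-inf")
    written = []
    for epoch, fitness in enumerate(fitness_per_epoch):
        files = ["last.pt"]          # last.pt is refreshed every epoch
        if fitness > best_fitness:   # strictly better than anything seen so far
            best_fitness = fitness
            files.append("best.pt")  # best.pt is only rewritten on a new best
        written.append((epoch, files))
    return written


for epoch, files in checkpoints_written([0.20, 0.35, 0.30, 0.41, 0.41]):
    print(epoch, files)
```

In this run, epochs 2 and 4 only refresh last.pt because their fitness does not exceed the best value seen so far. With validation disabled there is no fitness to compare, which matches the explanation above for why only last.pt is written in that case.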

joynjo commented 3 years ago

@glenn-jocher how do I remove the --nosave flag?

glenn-jocher commented 3 years ago

@joynjo I don't understand your question. --nosave is a flag you can choose to use with training. It's off by default.

emailic commented 2 years ago

Hello! I have a similar problem. best.pt is not saved to the folder where it is supposed to be saved; there is only last.pt. I resumed training using the weights saved in the .../feature_extraction14 folder, and the best results occurred after resuming (training was resumed at epoch 95, and the best results occurred at epoch 134) and were saved to the .../feature_extraction15 folder. The nosave flag is set to False.

These are the parameters I used:

train: weights=/content/drive/My Drive/microplasticos/microplasticos_576/feature_extraction14/weights/last.pt, cfg=, data=/content/drive/My Drive/microplasticos/microplasticos_576.yaml, hyp=../../../../drive/My Drive/microplasticos/yolov5-master/data/hyps/hyp.scratch-low.yaml, epochs=250, batch_size=14, imgsz=576, rect=False, resume=False, nosave=False, noval=False, noautoanchor=False, noplots=False, evolve=None, bucket=, cache=ram, image_weights=False, device=, multi_scale=False, single_cls=False, optimizer=SGD, sync_bn=False, workers=8, project=/content/drive/My Drive/microplasticos/microplasticos_576, name=feature_extraction, exist_ok=False, quad=False, cos_lr=False, label_smoothing=0.0, patience=40, freeze=[12], save_period=-1, local_rank=-1, entity=None, upload_dataset=False, bbox_interval=-1, artifact_alias=latest

glenn-jocher commented 2 years ago

@emailic so you're saying that saving proceeds normally before resuming, after resuming best.pt is no longer saved? How do you know this? Your /weights directory should have both files before and after resuming, so where did your best.pt go then after resuming?

emailic commented 2 years ago

Hi @glenn-jocher! Thanks for your reply. Before resuming, last.pt was saved to .../feature_extraction14, which is also where best.pt is. However, if I'm not mistaken, that best.pt belongs to the .../feature_extraction14 run, which never reached the end (it stopped at epoch 95). When I resumed (and finished) the training, the new last.pt was saved to the .../feature_extraction15 folder, which is where I also expected best.pt to be, but it's not there.

glenn-jocher commented 2 years ago

@emailic --resume resumes to the same exact directory, it does not create a new directory.

emailic commented 2 years ago

Hi @glenn-jocher, thanks for getting back to me. I actually had some problems resuming the training with --resume, so I resumed it by passing the last weights obtained (feature_extraction14) to the --weights flag. The training did resume (it started iterating from the 95th epoch), and at the end of training the output says that the results are saved to feature_extraction15. However, in that folder I can only find last.pt; I can't seem to find best.pt.

glenn-jocher commented 2 years ago

@emailic thanks for the info. Is this reproducible? i.e. if you CTRL+C in the middle of training and then --resume from the specific last.pt do you again see a new directory created?

python train.py --epochs 10  # CTRL+C
python train.py --resume runs/train/exp/weights/last.pt

emailic commented 2 years ago

Hi @glenn-jocher , sorry for the delay. If this is not urgent, I will get back to you in a while, really busy with a project now. Take care

glenn-jocher commented 11 months ago

@emailic no worries! Whenever you have the time, feel free to get back to me. Good luck with your project!

khinnnnn commented 7 months ago

Hello. Could you please explain how you solved this? I have the same problem now.

glenn-jocher commented 7 months ago

@khinnnnn hello! The best.pt file is indeed intended to represent the model weights that achieved the best performance on the validation set during training, according to the metrics being monitored (e.g., mAP). If you're only seeing the last.pt file, it could be due to a few reasons:

  1. Validation Set: Ensure you have a validation set defined in your dataset. The best.pt is determined based on performance on this set. If there's no validation set, the concept of "best" doesn't apply.

  2. Training Configuration: Check your training command and configuration files to ensure they're set up correctly for saving checkpoints beyond just the last one.

  3. Patience Parameter: If you're using early stopping (via the patience parameter in some configurations), ensure it's not set too low, which might be stopping training before significant improvements are seen.

  4. File System Issues: Ensure there's enough disk space and you have the necessary write permissions in the directory where the training outputs are being saved.

  5. Manual Resumption: If you manually resumed training by specifying --weights with the last checkpoint, ensure that the training indeed picks up correctly and that the directory structure for saving checkpoints hasn't been altered unintentionally.

If you're following the standard training procedure without modifications and still facing issues, it might be helpful to share more details about your training command, dataset configuration, and any modifications you've made to the training script or environment. This can provide more context for troubleshooting.

Remember, the key to resolving this is ensuring your validation set is correctly set up and monitored during training, and that your training environment is correctly configured for saving checkpoints.
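Some of the checks in the list above (which checkpoint files exist, write permission, free disk space) can be scripted. The following is a minimal sketch using only the Python standard library; the function name and the example folder are hypothetical, and it does not inspect the validation setup.

```python
import os
import shutil
from pathlib import Path


def inspect_weights_dir(weights_dir):
    """Report basic facts about a training run's weights folder."""
    weights_dir = Path(weights_dir)
    report = {"exists": weights_dir.is_dir()}
    if report["exists"]:
        report["last.pt"] = (weights_dir / "last.pt").is_file()
        report["best.pt"] = (weights_dir / "best.pt").is_file()
        report["writable"] = os.access(weights_dir, os.W_OK)
        # Free space on the filesystem holding the folder, in gigabytes.
        report["free_gb"] = round(shutil.disk_usage(weights_dir).free / 1e9, 1)
    return report


print(inspect_weights_dir("runs/train/exp/weights"))
```

If the folder is writable, has free space, and contains last.pt but not best.pt, the remaining suspects are the validation-related points in the list above.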

pderrenger commented 1 week ago

To save images with low confidence predictions, you can modify the detection script to include a lower threshold for saving. For active learning, consider using these low-confidence predictions to identify and annotate uncertain samples. You might also explore integrating with tools like Roboflow for active learning workflows.

pderrenger commented 5 days ago

To save images with low confidence predictions, you can modify the detection script to include a lower threshold for saving images. For active learning, consider using these low-confidence predictions to identify samples for further annotation. You might find integrating a custom script to automate this process helpful.
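One way to picture the selection step is sketched below. It assumes inference has already been run with a low confidence threshold, so that weak detections are kept, and that the per-image confidences have been collected into a dictionary. The function name, the 0.25 lower bound, and the sample data are hypothetical; 0.6 is the threshold mentioned earlier in the thread.

```python
def select_uncertain(detections_per_image, low=0.25, high=0.6):
    """Return image names with at least one detection where low <= conf < high."""
    uncertain = []
    for image_name, confidences in detections_per_image.items():
        if any(low <= conf < high for conf in confidences):
            uncertain.append(image_name)
    return uncertain


# Made-up confidences, one list per image.
preds = {
    "img_001.jpg": [0.91, 0.88],
    "img_002.jpg": [0.72, 0.41],
    "img_003.jpg": [0.30],
    "img_004.jpg": [],
}
print(select_uncertain(preds))
```

The selected images are the candidates to copy into a folder for manual annotation. Images with no detections at all are not selected by this rule and may need a separate policy.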