How to resume training if the PC loses power during training

sovit-123 / fasterrcnn-pytorch-training-pipeline

PyTorch Faster R-CNN Object Detection on Custom Dataset

MIT License

223 stars 75 forks

How to resume training if the PC loses power during training #75

Closed VYRION-Ai closed 1 year ago

VYRION-Ai commented 1 year ago

@sovit-123 I was running training on my PC and the power went off. I need to resume training.

Also, if I train for, say, 20 epochs and training finishes but the results are not good, how do I resume from the last weights, i.e. start from epoch 21?

sovit-123 commented 1 year ago

@VYRION-Ai You can use the following command: python train.py --model --weights --resume

Be sure to pass more than 20 epochs this time. The last weights file records the number of epochs the model has already been trained for, so training won't resume if you give 20 epochs or fewer.
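The reason the epoch count matters can be sketched as follows. This is a minimal illustration, not the repository's actual code: it assumes the training loop runs over `range(start_epoch, total_epochs)`, where `start_epoch` is read from the checkpoint and `total_epochs` is the value passed via `--epochs`. The function name `epochs_to_run` is hypothetical.

```python
def epochs_to_run(start_epoch, total_epochs):
    """Return the 0-based epoch indices a resumed run would still execute.

    Assumption: the loop is `for epoch in range(start_epoch, total_epochs)`,
    so `total_epochs` is the overall target, not the number of extra epochs.
    """
    return range(start_epoch, total_epochs)


# A checkpoint saved after 20 completed epochs, resumed with --epochs 20:
print(len(epochs_to_run(20, 20)))  # nothing left to train

# The same checkpoint resumed with --epochs 30:
print(len(epochs_to_run(20, 30)))  # ten more epochs
```

Under this assumption, `--epochs` is the total number of epochs to reach, which is why a value of 20 or less leaves nothing to do.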

VYRION-Ai commented 1 year ago

@sovit-123 I got an error when running this command:

python train.py --model fasterrcnn_resnet50_fpn --weights run_normal2/last_model_state.pth --data data.yaml --resume --epochs 30

```
device cuda
Creating data loaders
Number of training samples: 41316
Number of validation samples: 3904

Loading pretrained weights...
RESUMING TRAINING...
Traceback (most recent call last):
  File "train.py", line 550, in <module>
    main(args)
  File "train.py", line 322, in main
    if checkpoint['epoch']:
KeyError: 'epoch'
```

VYRION-Ai commented 1 year ago

This is line 320 (the line number in the traceback differs because I added some lines):

    if checkpoint['epoch']:
        start_epochs = checkpoint['epoch']
        print(f"Resuming from epoch {start_epochs}...")

sovit-123 commented 1 year ago

Which .pth file did you use? Please use 'last_model.pth'.
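The `KeyError` above suggests that the loaded file has no `'epoch'` entry. One way to check a checkpoint before resuming is to look at its keys. The sketch below works on a plain dictionary; in practice the dictionary would come from something like `torch.load(path, map_location='cpu')`. Apart from `'epoch'`, which appears in the traceback, the key names and the helper `can_resume` are assumptions made for illustration.

```python
def can_resume(checkpoint):
    """Return True if the checkpoint records how many epochs were trained."""
    # `.get` avoids the KeyError raised by checkpoint['epoch'] when the
    # file holds only model weights.
    return checkpoint.get("epoch") is not None


# Hypothetical checkpoint contents, for illustration only.
weights_only = {"model_state_dict": {}}
full = {"epoch": 20, "model_state_dict": {}, "optimizer_state_dict": {}}

for name, ckpt in [("weights only", weights_only), ("full", full)]:
    print(name, sorted(ckpt.keys()), "resumable:", can_resume(ckpt))
```

Comparing against `None` rather than relying on truthiness means a checkpoint saved at epoch 0 would still be treated as resumable.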

VYRION-Ai commented 1 year ago

@sovit-123 What is the difference between last_model_state.pth and last_model.pth?

sovit-123 commented 1 year ago

The code saves three weight files:

last_model.pth: This is saved after every epoch and contains all the information including the epochs and optimizer state dictionary. Ideal for resuming training.
last_model_state.pth: This is also saved after every epoch but only saves the model state dictionary (weights). This is ideal if trying to run inference using the latest model. Note that these may not be the best weights.
best_model.pth: This is saved only when an epoch's validation mAP surpasses the previous highest mAP. It also contains only the model weights and is the most suitable for running inference with good results.
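As a rough picture of the saving policy described above, the sketch below decides which files would be written at the end of an epoch and what each would hold. It is an illustration under assumptions, not the repository's code: the function name `files_to_save` and the key names other than `'epoch'` are hypothetical, and real code would pass each payload to `torch.save`.

```python
def files_to_save(epoch, model_state, optimizer_state, val_map, best_map):
    """Map each checkpoint filename to the payload written this epoch."""
    files = {
        # Everything needed to resume training.
        "last_model.pth": {
            "epoch": epoch,
            "model_state_dict": model_state,
            "optimizer_state_dict": optimizer_state,
        },
        # Weights only: enough for inference, not for resuming.
        "last_model_state.pth": {"model_state_dict": model_state},
    }
    # Written only when validation mAP beats the best value so far.
    if val_map > best_map:
        files["best_model.pth"] = {"model_state_dict": model_state}
    return files


# Made-up numbers: this epoch's mAP is below the best seen so far.
saved = files_to_save(epoch=5, model_state={}, optimizer_state={},
                      val_map=0.41, best_map=0.47)
print(sorted(saved))  # best_model.pth is absent: mAP did not improve
```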

VYRION-Ai commented 1 year ago

@sovit-123 Thank you very much. I have one more question: what is a good number of epochs to start with? I have 22k images in the training folder for two classes (mask and no mask). This is the mAP plot (map.jpg); the results do not seem good.

sovit-123 commented 1 year ago

@VYRION-Ai I would say, the results are not too bad. You are getting more than 85% mAP at 0.50 IoU and around 47% mAP at 0.50:0.95 IoU. However, I can suggest a few things:

The code applies mosaic augmentation (similar to YOLOv5/v8) by default. Try turning it off by passing --no-mosaic. It may improve performance.
After turning off mosaic, you can enable additional augmentation with --use-train-aug.

I would suggest starting with the above two. If you get better graphs, please post them here. I would also like to know how the model performs on various datasets out of the box and improve the code even more.
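For readers unfamiliar with the two mAP figures quoted above: mAP at 0.50 IoU counts a detection as correct when its box overlaps a ground-truth box with an IoU of at least 0.50, while mAP at 0.50:0.95 averages over ten IoU thresholds from 0.50 to 0.95 in steps of 0.05, so it rewards tighter boxes. The sketch below computes the IoU for one pair of boxes and counts at how many of those thresholds it would qualify as a match. The box coordinates are made up for illustration, and `iou` is a hypothetical helper, not a function from the repository.

```python
def iou(box_a, box_b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    # Clamp at zero so non-overlapping boxes give an empty intersection.
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


# IoU thresholds 0.50, 0.55, ..., 0.95 (ten values).
thresholds = [0.50 + 0.05 * i for i in range(10)]

ground_truth = (10, 10, 110, 110)  # made-up box
prediction = (20, 20, 120, 120)    # same size, shifted by 10 pixels

overlap = iou(ground_truth, prediction)
matches = sum(overlap >= t for t in thresholds)
print(f"IoU = {overlap:.3f}, matched at {matches} of {len(thresholds)} thresholds")
```

A box that is only slightly misaligned still passes the 0.50 threshold but fails the stricter ones, which is one reason the 0.50:0.95 figure is usually much lower than the 0.50 figure.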