training with more epochs

naseemap47 / YOLO-NAS

Train and Inference your custom YOLO-NAS model by Single Command Line

Apache License 2.0


training with more epochs #44

Closed oussaifi-majdi closed 1 year ago

oussaifi-majdi commented 1 year ago

I'm facing time limits in Google Colab and need to train on my data for 150 epochs, but Colab terminates the session after about 50 epochs. How can I resume from the last saved checkpoint after restarting the Colab session?

naseemap47 commented 1 year ago

Hi @oussaifi-majdi, I added a new option to resume model training. I think this will solve your issue. If you run into any problems, please let me know. Thank you.

oussaifi-majdi commented 1 year ago

@naseemap47 Thank you so much for your help with this issue! Your guidance and support were invaluable in resolving the problem. I now resume training with this command:

python3 train.py --data /dir/dataset/data.yaml --batch 16 --epoch 120 --model yolo_nas_m --size 640 --resume

But how can I plot the curves for accuracy, precision, etc. in TensorBoard across the whole training run, from the first hours of training until all epochs are finished?

CHECKPOINT_DIR = ?
EXPERIMENT_NAME = ?
%load_ext tensorboard
%tensorboard --logdir {CHECKPOINT_DIR}/{EXPERIMENT_NAME} --port 6005
%reload_ext tensorboard
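The two placeholders depend on where the training script writes its logs, which this thread does not state. As a minimal sketch, assuming a layout like `runs/train2` (the path used in the resume example in this thread), the hypothetical helper `find_event_files` below lists TensorBoard event files (their names start with `events.out.tfevents`) under a candidate log directory, so the `--logdir` value can be checked before launching TensorBoard. The values `"runs"` and `"train2"` are assumptions, not confirmed defaults of this project.

```python
from pathlib import Path
from typing import List


def find_event_files(checkpoint_dir: str, experiment_name: str) -> List[Path]:
    """Return TensorBoard event files found under checkpoint_dir/experiment_name."""
    logdir = Path(checkpoint_dir) / experiment_name
    if not logdir.is_dir():
        # Nothing to show: the candidate log directory does not exist.
        return []
    # Search recursively, since event files may sit in nested subfolders.
    return sorted(logdir.rglob("events.out.tfevents*"))


if __name__ == "__main__":
    # Assumed values, taken from the example path runs/train2 in this thread.
    CHECKPOINT_DIR = "runs"
    EXPERIMENT_NAME = "train2"
    for path in find_event_files(CHECKPOINT_DIR, EXPERIMENT_NAME):
        print(path)
```

If the list comes back empty, the directory either does not exist or contains no event logs, and TensorBoard pointed at it would show no data.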

naseemap47 commented 1 year ago

Hi @oussaifi-majdi, here is an example. I think it will help you. Example:

python3 train.py --data /dir/dataset/data.yaml --batch 6 --epoch 100 --model yolo_nas_m --size 640 --weight runs/train2/ckpt_latest.pth --resume
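After a Colab restart it may not be obvious which run folder holds the newest checkpoint. The sketch below is a hypothetical helper, not part of this repository: it assumes run folders are named `train`, `train2`, `train3`, and so on under `runs/` (inferred only from the example path above) and returns the `ckpt_latest.pth` from the highest-numbered folder that has one, which could then be passed to `--weight`.

```python
import re
from pathlib import Path
from typing import Optional


def latest_checkpoint(runs_dir: str = "runs",
                      ckpt_name: str = "ckpt_latest.pth") -> Optional[Path]:
    """Return the checkpoint from the highest-numbered train* folder, if any."""
    root = Path(runs_dir)
    if not root.is_dir():
        return None
    best: Optional[Path] = None
    best_index = -1
    for run in root.iterdir():
        # Assumed naming scheme: "train", "train2", "train3", ...
        match = re.fullmatch(r"train(\d*)", run.name)
        ckpt = run / ckpt_name
        if match is None or not ckpt.is_file():
            continue
        index = int(match.group(1) or 0)  # plain "train" counts as index 0
        if index > best_index:
            best, best_index = ckpt, index
    return best


if __name__ == "__main__":
    print(latest_checkpoint())  # None when no checkpoint is found
```

Comparing the numeric suffix rather than sorting names avoids `train10` being ordered before `train2`.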

oussaifi-majdi commented 1 year ago

Thanks, sir. But if I resume training later using the --resume option, it may be difficult to get the full precision and accuracy curves from the first epoch to the last. Is there a way to get the complete curves?

naseemap47 commented 1 year ago

Hi @oussaifi-majdi, I fixed the issue; you can check now. Thank you for finding it. Please let me know whether this fixes your issue. Thank you.

oussaifi-majdi commented 1 year ago

@naseemap47 Thanks, the resume in #46 works well. One problem remains: if training stops partway, for example after epochs 0 to 70, and is then resumed from 70 to 100, TensorBoard at the end displays the recall, precision, and F1 curves only for the last segment (70 to 100), not for the whole run (1 to 100). I found some possible solutions at https://github.com/Deci-AI/super-gradients/blob/master/documentation/source/experiment_monitoring.md but they do not work with this project. Integrating one of those methods would solve a very interesting problem and help set the project apart from others.
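Until logging continues across a resume, one possible workaround is to export the scalar values of each segment and stitch them by epoch outside TensorBoard. The sketch below is only an illustration under that assumption: `stitch_curves` is a hypothetical helper, and each segment is assumed to be already available as a list of `(epoch, value)` pairs, however it was exported. Where two segments report the same epoch, the later segment wins.

```python
from typing import Dict, Iterable, List, Tuple

Point = Tuple[int, float]  # (epoch, metric value)


def stitch_curves(segments: Iterable[Iterable[Point]]) -> List[Point]:
    """Merge metric curves from several training segments into one curve.

    Segments are given in the order they were trained. If an epoch appears
    in more than one segment, the value from the later segment is kept.
    """
    merged: Dict[int, float] = {}
    for segment in segments:
        for epoch, value in segment:
            merged[epoch] = value  # later segments overwrite earlier ones
    return sorted(merged.items())


if __name__ == "__main__":
    first_run = [(0, 0.10), (1, 0.20), (2, 0.30)]   # before the interruption
    resumed_run = [(2, 0.35), (3, 0.40)]            # after --resume
    print(stitch_curves([first_run, resumed_run]))
    # [(0, 0.1), (1, 0.2), (2, 0.35), (3, 0.4)]
```

The stitched list can be plotted with any plotting tool to get a single precision, recall, or F1 curve covering every epoch.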

naseemap47 commented 1 year ago

@oussaifi-majdi Thank you. I will look into it. Thank you for your support.