oussaifi-majdi closed this issue 1 year ago
Hi @oussaifi-majdi, to solve your issue I added a new option for resuming model training. I think this will solve it. If you have any other issues, please let me know. Thank you.
@naseemap47 Thank you so much for your help with this issue! Your guidance and support were invaluable in resolving the problem. Now I use the --resume option to train on my data:
python3 train.py --data /dir/dataset/data.yaml --batch 16 --epoch 120 --model yolo_nas_m --size 640 --resume
But how can I plot accuracy, precision, etc. with TensorBoard throughout training, from the first hours until all epochs have finished?
CHECKPOINT_DIR =?
EXPERIMENT_NAME =?
%load_ext tensorboard
%tensorboard --logdir {CHECKPOINT_DIR}/{EXPERIMENT_NAME} --port 6005
%reload_ext tensorboard
Hi @oussaifi-majdi, here is an example. I think it will help you. Example:
python3 train.py --data /dir/dataset/data.yaml --batch 6 --epoch 100 --model yolo_nas_m --size 640 --weight runs/train2/ckpt_latest.pth --resume
Thanks, sir, but if I resume training later using the --resume option, it may be difficult to get the full plot of precision and accuracy from the first epoch to the end. Is there a way to get the complete plot?
Hi @oussaifi-majdi, I fixed the issue, you can check now. Thank you for finding it. This should fix your issue; please let me know. Thank you.
@naseemap47 Thanks, the #46 resume works well. The remaining problem: for example, if we train epochs 0 to 70, stop, then resume and continue from 70 to 100, TensorBoard at the end only displays the curves of recall, precision, F1, etc. for the last part of training (70 to 100), not for 1 to 100. I found some possible solutions at https://github.com/Deci-AI/super-gradients/blob/master/documentation/source/experiment_monitoring.md but they do not work with this project. It would be worth integrating one of these methods: it would improve the project, set it apart from others, and solve a very interesting problem.
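One possible workaround for the split curves described above, sketched in Python: read the scalar values from each run's TensorBoard log directory and merge them into a single series that can then be plotted or re-logged. This is only a sketch under assumptions, not something this project provides. The helper names `read_scalars` and `merge_runs` are hypothetical, and it assumes the resumed run logs its points with global epoch numbers (70 to 100) rather than restarting at 0. `EventAccumulator` is TensorBoard's own event-file reader; the tag names depend on what the trainer logs.

```python
from typing import Dict, Iterable, List, Tuple

# One curve: a list of (step, value) points, e.g. (epoch, precision).
ScalarSeries = List[Tuple[int, float]]


def read_scalars(logdir: str, tag: str) -> ScalarSeries:
    """Read one scalar tag from a TensorBoard log directory (hypothetical helper)."""
    # Imported lazily so merge_runs below works without TensorBoard installed.
    from tensorboard.backend.event_processing.event_accumulator import EventAccumulator

    acc = EventAccumulator(logdir)
    acc.Reload()  # parse the event files found in logdir
    return [(event.step, event.value) for event in acc.Scalars(tag)]


def merge_runs(segments: Iterable[ScalarSeries]) -> ScalarSeries:
    """Merge series from successive training segments into one curve.

    Segments must be passed in chronological order. When the same step
    appears in more than one segment (for example the epoch at which
    training was resumed), the value from the later segment is kept.
    """
    merged: Dict[int, float] = {}
    for segment in segments:
        for step, value in segment:
            merged[step] = value
    return sorted(merged.items())


# Made-up values standing in for two runs (before and after a resume).
first_run = [(0, 0.10), (1, 0.20), (2, 0.30)]
resumed_run = [(2, 0.31), (3, 0.40)]
full_curve = merge_runs([first_run, resumed_run])
```

With real logs, `first_run` and `resumed_run` would come from `read_scalars` called on the two run directories with the same tag. Pointing `--logdir` at a common parent directory is a lighter alternative: TensorBoard then shows both runs on the same chart as separate curves.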
@oussaifi-majdi Thank you. I will look into it. Thank you for your support.
I'm facing time limits in Google Colab. I need to train on my data for 150 epochs, but Colab terminates the session after about 50 epochs. How can I resume from the last saved checkpoint after restarting the Colab session?
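A minimal sketch of how the restart could be automated, assuming the `runs` directory is kept somewhere that survives a Colab restart (for example a mounted Google Drive folder, which is an assumption and not something this thread confirms). The helpers `find_latest_checkpoint` and `build_resume_command` are hypothetical. The flags and the `ckpt_latest.pth` file name are taken from the example command earlier in this thread. Whether `--epoch` means the total or the remaining number of epochs is not stated here, so check that before relying on it.

```python
from pathlib import Path
from typing import List, Optional


def find_latest_checkpoint(runs_dir: str, name: str = "ckpt_latest.pth") -> Optional[Path]:
    """Return the most recently modified checkpoint under runs_dir, or None if there is none."""
    candidates = list(Path(runs_dir).rglob(name))
    if not candidates:
        return None
    return max(candidates, key=lambda path: path.stat().st_mtime)


def build_resume_command(data_yaml: str, checkpoint: Path, epochs: int,
                         batch: int = 16, model: str = "yolo_nas_m",
                         size: int = 640) -> List[str]:
    """Build the train.py call using the flags shown earlier in this thread."""
    return [
        "python3", "train.py",
        "--data", data_yaml,
        "--batch", str(batch),
        "--epoch", str(epochs),
        "--model", model,
        "--size", str(size),
        "--weight", str(checkpoint),
        "--resume",
    ]
```

In a fresh Colab session, one could call `find_latest_checkpoint` on the persisted runs folder and, if it returns a path, pass the result of `build_resume_command` to `subprocess.run`.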