ultralytics / yolov5

YOLOv5 🚀 in PyTorch > ONNX > CoreML > TFLite
https://docs.ultralytics.com
GNU Affero General Public License v3.0

Easy way to save checkpoints for Colab user #640

Closed TaoXieSZ closed 4 years ago

TaoXieSZ commented 4 years ago

🚀 Feature

It would be more convenient for Colab users to save checkpoints in Google Drive than in yolov5/runs.

My idea

Just change the code around lines 458 to 464 (in my current version):

if not opt.evolve:
    tb_writer = None
    if opt.local_rank in [-1, 0]:
        print('Start Tensorboard with "tensorboard --logdir=runs", view at http://localhost:6006/')
        # Change the path here
        tb_writer = SummaryWriter(log_dir=increment_dir('/content/drive/My Drive/yolov5-checkpoints/exp', opt.name))

    train(hyp, opt, device, tb_writer)
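
For context, the Drive path above only exists once Google Drive is mounted in the Colab session; a minimal sketch of that step, assuming the standard Colab helper and the same checkpoint folder as in the snippet:

import os
from google.colab import drive

drive.mount('/content/drive')  # prompts for authorization the first time it runs
os.makedirs('/content/drive/My Drive/yolov5-checkpoints', exist_ok=True)  # create the target folder if it does not exist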
glenn-jocher commented 4 years ago

@ChristopherSTAN can you point tensorboard to a google drive folder like you have? That would be really cool, then all of your work is saved and you can keep track of experiments this way.

glenn-jocher commented 4 years ago

This is a really good pro tip for Colab users! Maybe we should add a --log-dir argument to train.py to enable this?
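
For illustration only, a hedged sketch of how such an argument could be wired in (the flag name, default, and plumbing here are assumptions, not the eventual PR):

import argparse
from pathlib import Path

from torch.utils.tensorboard import SummaryWriter

parser = argparse.ArgumentParser()
parser.add_argument('--logdir', type=str, default='runs/', help='logging directory, e.g. a mounted Google Drive path')
opt = parser.parse_args()

# Point the TensorBoard writer at the chosen directory instead of a hard-coded runs/ path
tb_writer = SummaryWriter(log_dir=str(Path(opt.logdir) / 'exp'))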

TaoXieSZ commented 4 years ago

@glenn-jocher You reminded me of that.

However, I haven't looked at TensorBoard in a long time. I just tried it, and it can only point to data inside the yolov5 folder.

As for the argument, that's up to you, LOL. I do think it would be more convenient for Colab users if you add it. mmdetection has a --work-dir argument; that is what gave me the idea, and I found it can save checkpoints to the much larger Google Drive storage.

BTW, in my experience, running TensorBoard often slows down notebooks and causes disconnections (maybe Google tries to prevent over-usage), so I usually skip it.

glenn-jocher commented 4 years ago

It does work! Wow, so this is a backdoor to permanence with Colab. You can actually log all of your experiments straight to drive, and then pick up where you left off the next day without having to move any files. This is a real game changer for colab dev work. I'll add a PR for the argparser --logdir argument.

TaoXieSZ commented 4 years ago

@glenn-jocher It is really amazing!

glenn-jocher commented 4 years ago

All done. Thanks for the great idea @ChristopherSTAN!

TaoXieSZ commented 4 years ago

@glenn-jocher It is just some feedback from a heavy user. Looking forward to an even better yolov5 in the future.

BTW, I noticed the default bbox loss is now CIoU; maybe you should update the logging label, since the current one may cause some confusion.

glenn-jocher commented 4 years ago

@ChristopherSTAN yes, you are correct, it's now CIoU. I need to update the comment to a criterion-agnostic term like 'box' or 'regression'.

glenn-jocher commented 4 years ago

TODO: Update GIoU labels to criterion-agnostic terms.

github-actions[bot] commented 4 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

maheeeetaaa commented 3 years ago

@glenn-jocher Hello, I have been trying to train yolov5 v4.0 and it seems the train arguments have changed. Before, I used --logdir, and when training stopped (because I work on Colab) I would rerun it and it would pick up where it left off. Now it doesn't! I even pointed --weights at my previous checkpoint, but training starts as if there had been no prior training: the epoch number doesn't reset, yet all the mAP graphs show training starting from the beginning. What should I do?

Here are my arguments:

!python train.py --img 320 --batch 128 --epochs 200 \
    --data /content/YoloV5Data/data.yaml \
    --cfg ./models/yolov5s.yaml \
    --weights /content/drive/Yolov5S_320/exp5/weights/last.pt \
    --project /content/drive/Yolov5S_320/

glenn-jocher commented 3 years ago

@maheeeetaaa yes, the local directory logging structure was unified in https://github.com/ultralytics/yolov5/pull/1377. Training results are now saved to runs/train/exp.

You may resume an interrupted training run very simply:

python train.py --resume  # automatically select most recent run
python train.py --resume path/to/last.pt  # manually specify run to resume
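
If the run directory lives on a mounted Drive folder (as in the command above), the same pattern should apply, provided the run's opt.yaml was saved alongside the weights; a hedged example reusing that path:

python train.py --resume /content/drive/Yolov5S_320/exp5/weights/last.pt  # resume a run saved to Google Drive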
Leprechault commented 3 years ago

[quotes the original feature request and code snippet from the first post above]

@TaoXieSZ as someone new to the subject, I don't understand whether your proposed change to lines 458 to 464 goes in the model yaml file or in another file. Could you please help me?

glenn-jocher commented 3 years ago

@Leprechault runs can be logged anywhere now, so @TaoXieSZ's comment is no longer applicable. To log a run to any directory, use the --project argument along with the --name argument: python train.py --project runs/train --name exp
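
For the Colab use case that started this thread, those same two flags can point a run straight at a mounted Drive folder; a hedged example reusing the path from the first post (assumes Drive is already mounted at /content/drive):

python train.py --project '/content/drive/My Drive/yolov5-checkpoints' --name exp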

Leprechault commented 3 years ago

Thanks very much @glenn-jocher !!!!