ultralytics / hub

Ultralytics HUB tutorials and support
https://hub.ultralytics.com
GNU Affero General Public License v3.0

Resuming training at 0 epochs left get errors, YOLOv8 #228

Closed chun92 closed 1 year ago

chun92 commented 1 year ago

Search before asking

HUB Component

Training

Bug

I'm training my model in Colab. Because of an unstable server, the training stopped right after the last epoch finished. When I tried to resume it, Colab gave me the following error.

AutoBatch: Using batch-size 12 for CUDA:0 10.55G/14.75G (72%) ✅
optimizer: SGD(lr=0.01) with parameter groups 97 weight(decay=0.0), 104 weight(decay=0.00046875), 103 bias
train: Scanning /content/datasets/bermuda/labels/train.cache... 12000 images, 0 backgrounds, 0 corrupt: 100%|██████████| 12000/12000 [00:00<?, ?it/s]
train: 15.4GB RAM required to cache images with 50% safety margin but only 4.1/12.7GB available, not caching images ⚠️
albumentations: Blur(p=0.01, blur_limit=(3, 7)), MedianBlur(p=0.01, blur_limit=(3, 7)), ToGray(p=0.01), CLAHE(p=0.01, clip_limit=(1, 4.0), tile_grid_size=(8, 8))
val: Scanning /content/datasets/bermuda/labels/valid.cache... 646 images, 0 backgrounds, 0 corrupt: 100%|██████████| 646/646 [00:00<?, ?it/s]
val: Caching images (0.6GB ram): 100%|██████████| 646/646 [00:02<00:00, 230.49it/s]
Plotting labels to runs/detect/train3/labels.jpg... 
Resuming training from epoch-99.pt from epoch 101 to 100 total epochs
Ultralytics HUB: View model at https://hub.ultralytics.com/models/TWqKelG8p5ImXhq5vAtH 🚀
Image sizes 640 train, 640 val
Using 2 dataloader workers
Logging results to runs/detect/train3
Starting training for 100 epochs...
---------------------------------------------------------------------------
UnboundLocalError                         Traceback (most recent call last)
[<ipython-input-5-f9d73d6c19b6>](https://localhost:8080/#) in <cell line: 4>()
      2 
      3 model = YOLO('https://hub.ultralytics.com/models/TWqKelG8p5ImXhq5vAtH')
----> 4 model.train()

2 frames
[/usr/local/lib/python3.9/dist-packages/ultralytics/yolo/engine/model.py](https://localhost:8080/#) in train(self, **kwargs)
    368             self.model = self.trainer.model
    369         self.trainer.hub_session = self.session  # attach optional HUB session
--> 370         self.trainer.train()
    371         # Update model and cfg after training
    372         if RANK in (-1, 0):

[/usr/local/lib/python3.9/dist-packages/ultralytics/yolo/engine/trainer.py](https://localhost:8080/#) in train(self)
    189                 ddp_cleanup(self, str(file))
    190         else:
--> 191             self._do_train(world_size)
    192 
    193     def _setup_ddp(self, world_size):

[/usr/local/lib/python3.9/dist-packages/ultralytics/yolo/engine/trainer.py](https://localhost:8080/#) in _do_train(self, world_size)
    386         if RANK in (-1, 0):
    387             # Do final val with best.pt
--> 388             LOGGER.info(f'\n{epoch - self.start_epoch + 1} epochs completed in '
    389                         f'{(time.time() - self.train_time_start) / 3600:.3f} hours.')
    390             self.final_eval()

UnboundLocalError: local variable 'epoch' referenced before assignment

The HUB page shows "100% Optimizing weights", but soon it shows the model as disconnected.

[Screenshot: HUB model page showing the run as disconnected]

I think HUB tried to resume with start_epoch 100, but since start_epoch already equals self.epochs, the loop at trainer.py line 294, for epoch in range(self.start_epoch, self.epochs):, never runs, so the epoch variable is never assigned and the error above is raised.
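For illustration, here is a minimal stand-in sketch of that failure mode (a simplified example, not the actual trainer source; the names are hypothetical): when the resumed start_epoch equals the total epoch count, the loop body never executes and the post-training summary references an unassigned variable.

def do_train(start_epoch: int, epochs: int) -> None:
    for epoch in range(start_epoch, epochs):  # range(100, 100) is empty on resume
        pass                                  # loop body never executes
    # The post-training summary references `epoch`, which was never assigned:
    print(f'{epoch - start_epoch + 1} epochs completed')  # UnboundLocalError

do_train(start_epoch=100, epochs=100)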

What can I do in HUB to work around this bug and complete the training in Colab? Could you release an updated version with a fix?

Environment

Colab Plus

Minimal Reproducible Example

No response

Additional

No response

github-actions[bot] commented 1 year ago

👋 Hello @chun92, thank you for raising an issue about Ultralytics HUB 🚀! Please visit our HUB Docs to learn more, and see our ⭐️ HUB Guidelines to quickly get started uploading datasets and training YOLO models.

If this is a 🐛 Bug Report, please provide screenshots and steps to recreate your problem to help us get started working on a fix.

If this is a ❓ Question, please provide as much information as possible, including dataset, model, environment details etc. so that we might provide the most helpful response.

We try to respond to all issues as promptly as possible. Thank you for your patience!

glenn-jocher commented 1 year ago

@chun92 thanks for the bug report! Is this reproducible every time you train the same model on this same dataset or did it just happen this once?

It seems like the final model upload may have been interrupted, while at the same time leaving nothing to resume since all epochs completed successfully.

glenn-jocher commented 1 year ago

@chun92 OK, I've taken a look at trainer.py and attempted a fix in https://github.com/ultralytics/ultralytics/pull/2200/commits/07662cba6c73ffe55977ee5e96273a86afb7cea8

I can't be sure that your resume will work, but I think this particular bug should now be resolved in ultralytics 8.0.86. Please update your package with pip install -U ultralytics and try again, and let us know if this resolves your issue.
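For reference, one defensive pattern that avoids this class of error is to give the loop variable a fallback value before the loop; the following is only a sketch of that idea with hypothetical names, and the actual change in the linked commit may differ.

import time

def do_train(start_epoch: int, epochs: int, train_time_start: float) -> None:
    epoch = start_epoch - 1              # fallback so the summary works even if no epochs run
    for epoch in range(start_epoch, epochs):
        pass                             # one training epoch per iteration
    # Safe even when range(start_epoch, epochs) was empty:
    print(f'{epoch - start_epoch + 1} epochs completed in '
          f'{(time.time() - train_time_start) / 3600:.3f} hours.')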

github-actions[bot] commented 1 year ago

👋 Hello there! We wanted to give you a friendly reminder that this issue has not had any recent activity and may be closed soon, but don't worry - you can always reopen it if needed. If you still have any questions or concerns, please feel free to let us know how we can help.


Feel free to inform us of any other issues you discover or feature requests that come to mind in the future. Pull Requests (PRs) are also always welcomed!

Thank you for your contributions to YOLO 🚀 and Vision AI ⭐