ultralytics / yolov5

YOLOv5 🚀 in PyTorch > ONNX > CoreML > TFLite
https://docs.ultralytics.com
GNU Affero General Public License v3.0

Hyperparameter evolution stopped #9697

Closed MinjeongKim03 closed 2 years ago

MinjeongKim03 commented 2 years ago

Question

Training stops during the validation step while hyperparameter evolution is running with the --evolve argument. The full log below was captured after pressing Ctrl+C to exit manually. Do you know why it stops?

Full Log

                 Class     Images  Instances          P          R     mAP@.5  mAP@.5:.95:  38%|███▊
Traceback (most recent call last):
  File "train.py", line 630, in <module>
    main(opt)
  File "train.py", line 607, in main
    results = train(hyp.copy(), opt, device, callbacks)
  File "train.py", line 359, in train
    compute_loss=compute_loss)
  File "/home/qisens/anaconda3/envs/minjeong/lib/python3.7/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "/media/4tb/minjeong/yolov5/val.py", line 195, in run
    for batch_i, (im, targets, paths, shapes) in enumerate(pbar):
  File "/home/qisens/anaconda3/envs/minjeong/lib/python3.7/site-packages/tqdm/std.py", line 1195, in __iter__
    for obj in iterable:
  File "/media/4tb/minjeong/yolov5/utils/dataloaders.py", line 169, in __iter__
    yield next(self.iterator)
  File "/home/qisens/anaconda3/envs/minjeong/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 681, in __next__
    data = self._next_data()
  File "/home/qisens/anaconda3/envs/minjeong/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 1359, in _next_data
    idx, data = self._get_data()
  File "/home/qisens/anaconda3/envs/minjeong/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 1315, in _get_data
    success, data = self._try_get_data()
  File "/home/qisens/anaconda3/envs/minjeong/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 1163, in _try_get_data
    data = self._data_queue.get(timeout=timeout)
  File "/home/qisens/anaconda3/envs/minjeong/lib/python3.7/queue.py", line 179, in get
    self.not_empty.wait(remaining)
  File "/home/qisens/anaconda3/envs/minjeong/lib/python3.7/threading.py", line 300, in wait
    gotit = waiter.acquire(True, timeout)
KeyboardInterrupt

Additional

No response

glenn-jocher commented 2 years ago

You stopped it with KeyboardInterrupt

MinjeongKim03 commented 2 years ago

Yes, but it had been stuck in the validation step for more than 2-3 hours, so I ended it manually. I'm wondering why the run gets stuck during the validation phase of hyperparameter evolution and does not proceed any further.

glenn-jocher commented 2 years ago

👋 Hello! Thanks for asking about resuming evolution.

Resuming YOLOv5 🚀 evolution is a bit different than resuming a normal training run with python train.py --resume. If you started an evolution run which was interrupted, or finished normally, and you would like to continue for additional generations where you left off, then you pass --resume and specify the --name of the evolution you want to resume, i.e.:

Start Evolution

Assume you evolve YOLOv5s on COCO128 for 10 epochs for 3 generations:

python train.py --epochs 10 --data coco128.yaml --weights yolov5s.pt --evolve 3

If this is your first evolution, a new directory runs/evolve/exp will be created to save your results.

# ├── yolov5
#     └── runs
#         └── evolve
#             └── exp  ← evolution saved here

Start a Second Evolution

Now assume you want to start a completely separate evolution: YOLOv5s on VOC for 5 epochs for 3 generations. You simply start evolving, and your new evolution will again be logged to a new directory runs/evolve/exp2:

python train.py --epochs 5 --data VOC.yaml --weights yolov5s.pt --evolve 3

You will now have two evolution runs saved:

# ├── yolov5
#     └── runs
#         └── evolve
#             ├── exp  ← first evolution (COCO128)
#             └── exp2  ← second evolution (VOC)

Resume an Evolution

If you want to resume the first evolution (COCO128, saved to runs/evolve/exp), regardless of whether it was interrupted or finished successfully, use the exact same command you started with plus --resume --name exp, passing the additional number of generations you want, i.e. --evolve 30 for 30 more generations:

python train.py --epochs 10 --data coco128.yaml --weights yolov5s.pt --evolve 30 --resume --name exp

Evolution will run for an additional 30 generations and all new results will be appended to the existing runs/evolve/exp/evolve.csv.
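
Since evolve.csv is a plain CSV with one row per generation, you can inspect intermediate results at any time. Below is a minimal sketch using pandas; the metric column name is an assumption, so check it against your own CSV header, and note that YOLOv5 may pad header names with spaces:

import pandas as pd

# Load the evolution log (one row per generation).
df = pd.read_csv("runs/evolve/exp/evolve.csv")
df.columns = [c.strip() for c in df.columns]  # headers may be space-padded

print(f"{len(df)} generations logged")
# Sort by the fitness metric; adjust the column name to match your header.
print(df.sort_values("metrics/mAP_0.5:0.95", ascending=False).head(3))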

Good luck 🍀 and let us know if you have any other questions!

github-actions[bot] commented 2 years ago

👋 Hello, this issue has been automatically marked as stale because it has not had recent activity. Please note it will be closed if no further activity occurs.

Feel free to inform us of any other issues you discover or feature requests that come to mind in the future. Pull Requests (PRs) are also always welcomed!

Thank you for your contributions to YOLOv5 🚀 and Vision AI ⭐!

Pharaun85 commented 1 year ago

Hello, I'm running a hyperparameter evolution run and I am facing the same problem. After an indeterminate number of generations, the training process gets stuck in the last-epoch validation and I have to stop it with KeyboardInterrupt. So far it has happened in the 3rd and the 28th training runs. Resuming the hyperparameter evolution at the phase where it stopped works for now, but I would like not to have to check on the run every two hours to know whether it has failed.

      Epoch    GPU_mem   box_loss   obj_loss   cls_loss  Instances       Size
      96/99      5.95G    0.01084    0.01297  0.0004557         16        640: 100%|██████████| 144/144 [00:15<00:
      Epoch    GPU_mem   box_loss   obj_loss   cls_loss  Instances       Size
      97/99      5.95G    0.01084    0.01281  0.0004278          5        640: 100%|██████████| 144/144 [00:15<00:
      Epoch    GPU_mem   box_loss   obj_loss   cls_loss  Instances       Size
      98/99      5.95G    0.01049    0.01247   0.000403          8        640: 100%|██████████| 144/144 [00:15<00:
      Epoch    GPU_mem   box_loss   obj_loss   cls_loss  Instances       Size
      99/99      5.95G    0.01045    0.01247  0.0004318          5        640: 100%|██████████| 144/144 [00:18<00:
                 Class     Images  Instances          P          R      mAP50   mAP50-95:  38%|███▊      | 8/21 [1
^C
Traceback (most recent call last):
  File "train.py", line 640, in <module>
    main(opt)
  File "train.py", line 615, in main
    results = train(hyp.copy(), opt, device, callbacks)
  File "train.py", line 352, in train
    results, maps, _ = validate.run(data_dict,
  File "/home/user/yolov5/.venv/lib/python3.8/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "/home/user/yolov5/val.py", line 198, in run
    for batch_i, (im, targets, paths, shapes) in enumerate(pbar):
  File "/home/user/yolov5/.venv/lib/python3.8/site-packages/tqdm/std.py", line 1195, in __iter__
    for obj in iterable:
  File "/home/user/yolov5/utils/dataloaders.py", line 172, in __iter__
    yield next(self.iterator)
  File "/home/user/yolov5/.venv/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 628, in __next__
    data = self._next_data()
  File "/home/user/yolov5/.venv/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1316, in _next_data
    idx, data = self._get_data()
  File "/home/user/yolov5/.venv/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1272, in _get_data
    success, data = self._try_get_data()
  File "/home/user/yolov5/.venv/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1120, in _try_get_data
    data = self._data_queue.get(timeout=timeout)
  File "/usr/lib/python3.8/queue.py", line 179, in get
    self.not_empty.wait(remaining)
  File "/usr/lib/python3.8/threading.py", line 306, in wait
    gotit = waiter.acquire(True, timeout)
KeyboardInterrupt

glenn-jocher commented 1 year ago

@Pharaun85 👋 Hi there!

It looks like your hyperparameter evolution run is getting stuck during the validation process, causing the need to interrupt it with KeyboardInterrupt. One possible reason for this behavior could be related to the input data or data loading pipeline.

It seems that the issue is occurring during the validation phase when iterating over the data. This could indicate a potential problem with the dataset, data loading, or the data preprocessing steps.

To troubleshoot this issue, you can try the following steps:

  1. Check the integrity of your input data to ensure there are no corruptions or inconsistencies.
  2. Verify that your data loading pipeline is functioning correctly and efficiently (see the sketch after this list for a quick way to test it in isolation).
  3. Confirm that the data preprocessing steps are producing the expected outputs for the model.
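
As referenced in step 2, a quick way to isolate a dataloader hang is to iterate the validation set in a single process: with workers=0 a bad sample raises an ordinary exception instead of deadlocking a worker. This is only a rough sketch using YOLOv5's own loader; the path is a placeholder and the create_dataloader signature may differ between versions:

# Run from the yolov5 repo root. workers=0 disables multiprocessing, so any
# corrupt image or label surfaces as a normal exception at the failing batch.
from utils.dataloaders import create_dataloader

loader, dataset = create_dataloader(
    "path/to/val/images",  # placeholder: your validation image directory
    imgsz=640,
    batch_size=16,
    stride=32,
    workers=0,
)
for i, (im, targets, paths, shapes) in enumerate(loader):
    print(i, paths[0])  # the last printed batch brackets the problem sample

Passing --workers 0 to train.py achieves the same single-process loading for the whole run, at some throughput cost, which can help confirm whether multiprocessing is involved in the hang.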

By addressing any potential issues in these areas, you may be able to resolve the problem and prevent the need to monitor the run for interruptions.

Feel free to reach out if you need further assistance!