You stopped it with KeyboardInterrupt
Yes, but it had been stuck for more than 2-3 hours in the previous validation step, so I terminated it manually. I am wondering why the run gets stuck during the validation phase of hyperparameter evolution and does not proceed any further.
👋 Hello! Thanks for asking about resuming evolution.
Resuming YOLOv5 🚀 evolution is a bit different from resuming a normal training run with python train.py --resume. If you started an evolution run which was interrupted, or which finished normally, and you would like to continue for additional generations where you left off, then you pass --resume and specify the --name of the evolution you want to resume, i.e.:
Assume you evolve YOLOv5s on COCO128 for 10 epochs per generation for 3 generations:
python train.py --epochs 10 --data coco128.yaml --weights yolov5s.pt --evolve 3
If this is your first evolution, a new directory runs/evolve/exp will be created to save your results.
# ├── yolov5
# └── runs
# └── evolve
# └── exp ← evolution saved here
Now assume you want to start a completely separate evolution: YOLOv5s on VOC for 5 epochs for 3 generations. You simply start evolving, and your new evolution will again be logged to a new directory runs/evolve/exp2:
python train.py --epochs 5 --data VOC.yaml --weights yolov5s.pt --evolve 3
You will now have two evolution runs saved:
# ├── yolov5
# └── runs
# └── evolve
# ├── exp ← first evolution (COCO128)
# └── exp2 ← second evolution (VOC)
If you want to resume the first evolution (COCO128, saved to runs/evolve/exp), regardless of whether it was interrupted or finished successfully, then you use the exact same command you started with plus --resume --name exp, passing the additional number of generations you want, i.e. --evolve 30 for 30 more generations:
python train.py --epochs 10 --data coco128.yaml --weights yolov5s.pt --evolve 30 --resume --name exp
Evolution will run for an additional 30 generations and all new results will be appended to the existing runs/evolve/exp/evolve.csv.
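To check how many generations have completed so far, a minimal sketch like the following (assuming pandas is installed and the default runs/evolve/exp location; the whitespace stripping is only a precaution in case the CSV headers carry padding) reads evolve.csv and prints the most recent rows:

import pandas as pd  # assumes pandas is available in your environment

df = pd.read_csv("runs/evolve/exp/evolve.csv")
df.columns = [c.strip() for c in df.columns]  # headers may carry whitespace padding
print(f"{len(df)} generations logged so far")
print(df.tail())  # hyperparameters and metrics of the most recent generations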
Good luck 🍀 and let us know if you have any other questions!
👋 Hello, this issue has been automatically marked as stale because it has not had recent activity. Please note it will be closed if no further activity occurs.
Feel free to inform us of any other issues you discover or feature requests that come to mind in the future. Pull Requests (PRs) are also always welcomed!
Thank you for your contributions to YOLOv5 🚀 and Vision AI ⭐!
Hello, I'm running a hyperparameter evolution run and I am facing the same problem. After an indeterminate number of generations, the training process gets stuck in the last-epoch validation and I have to stop it with KeyboardInterrupt. So far it has happened in the 3rd and the 28th training run. Resuming the hyperparameter evolution at the phase where it stopped works for now, but I would rather not have to check the run every two hours to find out whether it has failed.
      Epoch    GPU_mem   box_loss   obj_loss   cls_loss  Instances       Size
      96/99      5.95G    0.01084    0.01297  0.0004557         16        640: 100%|██████████| 144/144 [00:15<00:
      Epoch    GPU_mem   box_loss   obj_loss   cls_loss  Instances       Size
      97/99      5.95G    0.01084    0.01281  0.0004278          5        640: 100%|██████████| 144/144 [00:15<00:
      Epoch    GPU_mem   box_loss   obj_loss   cls_loss  Instances       Size
      98/99      5.95G    0.01049    0.01247   0.000403          8        640: 100%|██████████| 144/144 [00:15<00:
      Epoch    GPU_mem   box_loss   obj_loss   cls_loss  Instances       Size
      99/99      5.95G    0.01045    0.01247  0.0004318          5        640: 100%|██████████| 144/144 [00:18<00:
                 Class     Images  Instances          P          R      mAP50   mAP50-95:  38%|███▊      | 8/21 [1
^C
Traceback (most recent call last):
  File "train.py", line 640, in <module>
    main(opt)
  File "train.py", line 615, in main
    results = train(hyp.copy(), opt, device, callbacks)
  File "train.py", line 352, in train
    results, maps, _ = validate.run(data_dict,
  File "/home/user/yolov5/.venv/lib/python3.8/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "/home/user/yolov5/val.py", line 198, in run
    for batch_i, (im, targets, paths, shapes) in enumerate(pbar):
  File "/home/user/yolov5/.venv/lib/python3.8/site-packages/tqdm/std.py", line 1195, in __iter__
    for obj in iterable:
  File "/home/user/yolov5/utils/dataloaders.py", line 172, in __iter__
    yield next(self.iterator)
  File "/home/user/yolov5/.venv/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 628, in __next__
    data = self._next_data()
  File "/home/user/yolov5/.venv/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1316, in _next_data
    idx, data = self._get_data()
  File "/home/user/yolov5/.venv/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1272, in _get_data
    success, data = self._try_get_data()
  File "/home/user/yolov5/.venv/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1120, in _try_get_data
    data = self._data_queue.get(timeout=timeout)
  File "/usr/lib/python3.8/queue.py", line 179, in get
    self.not_empty.wait(remaining)
  File "/usr/lib/python3.8/threading.py", line 306, in wait
    gotit = waiter.acquire(True, timeout)
KeyboardInterrupt
@Pharaun85 👋 Hi there!
It looks like your hyperparameter evolution run is hanging during validation, forcing you to interrupt it with KeyboardInterrupt. The traceback shows the process waiting on the validation DataLoader queue, which points to a problem with the dataset, the data loading workers, or the preprocessing pipeline.
To troubleshoot this issue, you can verify the integrity of your dataset and labels, try reducing the number of dataloader workers (for example --workers 0, to rule out a worker deadlock), and make sure your PyTorch and YOLOv5 versions are up to date. Addressing any potential issues in these areas may resolve the hang and spare you from monitoring the run for interruptions; a workaround sketch follows below.
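If you would rather not check the run every couple of hours, a hypothetical watchdog along these lines could kill and resume the evolution whenever training stops producing output. This is a minimal sketch, not part of YOLOv5: the train.py arguments, the exp run name, and the 2-hour stall timeout are placeholders for your own setup, and it assumes the evolution run already exists so --resume --name can pick it up.

import subprocess
import threading
import time

# Placeholder command: reuse the exact arguments of your own evolution run.
CMD = [
    "python", "train.py",
    "--epochs", "100", "--data", "coco128.yaml", "--weights", "yolov5s.pt",
    "--evolve", "30", "--resume", "--name", "exp",
]
STALL_SECONDS = 2 * 60 * 60  # restart if nothing is printed for 2 hours

def run_once():
    """Run train.py once; return True if it exited on its own, False if killed for stalling."""
    proc = subprocess.Popen(CMD, stdout=subprocess.PIPE,
                            stderr=subprocess.STDOUT, text=True)
    last_output = [time.monotonic()]

    def reader():
        for line in proc.stdout:   # echo output and record when it last arrived
            print(line, end="")
            last_output[0] = time.monotonic()

    threading.Thread(target=reader, daemon=True).start()

    while proc.poll() is None:
        if time.monotonic() - last_output[0] > STALL_SECONDS:
            proc.kill()            # stalled in validation: kill and resume in the outer loop
            return False
        time.sleep(60)
    return True

while not run_once():
    print("Run stalled, resuming evolution...")

Two caveats: tqdm progress bars update with carriage returns, so only newline-terminated lines reset the timer (keep the stall timeout comfortably longer than one epoch), and hung DataLoader worker processes may survive proc.kill(), so on some systems you may still need to clean them up or kill the whole process group.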
Feel free to reach out if you need further assistance!
Search before asking
Question
The run stops during the validation process while hyperparameter evolution is in progress with the --evolve argument. The screenshot below shows the log after pressing Ctrl+C to exit manually. Do you know why it stops?
screen shot
Full Log:
                 Class     Images  Instances          P          R     mAP@.5  mAP@.5:.95:  38%|███▊
Traceback (most recent call last):
  File "train.py", line 630, in <module>
    main(opt)
  File "train.py", line 607, in main
    results = train(hyp.copy(), opt, device, callbacks)
  File "train.py", line 359, in train
    compute_loss=compute_loss)
  File "/home/qisens/anaconda3/envs/minjeong/lib/python3.7/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "/media/4tb/minjeong/yolov5/val.py", line 195, in run
    for batch_i, (im, targets, paths, shapes) in enumerate(pbar):
  File "/home/qisens/anaconda3/envs/minjeong/lib/python3.7/site-packages/tqdm/std.py", line 1195, in __iter__
    for obj in iterable:
  File "/media/4tb/minjeong/yolov5/utils/dataloaders.py", line 169, in __iter__
    yield next(self.iterator)
  File "/home/qisens/anaconda3/envs/minjeong/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 681, in __next__
    data = self._next_data()
  File "/home/qisens/anaconda3/envs/minjeong/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 1359, in _next_data
    idx, data = self._get_data()
  File "/home/qisens/anaconda3/envs/minjeong/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 1315, in _get_data
    success, data = self._try_get_data()
  File "/home/qisens/anaconda3/envs/minjeong/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 1163, in _try_get_data
    data = self._data_queue.get(timeout=timeout)
  File "/home/qisens/anaconda3/envs/minjeong/lib/python3.7/queue.py", line 179, in get
    self.not_empty.wait(remaining)
  File "/home/qisens/anaconda3/envs/minjeong/lib/python3.7/threading.py", line 300, in wait
    gotit = waiter.acquire(True, timeout)
KeyboardInterrupt
Additional
No response