Closed frabob2017 closed 2 years ago
@frabob2017 user error. Colab instances never provide more than a single GPU, yet you've requested 2. Just follow the default examples in the official notebook: https://colab.research.google.com/github/ultralytics/yolov5/blob/master/tutorial.ipynb
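One way to avoid this assertion is to check how many CUDA devices the runtime actually exposes before building the --device argument. The sketch below is only an illustration and is not part of YOLOv5: pick_device_arg and count_gpus are hypothetical helper names, and it assumes PyTorch is installed so that torch.cuda.device_count() can be queried (it falls back to 0 GPUs otherwise).

```python
def pick_device_arg(requested: int, available: int) -> str:
    """Return a --device value that never asks for more GPUs than exist.

    Hypothetical helper for illustration; not a YOLOv5 function.
    """
    if available < 1 or requested < 1:
        return "cpu"
    usable = min(requested, available)
    # "0" for one GPU, "0,1" for two, and so on
    return ",".join(str(i) for i in range(usable))


def count_gpus() -> int:
    """Number of CUDA devices visible to PyTorch, or 0 if torch is missing."""
    try:
        import torch
    except ImportError:
        return 0
    return torch.cuda.device_count()


if __name__ == "__main__":
    # On a single-GPU Colab runtime, a request for 2 GPUs is reduced to "0"
    print(pick_device_arg(requested=2, available=count_gpus()))
```

With a single visible GPU, the value to pass is --device 0, and the torch.distributed.run launcher is not needed.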
Good to know. If I cannot use parallel processing in Colab, how can I significantly increase training speed in Google Colab?
@frabob2017 👋 Hello! Thanks for asking about training speed issues. YOLOv5 🚀 can be trained on CPU (slowest), single-GPU, or multi-GPU (fastest). If you would like to increase your training speed, some options are:
- Increase --batch-size
- Reduce --img-size
- Reduce model size, i.e. from YOLOv5x -> YOLOv5l -> YOLOv5m -> YOLOv5s
- Train with multi-GPU DDP at larger --batch-size
- Train on cached data: python train.py --cache (RAM caching) or --cache disk (disk caching)
- Train on faster GPUs, i.e.: P100 -> V100 -> A100
- Train on free GPU backends with up to 16GB of CUDA memory

Good luck 🍀 and let us know if you have any other questions!
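To make the options above concrete, here is a minimal sketch that assembles a train.py command line from them. It only uses flags that appear in this thread (--img, --batch, --data, --weights, --cache, --device, and the torch.distributed.run launcher); build_train_cmd is a hypothetical helper, and the default values are illustrative assumptions rather than recommendations.

```python
from typing import List, Optional


def build_train_cmd(
    weights: str = "yolov5s.pt",   # smaller model trains faster
    img: int = 512,                # smaller images train faster
    batch: int = 64,               # larger batches, if GPU memory allows
    data: str = "data.yaml",
    cache: Optional[str] = None,   # None, "ram", or "disk"
    n_gpus: int = 1,
) -> List[str]:
    """Assemble a YOLOv5 training command (hypothetical helper)."""
    if n_gpus > 1:
        # Multi-GPU DDP launch, one process per GPU
        cmd = ["python", "-m", "torch.distributed.run",
               "--nproc_per_node", str(n_gpus), "train.py"]
    else:
        cmd = ["python", "train.py"]
    device = ",".join(str(i) for i in range(max(n_gpus, 1)))
    cmd += ["--img", str(img), "--batch", str(batch), "--data", data,
            "--weights", weights, "--device", device]
    if cache == "ram":
        cmd.append("--cache")
    elif cache == "disk":
        cmd += ["--cache", "disk"]
    elif cache is not None:
        raise ValueError("cache must be None, 'ram', or 'disk'")
    return cmd


if __name__ == "__main__":
    # Single-GPU Colab run with disk caching
    print(" ".join(build_train_cmd(cache="disk")))
```

In a Colab cell the printed string would be run with a leading "!".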
Yesterday I tried RAM caching, but it consumes too much memory. Let me try disk caching. Thank you for your great suggestions.
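A rough back-of-envelope check can show in advance whether RAM caching is likely to fit. The sketch below assumes each cached image is held as an uncompressed 8-bit, 3-channel array no larger than img_size x img_size pixels; that is an assumption about the cache layout made for illustration, so treat the result as an approximate upper bound. estimate_cache_bytes is a hypothetical helper, and the image count is a made-up example.

```python
def estimate_cache_bytes(n_images: int, img_size: int, channels: int = 3) -> int:
    """Approximate upper bound on RAM needed to cache a dataset.

    Assumes one byte per channel per pixel and square images of
    img_size x img_size (an illustrative assumption).
    """
    return n_images * img_size * img_size * channels


if __name__ == "__main__":
    n_images = 20_000  # hypothetical dataset size
    needed_gib = estimate_cache_bytes(n_images, 512) / 2**30
    print(f"~{needed_gib:.1f} GiB for {n_images} images at 512 px")
```

If the estimate is close to or above the RAM of the runtime, --cache disk or a smaller --img-size is the safer choice.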
👋 Hello, this issue has been automatically marked as stale because it has not had recent activity. Please note it will be closed if no further activity occurs.
Feel free to inform us of any other issues you discover or feature requests that come to mind in the future. Pull Requests (PRs) are also always welcomed!
Thank you for your contributions to YOLOv5 🚀 and Vision AI ⭐!
You're welcome! If you encounter any issues with disk caching or have further questions, feel free to ask.
Search before asking
Question
Hello, I ran this command as instructed in Google Colab Pro, but I got an error: "cp: cannot stat 'runs': No such file or directory". Where can I find this file?
According to this, Google Colab Pro provides 4 GPUs: https://stackoverflow.com/questions/60180673/number-of-active-sessions-google-colab-pro
!python -m torch.distributed.run --nproc_per_node 2 train.py --img 512 --batch 128 --data data.yaml --weights yolov5m.pt --cache --nosave --device 0,1
Traceback (most recent call last):
File "train.py", line 630, in <module>
main(opt)
File "train.py", line 512, in main
device = select_device(opt.device, batch_size=opt.batch_size)
File "/content/yolov5/utils/torch_utils.py", line 118, in select_device
f"Invalid CUDA '--device {device}' requested, use '--device cpu' or pass valid CUDA device(s)"
AssertionError: Invalid CUDA '--device 0,1' requested, use '--device cpu' or pass valid CUDA device(s)
train: weights=yolov5m.pt, cfg=, data=data.yaml, hyp=data/hyps/hyp.scratch-low.yaml, epochs=300, batch_size=128, imgsz=512, rect=False, resume=False, nosave=True, noval=False, noautoanchor=False, noplots=False, evolve=None, bucket=, cache=ram, image_weights=False, device=0,1, multi_scale=False, single_cls=False, optimizer=SGD, sync_bn=False, workers=8, project=runs/train, name=exp, exist_ok=False, quad=False, cos_lr=False, label_smoothing=0.0, patience=100, freeze=[0], save_period=-1, seed=0, local_rank=-1, entity=None, upload_dataset=False, bbox_interval=-1, artifact_alias=latest
github: up to date with https://github.com/ultralytics/yolov5 ✅
Traceback (most recent call last):
File "train.py", line 630, in <module>
main(opt)
File "train.py", line 512, in main
device = select_device(opt.device, batch_size=opt.batch_size)
File "/content/yolov5/utils/torch_utils.py", line 118, in select_device
f"Invalid CUDA '--device {device}' requested, use '--device cpu' or pass valid CUDA device(s)"
AssertionError: Invalid CUDA '--device 0,1' requested, use '--device cpu' or pass valid CUDA device(s)
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 614) of binary: /usr/bin/python3
Traceback (most recent call last):
File "/usr/lib/python3.7/runpy.py", line 193, in _run_module_as_main
"__main__", mod_spec)
File "/usr/lib/python3.7/runpy.py", line 85, in _run_code
exec(code, run_globals)
File "/usr/local/lib/python3.7/dist-packages/torch/distributed/run.py", line 765, in <module>
main()
File "/usr/local/lib/python3.7/dist-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
return f(*args, **kwargs)
File "/usr/local/lib/python3.7/dist-packages/torch/distributed/run.py", line 761, in main
run(args)
File "/usr/local/lib/python3.7/dist-packages/torch/distributed/run.py", line 755, in run
)(*cmd_args)
File "/usr/local/lib/python3.7/dist-packages/torch/distributed/launcher/api.py", line 131, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/usr/local/lib/python3.7/dist-packages/torch/distributed/launcher/api.py", line 247, in launch_agent
failures=result.failures,
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
train.py FAILED
Failures:
[1]:
time : 2022-09-18_14:53:55
host : cd045c9b9a39
rank : 1 (local_rank: 1)
exitcode : 1 (pid: 615)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
Root Cause (first observed failure):
[0]:
time : 2022-09-18_14:53:55
host : cd045c9b9a39
rank : 0 (local_rank: 0)
exitcode : 1 (pid: 614)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
cp: cannot stat 'runs': No such file or directory
Additional
No response