frabob2017 commented 2 years ago

Search before asking

[X] I have searched the YOLOv5 issues and discussions and found no similar questions.

Question

Hello, I ran the command below in Google Colab Pro as instructed, but I got the error "cp: cannot stat 'runs': No such file or directory". Where can I find this file?

Based on this, I understood that Google Colab Pro provides 4 GPUs: https://stackoverflow.com/questions/60180673/number-of-active-sessions-google-colab-pro

!python -m torch.distributed.run --nproc_per_node 2 train.py --img 512 --batch 128 --data data.yaml --weights yolov5m.pt --cache --nosave --device 0,1

api.py", line 247, in launch_agent failures=result.failures, torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

train.py FAILED

Failures: [1]: time : 2022-09-18_14:53:55 host : cd045c9b9a39 rank : 1 (local_rank: 1) exitcode : 1 (pid: 615) error_file: <N/A> traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html

Root Cause (first observed failure): [0]: time : 2022-09-18_14:53:55 host : cd045c9b9a39 rank : 0 (local_rank: 0) exitcode : 1 (pid: 614) error_file: <N/A> traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html

cp: cannot stat 'runs': No such file or directory

Additional

No response

glenn-jocher commented 2 years ago

@frabob2017 user error. Colab instances never provide more than a single GPU, yet you've requested 2. Just follow the default examples in the official notebook: https://colab.research.google.com/github/ultralytics/yolov5/blob/master/tutorial.ipynb
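
For anyone hitting the same failure, here is a minimal sketch of checking the visible GPU count before choosing a launch command. The helper names `count_gpus` and `build_train_command` are hypothetical and not part of YOLOv5; the flags are the ones from the command in the question. The `cp` error is presumably a follow-on: if training never starts, no `runs` directory exists to copy.

```python
def count_gpus():
    """Return the number of CUDA devices PyTorch can see (0 if PyTorch is not installed)."""
    try:
        import torch
    except ImportError:
        return 0
    return torch.cuda.device_count()


def build_train_command(n_gpus, batch=128):
    """Build a train.py command line that matches the number of available GPUs.

    Hypothetical helper: it only assembles the arguments used in the question.
    """
    train_args = [
        "train.py", "--img", "512", "--batch", str(batch),
        "--data", "data.yaml", "--weights", "yolov5m.pt",
        "--cache", "--nosave",
    ]
    if n_gpus >= 2:
        # Multi-GPU DDP: one process per GPU, as in the original command.
        devices = ",".join(str(i) for i in range(n_gpus))
        launcher = ["python", "-m", "torch.distributed.run",
                    "--nproc_per_node", str(n_gpus)]
        return launcher + train_args + ["--device", devices]
    if n_gpus == 1:
        # Single GPU: plain launch, no distributed runner.
        return ["python"] + train_args + ["--device", "0"]
    # No GPU visible: fall back to CPU (slowest).
    return ["python"] + train_args + ["--device", "cpu"]


if __name__ == "__main__":
    n = count_gpus()
    print(f"visible GPUs: {n}")
    print(" ".join(build_train_command(n)))
```

With one visible GPU this produces a plain `python train.py ... --device 0` command. A batch size of 128 may still need lowering to fit a single GPU's memory.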

frabob2017 commented 2 years ago

Good to know. If parallel processing isn't available in Colab, how can I significantly increase training speed in Google Colab?

glenn-jocher commented 2 years ago

@frabob2017 👋 Hello! Thanks for asking about training speed issues. YOLOv5 🚀 can be trained on CPU (slowest), single-GPU, or multi-GPU (fastest). If you would like to increase your training speed some options are:

Increase --batch-size
Reduce --img-size
Reduce model size, i.e. from YOLOv5x -> YOLOv5l -> YOLOv5m -> YOLOv5s
Train with multi-GPU DDP at larger --batch-size
Train on cached data: python train.py --cache (RAM caching) or --cache disk (disk caching)
Train on faster GPUs, i.e.: P100 -> V100 -> A100
Train on free GPU backends with up to 16GB of CUDA memory:

Good luck 🍀 and let us know if you have any other questions!

frabob2017 commented 2 years ago

Yesterday I tried RAM caching, but it consumed too much memory. Let me try disk caching. Thank you for your great suggestions.
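
A rough back-of-the-envelope sketch of why RAM caching can run out of memory. It assumes each cached image is held uncompressed as 8-bit RGB at roughly `--img` x `--img` pixels; that is an assumption for illustration, and the actual cache layout may differ. The function name `estimate_cache_bytes` and the dataset size are hypothetical.

```python
def estimate_cache_bytes(n_images, img_size=512, channels=3):
    """Upper-bound estimate of RAM needed to cache a dataset.

    Assumes square uint8 images (1 byte per channel per pixel); non-square
    images resized by their long side would take less.
    """
    return n_images * img_size * img_size * channels


n_images = 20_000  # hypothetical dataset size
gib = estimate_cache_bytes(n_images) / 2**30
print(f"{n_images} images at 512 px: about {gib:.1f} GiB of RAM")
```

If the estimate is near or above the RAM available in the session, `--cache disk` or a smaller `--img` is the safer choice.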

github-actions[bot] commented 2 years ago

👋 Hello, this issue has been automatically marked as stale because it has not had recent activity. Please note it will be closed if no further activity occurs.

pderrenger commented 2 weeks ago

You're welcome! If you encounter any issues with disk caching or have further questions, feel free to ask.

ultralytics / yolov5

try to run parallel thread in processing but get an error #9476
