ultralytics / yolov5

YOLOv5 🚀 in PyTorch > ONNX > CoreML > TFLite
https://docs.ultralytics.com
GNU Affero General Public License v3.0

Tried to run training in parallel but got an error #9476

Closed frabob2017 closed 2 years ago

frabob2017 commented 2 years ago

Question

Hello, I ran this command in Google Colab Pro as instructed, but I got an error: "cp: cannot stat 'runs': No such file or directory". Where can I find this file?

Based on this, Google Colab Pro provides 4 GPUs. https://stackoverflow.com/questions/60180673/number-of-active-sessions-google-colab-pro

!python -m torch.distributed.run --nproc_per_node 2 train.py --img 512 --batch 128 --data data.yaml --weights yolov5m.pt --cache --nosave --device 0,1

Traceback (most recent call last):
  File "train.py", line 630, in <module>
    main(opt)
  File "train.py", line 512, in main
    device = select_device(opt.device, batch_size=opt.batch_size)
  File "/content/yolov5/utils/torch_utils.py", line 118, in select_device
    f"Invalid CUDA '--device {device}' requested, use '--device cpu' or pass valid CUDA device(s)"
AssertionError: Invalid CUDA '--device 0,1' requested, use '--device cpu' or pass valid CUDA device(s)
train: weights=yolov5m.pt, cfg=, data=data.yaml, hyp=data/hyps/hyp.scratch-low.yaml, epochs=300, batch_size=128, imgsz=512, rect=False, resume=False, nosave=True, noval=False, noautoanchor=False, noplots=False, evolve=None, bucket=, cache=ram, image_weights=False, device=0,1, multi_scale=False, single_cls=False, optimizer=SGD, sync_bn=False, workers=8, project=runs/train, name=exp, exist_ok=False, quad=False, cos_lr=False, label_smoothing=0.0, patience=100, freeze=[0], save_period=-1, seed=0, local_rank=-1, entity=None, upload_dataset=False, bbox_interval=-1, artifact_alias=latest
github: up to date with https://github.com/ultralytics/yolov5 ✅
Traceback (most recent call last):
  File "train.py", line 630, in <module>
    main(opt)
  File "train.py", line 512, in main
    device = select_device(opt.device, batch_size=opt.batch_size)
  File "/content/yolov5/utils/torch_utils.py", line 118, in select_device
    f"Invalid CUDA '--device {device}' requested, use '--device cpu' or pass valid CUDA device(s)"
AssertionError: Invalid CUDA '--device 0,1' requested, use '--device cpu' or pass valid CUDA device(s)
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 614) of binary: /usr/bin/python3
Traceback (most recent call last):
  File "/usr/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/usr/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/usr/local/lib/python3.7/dist-packages/torch/distributed/run.py", line 765, in <module>
    main()
  File "/usr/local/lib/python3.7/dist-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
    return f(*args, **kwargs)
  File "/usr/local/lib/python3.7/dist-packages/torch/distributed/run.py", line 761, in main
    run(args)
  File "/usr/local/lib/python3.7/dist-packages/torch/distributed/run.py", line 755, in run
    )(cmd_args)
  File "/usr/local/lib/python3.7/dist-packages/torch/distributed/launcher/api.py", line 131, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/usr/local/lib/python3.7/dist-packages/torch/distributed/launcher/api.py", line 247, in launch_agent
    failures=result.failures,
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

train.py FAILED

Failures:
[1]:
  time      : 2022-09-18_14:53:55
  host      : cd045c9b9a39
  rank      : 1 (local_rank: 1)
  exitcode  : 1 (pid: 615)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html

Root Cause (first observed failure):
[0]:
  time      : 2022-09-18_14:53:55
  host      : cd045c9b9a39
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 614)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html

cp: cannot stat 'runs': No such file or directory

Additional

No response

glenn-jocher commented 2 years ago

@frabob2017 user error. Colab instances never provide more than a single GPU, yet you've requested 2. Just follow the default examples in the official notebook: https://colab.research.google.com/github/ultralytics/yolov5/blob/master/tutorial.ipynb
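
The assertion in the traceback comes down to a mismatch between the GPU indices passed via --device and the GPUs the runtime actually exposes. A minimal sketch of checking this before launching a run, assuming PyTorch is installed; the helper name devices_available is hypothetical and is not part of YOLOv5:

```python
def devices_available(device_arg: str, visible_gpus: int) -> bool:
    """Return True if every index in a '--device' string such as '0,1' exists.

    Hypothetical helper for illustration; not a YOLOv5 function.
    """
    if device_arg.strip().lower() == "cpu":
        return True  # CPU is always a valid target
    indices = [int(part) for part in device_arg.split(",") if part.strip()]
    return bool(indices) and all(0 <= i < visible_gpus for i in indices)


if __name__ == "__main__":
    try:
        import torch
        visible = torch.cuda.device_count()  # number of CUDA devices PyTorch can see
    except ImportError:
        visible = 0  # assume no GPUs when PyTorch is not installed
    for arg in ("0", "0,1"):
        status = "ok" if devices_available(arg, visible) else "not available"
        print(f"--device {arg}: {status} ({visible} GPU(s) visible)")
```

On a runtime with a single GPU, a request for devices 0,1 fails this check, which matches the AssertionError reported above.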

frabob2017 commented 2 years ago

Good to know. If parallel processing is not available in Colab, how can I significantly increase training speed there?

glenn-jocher commented 2 years ago

@frabob2017 👋 Hello! Thanks for asking about training speed issues. YOLOv5 🚀 can be trained on CPU (slowest), single-GPU, or multi-GPU (fastest). If you would like to increase your training speed, some options are:

  • Increase --batch-size
  • Reduce --img-size
  • Reduce model size, i.e. from YOLOv5x -> YOLOv5l -> YOLOv5m -> YOLOv5s
  • Train with multi-GPU DDP at larger --batch-size
  • Train on cached data: python train.py --cache (RAM caching) or --cache disk (disk caching)
  • Train on faster GPUs, i.e.: P100 -> V100 -> A100
  • Train on free GPU backends with up to 16GB of CUDA memory: Google Colab or Kaggle

Good luck 🍀 and let us know if you have any other questions!

frabob2017 commented 2 years ago

Yesterday I tried caching in RAM, but it consumed too much memory. Let me try disk caching instead. Thank you for your great suggestions.
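
A rough way to judge whether RAM caching will fit before trying it: an uncompressed image cache grows with the number of images and with the square of the image size. The sketch below assumes 3-channel uint8 images and treats every image as a full square of side --img, which is a worst case. The function names, the 50% headroom rule, and the 20,000-image dataset are all hypothetical and are not taken from YOLOv5 or from this thread.

```python
def estimate_cache_gib(num_images: int, img_size: int, channels: int = 3) -> float:
    """Upper-bound size of an uncompressed uint8 image cache, in GiB."""
    bytes_total = num_images * img_size * img_size * channels  # 1 byte per value
    return bytes_total / 2**30


def pick_cache_flag(num_images: int, img_size: int, free_ram_gib: float) -> str:
    """Suggest '--cache' (RAM) if the estimate fits in half the free RAM, else '--cache disk'.

    The 0.5 headroom factor is an arbitrary assumption for illustration.
    """
    needed = estimate_cache_gib(num_images, img_size)
    return "--cache" if needed < 0.5 * free_ram_gib else "--cache disk"


if __name__ == "__main__":
    n, size = 20_000, 512  # hypothetical dataset size; 512 matches --img 512 above
    print(f"estimated cache: {estimate_cache_gib(n, size):.1f} GiB")
    print("suggested flag:", pick_cache_flag(n, size, free_ram_gib=12.0))
```

If the estimate is close to or above the available RAM, disk caching or a smaller --img is the safer choice.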

github-actions[bot] commented 2 years ago

👋 Hello, this issue has been automatically marked as stale because it has not had recent activity. Please note it will be closed if no further activity occurs.

Feel free to inform us of any other issues you discover or feature requests that come to mind in the future. Pull Requests (PRs) are also always welcomed!

Thank you for your contributions to YOLOv5 🚀 and Vision AI ⭐!

pderrenger commented 2 weeks ago

You're welcome! If you encounter any issues with disk caching or have further questions, feel free to ask.