tinyvision / DAMO-YOLO

DAMO-YOLO: a fast and accurate object detection method with some new techs, including NAS backbones, efficient RepGFPN, ZeroHead, AlignedOTA, and distillation enhancement.
Apache License 2.0
3.75k stars 470 forks

[Bug]: #147

Open sceddd opened 2 months ago

sceddd commented 2 months ago

Before Reporting

Search before reporting

OS

Ubuntu

Device

Colab T4

CUDA version

12.2

TensorRT version

No response

Python version

Python 3.10.12

PyTorch version

2.3.0+cu121

torchvision version

0.18.0+cu121

Describe the bug

I'm trying to fine-tune the model on a custom dataset in Colab:

!cd damo-yolo/ && export PYTHONPATH=/content/damo-yolo && CUDA_LAUNCH_BLOCKING=1 python -m torch.distributed.launch --nproc_per_node=1 tools/train.py --node_rank=0 -f configs/damoyolo_tinynasL20_T.py

and end up with this error:

To Reproduce

  1. Follow the README to set up a custom dataset.
  2. run this command: export PYTHONPATH=/content/damo-yolo && CUDA_LAUNCH_BLOCKING=1 python -m torch.distributed.launch --nproc_per_node=1 tools/train.py --node_rank=0 -f configs/damoyolo_tinynasL20_T.py

Hyper-parameters/Configs

-f configs/damoyolo_tinynasL20_T.py

nproc_per_node=1

Logs

/usr/local/lib/python3.10/dist-packages/torch/distributed/launch.py:183: FutureWarning: The module torch.distributed.launch is deprecated and will be removed in future. Use torchrun. Note that --use-env is set by default in torchrun. If your script expects --local-rank argument to be set, please change it to read from os.environ['LOCAL_RANK'] instead. See https://pytorch.org/docs/stable/distributed.html#launch-utility for further instructions

  warnings.warn(
usage: Damo-Yolo train parser [-h] [-f CONFIG_FILE] [--local_rank LOCAL_RANK] [--tea_config TEA_CONFIG] [--tea_ckpt TEA_CKPT] ...
Damo-Yolo train parser: error: unrecognized arguments: --local-rank=0 --node_rank=0
E0625 18:21:09.782000 137977072861184 torch/distributed/elastic/multiprocessing/api.py:826] failed (exitcode: 2) local_rank: 0 (pid: 9042) of binary: /usr/bin/python3
Traceback (most recent call last):
  File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launch.py", line 198, in <module>
    main()
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launch.py", line 194, in main
    launch(args)
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launch.py", line 179, in launch
    run(args)
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/run.py", line 870, in run
    elastic_launch(
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launcher/api.py", line 132, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launcher/api.py", line 263, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

tools/train.py FAILED

Failures:

------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-06-25_18:21:09
  host      : 64f93cf95e56
  rank      : 0 (local_rank: 0)
  exitcode  : 2 (pid: 9042)
  error_file:
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================

Screenshots

![image](https://github.com/tinyvision/DAMO-YOLO/assets/102157918/47bf87d9-019e-49d7-a428-5845878490e6)

datasets folder

Additional

No response

sceddd commented 2 months ago

It looks like there is a problem with the argument parsing. I had to edit tools/train.py to get the model to run: on line 31, change 'local_rank' to 'local-rank'.
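
For anyone hitting the same error: the log shows the launcher passing `--local-rank=0` while the parser only advertises `--local_rank`, and argparse treats those as different option strings. Below is a minimal sketch of a parser that accepts both spellings and falls back to the `LOCAL_RANK` environment variable, as the FutureWarning suggests. `make_parser` is a hypothetical stand-in, not the actual code in tools/train.py, and the `--config_file` long name is an assumption; only the parser name and the `-f` and `--local_rank` options come from the usage line in the log.

```python
import argparse
import os


def make_parser() -> argparse.ArgumentParser:
    # Hypothetical stand-in for the parser in tools/train.py;
    # only the options relevant to this issue are reproduced.
    parser = argparse.ArgumentParser("Damo-Yolo train parser")
    parser.add_argument("-f", "--config_file", type=str, default=None)
    parser.add_argument(
        "--local_rank",
        "--local-rank",  # spelling passed by the launcher in the log above
        dest="local_rank",
        type=int,
        # Fallback for launchers that only export the environment variable.
        default=int(os.environ.get("LOCAL_RANK", 0)),
    )
    return parser


# Both spellings now parse to the same attribute.
old_style = make_parser().parse_args(["--local_rank=0"])
new_style = make_parser().parse_args(["--local-rank=0"])
print(old_style.local_rank, new_style.local_rank)
```

Registering both option strings keeps the script working with launchers that send either spelling, which seems safer than renaming the flag outright. Note that the log also lists `--node_rank=0` as unrecognized: launcher options placed after `tools/train.py` are forwarded to the script, so that flag would likely need to come before the script path in the command.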