Before Reporting

[X] I have pulled the latest code from the main branch and run it again, and the bug still exists.
[X] I have read the README carefully and no error occurred during the installation process. (Otherwise, we recommend that you ask a question using the Question template.)

Search before reporting

[X] I have searched the DAMO-YOLO issues and found no similar bugs.

OS

Ubuntu

Device

Colab T4

CUDA version

12.2

TensorRT version

No response

Python version

Python 3.10.12

PyTorch version

2.3.0+cu121

torchvision version

0.18.0+cu121

Describe the bug

I'm trying to fine-tune the model on a custom dataset in Colab with this command:

!cd damo-yolo/ && export PYTHONPATH=/content/damo-yolo && CUDA_LAUNCH_BLOCKING=1 python -m torch.distributed.launch --nproc_per_node=1 tools/train.py --node_rank=0 -f configs/damoyolo_tinynasL20_T.py

It fails with `unrecognized arguments: --local-rank=0 --node_rank=0` (full output under Logs).

To Reproduce

Follow the README to set up a custom dataset.
Run this command: export PYTHONPATH=/content/damo-yolo && CUDA_LAUNCH_BLOCKING=1 python -m torch.distributed.launch --nproc_per_node=1 tools/train.py --node_rank=0 -f configs/damoyolo_tinynasL20_T.py
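One detail of this command may matter: `--node_rank=0` comes after `tools/train.py`, and the PyTorch launcher forwards everything after the script path to the script itself, which would explain why `--node_rank=0` shows up among the unrecognized arguments. The sketch below only illustrates the ordering. `build_launch_cmd` is a hypothetical helper, not part of DAMO-YOLO, and using `torch.distributed.run` (the module behind `torchrun`) instead of `torch.distributed.launch` is an assumption based on the deprecation warning in the logs. I have not confirmed that `tools/train.py` runs correctly under it.

```python
import shlex


def build_launch_cmd(script, script_args, nproc_per_node=1, node_rank=0,
                     launcher_module="torch.distributed.run"):
    """Build an argv list with launcher options placed BEFORE the script path.

    Hypothetical helper for illustration only. Options that appear after the
    script path are passed to the script, not to the launcher.
    """
    launcher = [
        "python", "-m", launcher_module,
        f"--nproc_per_node={nproc_per_node}",
        f"--node_rank={node_rank}",
    ]
    return launcher + [script] + list(script_args)


cmd = build_launch_cmd("tools/train.py",
                       ["-f", "configs/damoyolo_tinynasL20_T.py"])
# Print a shell-ready command line that can be pasted into a Colab cell.
print(shlex.join(cmd))
```

With this ordering only `-f configs/damoyolo_tinynasL20_T.py` reaches the training script.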

Hyper-parameters/Configs

-f configs/damoyolo_tinynasL20_T.py

nproc_per_node=1

Logs

/usr/local/lib/python3.10/dist-packages/torch/distributed/launch.py:183: FutureWarning: The module torch.distributed.launch is deprecated and will be removed in future. Use torchrun. Note that --use-env is set by default in torchrun. If your script expects --local-rank argument to be set, please change it to read from os.environ['LOCAL_RANK'] instead. See https://pytorch.org/docs/stable/distributed.html#launch-utility for further instructions

warnings.warn(
usage: Damo-Yolo train parser [-h] [-f CONFIG_FILE] [--local_rank LOCAL_RANK] [--tea_config TEA_CONFIG] [--tea_ckpt TEA_CKPT] ...
Damo-Yolo train parser: error: unrecognized arguments: --local-rank=0 --node_rank=0
E0625 18:21:09.782000 137977072861184 torch/distributed/elastic/multiprocessing/api.py:826] failed (exitcode: 2) local_rank: 0 (pid: 9042) of binary: /usr/bin/python3
Traceback (most recent call last):
  File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launch.py", line 198, in <module>
    main()
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launch.py", line 194, in main
    launch(args)
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launch.py", line 179, in launch
    run(args)
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/run.py", line 870, in run
    elastic_launch(
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launcher/api.py", line 132, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launcher/api.py", line 263, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
tools/train.py FAILED
------------------------------------------------------------
Failures:
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-06-25_18:21:09
  host      : 64f93cf95e56
  rank      : 0 (local_rank: 0)
  exitcode  : 2 (pid: 9042)
  error_file:
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================

Screenshots

![image](https://github.com/tinyvision/DAMO-YOLO/assets/102157918/47bf87d9-019e-49d7-a428-5845878490e6)

datasets folder

Additional

No response
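A possible reading of the log: the launcher in this PyTorch version passes `--local-rank=0` (with a hyphen), while the usage line shows the parser only knows `--local_rank` (with an underscore). The sketch below shows one way a parser could accept both spellings and fall back to the `LOCAL_RANK` environment variable mentioned in the FutureWarning. It is not the actual DAMO-YOLO parser: `make_parser` is a hypothetical name, and the long option `--config_file` is an assumption inferred from the `CONFIG_FILE` metavar in the usage output.

```python
import argparse
import os


def make_parser():
    """Hypothetical stand-in for the train parser; not the real DAMO-YOLO code."""
    parser = argparse.ArgumentParser("Damo-Yolo train parser")
    # Assumed long name; the log only shows "-f CONFIG_FILE".
    parser.add_argument("-f", "--config_file", type=str, default=None)
    # Accept both the underscore and the hyphen spelling, and default to the
    # LOCAL_RANK environment variable that torchrun sets for each worker.
    parser.add_argument(
        "--local_rank", "--local-rank",
        dest="local_rank",
        type=int,
        default=int(os.environ.get("LOCAL_RANK", 0)),
    )
    return parser


# Simulate the arguments the launcher appended in the failing run.
args = make_parser().parse_args(
    ["-f", "configs/damoyolo_tinynasL20_T.py", "--local-rank=0"]
)
print(args.config_file, args.local_rank)
```

Reading the rank from the environment keeps the script working whether or not the launcher passes the flag.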

tinyvision / DAMO-YOLO

[Bug]: #147