open-mmlab / mmdetection3d

OpenMMLab's next-generation platform for general 3D object detection.
https://mmdetection3d.readthedocs.io/en/latest/
Apache License 2.0

[Bug] ValueError: Unsupported nproc_per_node value: #2697

Open 1064783536 opened 1 year ago

1064783536 commented 1 year ago

Prerequisite

Task

I'm using the official example scripts/configs for the officially supported tasks/models/datasets.

Branch

main branch https://github.com/open-mmlab/mmdetection3d

Environment

Package   Version  Source
mmcv      2.0.0    https://github.com/open-mmlab/mmcv
mmdet     3.0.0    https://github.com/open-mmlab/mmdetection
mmdet3d   1.1.1    /home/aolei/10_1_mmdetection3d/mmdetection3d
mmengine  0.7.3    https://github.com/open-mmlab/mmengine

Reproduces the problem - code sample

When I run the command "./tools/dist_train.sh configs/pointpillars/pointpillars_dv_secfpn_8xb6-200e_kitti-3d-3class.py GPUS=2", it fails with "ValueError: Unsupported nproc_per_node value: GPUS=2".

Reproduces the problem - command or script

./tools/dist_train.sh configs/pointpillars/pointpillars_dv_secfpn_8xb6-200e_kitti-3d-3class.py GPUS=2

Reproduces the problem - error message

/home/aolei/anaconda3/envs/mmdetection3d_1/lib/python3.8/site-packages/torch/distributed/launch.py:180: FutureWarning: The module torch.distributed.launch is deprecated and will be removed in future. Use torchrun. Note that --use_env is set by default in torchrun. If your script expects --local_rank argument to be set, please change it to read from os.environ['LOCAL_RANK'] instead. See https://pytorch.org/docs/stable/distributed.html#launch-utility for further instructions
  warnings.warn(
Traceback (most recent call last):
  File "/home/aolei/anaconda3/envs/mmdetection3d_1/lib/python3.8/site-packages/torch/distributed/run.py", line 607, in determine_local_world_size
    return int(nproc_per_node)
ValueError: invalid literal for int() with base 10: 'GPUS=2'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/aolei/anaconda3/envs/mmdetection3d_1/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/aolei/anaconda3/envs/mmdetection3d_1/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/home/aolei/anaconda3/envs/mmdetection3d_1/lib/python3.8/site-packages/torch/distributed/launch.py", line 195, in <module>
    main()
  File "/home/aolei/anaconda3/envs/mmdetection3d_1/lib/python3.8/site-packages/torch/distributed/launch.py", line 191, in main
    launch(args)
  File "/home/aolei/anaconda3/envs/mmdetection3d_1/lib/python3.8/site-packages/torch/distributed/launch.py", line 176, in launch
    run(args)
  File "/home/aolei/anaconda3/envs/mmdetection3d_1/lib/python3.8/site-packages/torch/distributed/run.py", line 752, in run
    config, cmd, cmd_args = config_from_args(args)
  File "/home/aolei/anaconda3/envs/mmdetection3d_1/lib/python3.8/site-packages/torch/distributed/run.py", line 660, in config_from_args
    nproc_per_node = determine_local_world_size(args.nproc_per_node)
  File "/home/aolei/anaconda3/envs/mmdetection3d_1/lib/python3.8/site-packages/torch/distributed/run.py", line 625, in determine_local_world_size
    raise ValueError(f"Unsupported nproc_per_node value: {nproc_per_node}")
ValueError: Unsupported nproc_per_node value: GPUS=2

Additional information

No response

sunjiahao1999 commented 1 year ago

The correct command is "./tools/dist_train.sh configs/pointpillars/pointpillars_dv_secfpn_8xb6-200e_kitti-3d-3class.py 2". Pass the GPU count as a bare number (2), not as GPUS=2.
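To illustrate why "GPUS=2" is rejected, here is a minimal Python sketch of the two-step check that the traceback above implies: the launcher first tries int() on the value and, when that fails, raises a second ValueError from inside the except block, which is why two chained tracebacks appear. The function names resolve_nproc_per_node and try_resolve are hypothetical, and the keyword handling is a simplification rather than the actual torch.distributed.run code.

```python
import os


def resolve_nproc_per_node(value: str) -> int:
    """Simplified, hypothetical stand-in for the launcher's nproc_per_node check."""
    try:
        # A bare number such as "2" is accepted directly.
        return int(value)
    except ValueError:
        # Assumption: a small set of keywords is accepted; only "cpu" is modelled here.
        if value == "cpu":
            return os.cpu_count() or 1
        # Raising inside the except block chains the two errors, producing
        # "During handling of the above exception, another exception occurred".
        raise ValueError(f"Unsupported nproc_per_node value: {value}")


def try_resolve(value: str):
    """Return the resolved process count, or the error message on failure."""
    try:
        return resolve_nproc_per_node(value)
    except ValueError as exc:
        return str(exc)


print(try_resolve("2"))       # 2
print(try_resolve("GPUS=2"))  # Unsupported nproc_per_node value: GPUS=2
```

Under these assumptions, whatever is passed as the second argument of dist_train.sh reaches this check unchanged, so it has to be something int() can parse.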

1064783536 commented 1 year ago


Thank you very much for your reply. I ran the command "./tools/dist_train.sh configs/pointpillars/pointpillars_dv_secfpn_8xb6-200e_kitti-3d-3class.py 2" and got a new error, as follows:

/home/aolei/anaconda3/envs/mmdetection3d_1/lib/python3.8/site-packages/torch/distributed/launch.py:180: FutureWarning: The module torch.distributed.launch is deprecated and will be removed in future. Use torchrun. Note that --use_env is set by default in torchrun. If your script expects --local_rank argument to be set, please change it to read from os.environ['LOCAL_RANK'] instead. See https://pytorch.org/docs/stable/distributed.html#launch-utility for further instructions
  warnings.warn(
WARNING:torch.distributed.run:
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.


usage: train.py [-h] [--config CONFIG] [--work-dir WORK_DIR] [--amp] [--auto-scale-lr] [--resume [RESUME]] [--ceph] [--cfg-options CFG_OPTIONS [CFG_OPTIONS ...]] [--launcher {none,pytorch,slurm,mpi}] [--local_rank LOCAL_RANK]
train.py: error: unrecognized arguments: configs/pointpillars/pointpillars_dv_secfpn_8xb6-200e_kitti-3d-3class.py
usage: train.py [-h] [--config CONFIG] [--work-dir WORK_DIR] [--amp] [--auto-scale-lr] [--resume [RESUME]] [--ceph] [--cfg-options CFG_OPTIONS [CFG_OPTIONS ...]] [--launcher {none,pytorch,slurm,mpi}] [--local_rank LOCAL_RANK]
train.py: error: unrecognized arguments: configs/pointpillars/pointpillars_dv_secfpn_8xb6-200e_kitti-3d-3class.py
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 2) local_rank: 0 (pid: 135653) of binary: /home/aolei/anaconda3/envs/mmdetection3d_1/bin/python
Traceback (most recent call last):
  File "/home/aolei/anaconda3/envs/mmdetection3d_1/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/aolei/anaconda3/envs/mmdetection3d_1/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/home/aolei/anaconda3/envs/mmdetection3d_1/lib/python3.8/site-packages/torch/distributed/launch.py", line 195, in <module>
    main()
  File "/home/aolei/anaconda3/envs/mmdetection3d_1/lib/python3.8/site-packages/torch/distributed/launch.py", line 191, in main
    launch(args)
  File "/home/aolei/anaconda3/envs/mmdetection3d_1/lib/python3.8/site-packages/torch/distributed/launch.py", line 176, in launch
    run(args)
  File "/home/aolei/anaconda3/envs/mmdetection3d_1/lib/python3.8/site-packages/torch/distributed/run.py", line 753, in run
    elastic_launch(
  File "/home/aolei/anaconda3/envs/mmdetection3d_1/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 132, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/aolei/anaconda3/envs/mmdetection3d_1/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 246, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

./tools/train.py FAILED

Failures:
[1]:
  time       : 2023-08-25_13:38:07
  host       : Wslab
  rank       : 1 (local_rank: 1)
  exitcode   : 2 (pid: 135654)
  error_file : <N/A>
  traceback  : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html

Root Cause (first observed failure):
[0]:
  time       : 2023-08-25_13:38:07
  host       : Wslab
  rank       : 0 (local_rank: 0)
  exitcode   : 2 (pid: 135653)
  error_file : <N/A>
  traceback  : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
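A possible reading of the usage line above: it lists the config as an optional "--config CONFIG" flag, which suggests this local copy of train.py does not declare config as a positional argument, while dist_train.sh passes the config path positionally. The argparse sketch below reproduces that mismatch under this assumption. build_parser and the path configs/example.py are hypothetical and not taken from the repository, and parse_known_args is used only so the unrecognized argument can be inspected instead of exiting the process.

```python
import argparse


def build_parser(config_as_flag: bool) -> argparse.ArgumentParser:
    """Hypothetical, reduced train.py parser with only the arguments needed here."""
    parser = argparse.ArgumentParser(prog="train.py")
    if config_as_flag:
        # Matches the usage line in the error output: [--config CONFIG]
        parser.add_argument("--config")
    else:
        # Positional form: the config path is consumed without a flag.
        parser.add_argument("config")
    parser.add_argument(
        "--launcher",
        choices=["none", "pytorch", "slurm", "mpi"],
        default="none",
    )
    return parser


# Assumed shape of what the launcher hands to train.py: a bare config path.
argv = ["configs/example.py", "--launcher", "pytorch"]

# With --config declared as a flag, the bare path is left over as unrecognized.
_, leftover = build_parser(config_as_flag=True).parse_known_args(argv)
print(leftover)  # ['configs/example.py']

# With a positional config, the same argv parses cleanly.
args, leftover = build_parser(config_as_flag=False).parse_known_args(argv)
print(args.config, leftover)  # configs/example.py []
```

If that reading is right, comparing the local tools/train.py argument definitions with the ones on the main branch would be a reasonable next check.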