open-mmlab / OpenPCDet

OpenPCDet Toolbox for LiDAR-based 3D Object Detection.
Apache License 2.0

training on multiple gpu #1122

Closed · suvasis closed this 1 year ago

suvasis commented 2 years ago

hi,

I have a 2-GPU machine. Single-GPU training works fine, but with 2 GPUs the run seems to wait forever: GPU utilization stays at 0% on both GPUs.

I am using the KITTI dataset, prepared as described in the documentation.

The command I ran:

OpenPCDet/tools$ ./scripts/dist_train.sh 2 --batch_size 2 --epochs 1 --cfg_file cfgs/kitti_models/pv_rcnn.yaml

My code base is as of Sep 22, 2022.

Log snippet:

2022-09-23 10:21:21,551 INFO **Start training kitti_models/pv_rcnn(default)**
epochs: 0it [00:07, ?it/s]
2022-09-23 10:21:30,211 INFO **End training kitti_models/pv_rcnn(default)**

2022-09-23 10:21:30,212 INFO **Start evaluation kitti_models/pv_rcnn(default)**
2022-09-23 10:21:30,213 INFO Loading KITTI dataset
2022-09-23 10:21:30,304 INFO Total samples for KITTI dataset: 3769
Wait 30 seconds for next check (progress: 0.0 / 0 minutes): /home/minasm/suvasis/tools/pvrcnn/OpenPCDet/output/kitti_models/pv_rcnn/default/ckpt

LOG SNIPPET:

NGPUS=2

2022-09-23 10:21:30,212 INFO **Start evaluation kitti_models/pv_rcnn(default)**
2022-09-23 10:21:30,213 INFO Loading KITTI dataset
2022-09-23 10:21:30,304 INFO Total samples for KITTI dataset: 3769
Wait 30 seconds for next check (progress: 0.0 / 0 minutes): /home/minasm/suvasis/tools/pvrcnn/OpenPCDet/output/kitti_models/pv_rcnn/defaul
Wait 30 seconds for next check (progress: 0.5 / 0 minutes): /home/minasm/suvasis/tools/pvrcnn/OpenPCDet/output/kitti_models/pv_rcnn/defaul
Wait 30 seconds for next check (progress: 1.0 / 0 minutes): /home/minasm/suvasis/tools/pvrcnn/OpenPCDet/output/kitti_models/pv_rcnn/defaul
Wait 30 seconds for next check (progress: 1.5 / 0 minutes): /home/minasm/suvasis/tools/pvrcnn/OpenPCDet/output/kitti_models/pv_rcnn/defaul

jihanyang commented 2 years ago
2022-09-23 10:21:21,551 INFO Start training kitti_models/pv_rcnn(default)
epochs: 0it [00:07, ?it/s]
2022-09-23 10:21:30,211 INFO End training kitti_models/pv_rcnn(default)

As shown in your log, the training already finished in a previous launch.

suvasis commented 2 years ago

hi,

I followed this comment from https://github.com/open-mmlab/OpenPCDet/issues/938:

Have you tried to comment these two lines to quickly start the training?

#if mp.get_start_method(allow_none=True) is None:
#    mp.set_start_method('spawn')
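
For context, a minimal sketch of the kind of guard those two lines belong to, with its import added so it is self-contained; the placement near the top of the training entry point is an assumption, not copied from OpenPCDet:

import torch.multiprocessing as mp

# Hypothetical placement of the guard referenced in issue #938; check your own
# copy of the training script for the exact spot. Commenting out these two
# lines (the workaround above) lets worker processes fall back to the platform
# default start method ('fork' on Linux) instead of forcing 'spawn'.
if mp.get_start_method(allow_none=True) is None:
    mp.set_start_method('spawn')

print("current start method:", mp.get_start_method(allow_none=True))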

With those lines commented out, training works, but only with batch_size 2. If I increase batch_size to 4 or beyond, it fails (log snippet attached below). What should I do to fix this?

For batch_size=2, the run is successful. For batch_size=4, the training fails.

///////////////////////////////////////////////////////////////////////////////////////////

batch_size=2

///////////////////////////////////////////////////////////////////////////////////////////

command: (pytorchbuild) minasm@lambda-quad:~/suvasis/tools/pvrcnn/OpenPCDet/tools$ ./scripts/dist_train.sh 2 --batch_size 2 --epochs 10 --cfg_file cfgs/kitti_models/pv_rcnn.yaml

log:

Cyclist AP@0.50, 0.25, 0.25:
bbox AP:90.7043, 78.4946, 74.1905
bev  AP:90.0927, 74.9485, 70.5419
3d   AP:90.0911, 74.9443, 70.5419
aos  AP:90.47, 76.53, 72.11
Cyclist AP_R40@0.50, 0.25, 0.25:
bbox AP:94.0771, 79.2091, 75.3161
bev  AP:93.6473, 75.5676, 72.0880
3d   AP:93.6469, 75.5664, 72.0693
aos  AP:93.83, 77.14, 73.01

2022-09-23 14:07:09,004 INFO Result is save to /home/minasm/suvasis/tools/pvrcnn/OpenPCDet/output/kitti_models/pv_rcnn/default/eval/eval_with_train/epoch_10/val
2022-09-23 14:07:09,005 INFO ****Evaluation done.*****
2022-09-23 14:07:09,026 INFO Epoch 10 has been evaluated
Wait 30 seconds for next check (progress: 0.0 / 0 minutes): /home/minasm/suvasis/tools/pvrcnn/OpenPCDet/output/kitti_models/pv_rcnn/defaul
2022-09-23 14:07:39,058 INFO **End evaluation kitti_models/pv_rcnn(default)**

///////////////////////////////////////////////////////////////////////////////////////////

batch_size=4

///////////////////////////////////////////////////////////////////////////////////////////

command: OpenPCDet/tools$ ./scripts/dist_train.sh 2 --batch_size 4 --epochs 10 --cfg_file cfgs/kitti_models/pv_rcnn.yaml

Log:

epochs:   0%|          | 0/10 [00:00<?, ?it/s]
2022-09-23 17:46:02,631 INFO **Start training kitti_models/pv_rcnn(default)**
epochs:   0%|          | 0/10 [00:00<?, ?it/s]
/home/minasm/suvasis/tools/anaconda3/envs/pytorchbuild/lib/python3.9/site-packages/torch/autograd/__init__.py:173: UserWarning: Grad strides do not match bucket view strides. This may indicate grad was not created according to the gradient layout contract, or that the param's strides changed since DDP was constructed. This is not an error, but may impair performance.
grad.sizes() = [7, 256, 1], strides() = [256, 1, 256]
bucket_view.sizes() = [7, 256, 1], strides() = [256, 1, 1] (Triggered internally at /opt/conda/conda-bld/pytorch_1659484806139/work/torch/csrc/distributed/c10d/reducer.cpp:312.)
  Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
/home/minasm/suvasis/tools/anaconda3/envs/pytorchbuild/lib/python3.9/site-packages/torch/autograd/__init__.py:173: UserWarning: Grad strides do not match bucket view strides. This may indicate grad was not created according to the gradient layout contract, or that the param's strides changed since DDP was constructed. This is not an error, but may impair performance.
grad.sizes() = [7, 256, 1], strides() = [256, 1, 256]
bucket_view.sizes() = [7, 256, 1], strides() = [256, 1, 1] (Triggered internally at /opt/conda/conda-bld/pytorch_1659484806139/work/torch/csrc/distributed/c10d/reducer.cpp:312.)
  Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
2022-09-23 17:46:09,141 INFO epoch: 0/10, acc_iter=1, cur_iter=0/928, batch_size=2, time_cost(epoch): 00:06/1:35:24, time_cost(all): 00:06/15:54:01, loss=9.989889144897461, d_time=1.04(1.04), f_time=5.11(5.11), b_time=6.15(6.15), lr=0.0009999999999999992
Traceback (most recent call last):
  File "/home/minasm/suvasis/tools/pvrcnn/OpenPCDet/tools/train.py", line 221, in <module>
    main()
  File "/home/minasm/suvasis/tools/pvrcnn/OpenPCDet/tools/train.py", line 168, in main
    train_model(
  File "/home/minasm/suvasis/tools/pvrcnn/OpenPCDet/tools/train_utils/train_utils.py", line 150, in train_model
    accumulated_iter = train_one_epoch(
  File "/home/minasm/suvasis/tools/pvrcnn/OpenPCDet/tools/train_utils/train_utils.py", line 54, in train_one_epoch
    loss.backward()
  File "/home/minasm/suvasis/tools/anaconda3/envs/pytorchbuild/lib/python3.9/site-packages/torch/_tensor.py", line 396, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
  File "/home/minasm/suvasis/tools/anaconda3/envs/pytorchbuild/lib/python3.9/site-packages/torch/autograd/__init__.py", line 173, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
RuntimeError: CUDA out of memory. Tried to allocate 444.00 MiB (GPU 1; 7.80 GiB total capacity; 4.73 GiB already allocated; 418.81 MiB free; 5.81 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.
See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 97166 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 1 (pid: 97167) of binary: /home/minasm/suvasis/tools/anaconda3/envs/pytorchbuild/bin/python
Traceback (most recent call last):
  File "/home/minasm/suvasis/tools/anaconda3/envs/pytorchbuild/lib/python3.9/runpy.py", line 197, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/minasm/suvasis/tools/anaconda3/envs/pytorchbuild/lib/python3.9/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/home/minasm/suvasis/tools/anaconda3/envs/pytorchbuild/lib/python3.9/site-packages/torch/distributed/launch.py", line 193, in <module>
    main()
  File "/home/minasm/suvasis/tools/anaconda3/envs/pytorchbuild/lib/python3.9/site-packages/torch/distributed/launch.py", line 189, in main
    launch(args)
  File "/home/minasm/suvasis/tools/anaconda3/envs/pytorchbuild/lib/python3.9/site-packages/torch/distributed/launch.py", line 174, in launch
    run(args)
  File "/home/minasm/suvasis/tools/anaconda3/envs/pytorchbuild/lib/python3.9/site-packages/torch/distributed/run.py", line 752, in run
    elastic_launch(
  File "/home/minasm/suvasis/tools/anaconda3/envs/pytorchbuild/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/minasm/suvasis/tools/anaconda3/envs/pytorchbuild/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 245, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

train.py FAILED

Failures:

------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2022-09-23_17:46:15
  host      : lambda-quad
  rank      : 1 (local_rank: 1)
  exitcode  : 1 (pid: 97167)
  error_file:
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================

///////////////////////////machine details///////////////////

nvidia-smi
Fri Sep 23 17:50:22 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 515.65.01    Driver Version: 515.65.01    CUDA Version: 11.7     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  On   | 00000000:01:00.0 Off |                  N/A |
| 30%   39C    P8    15W / 220W |     67MiB /  8192MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA GeForce ...  On   | 00000000:21:00.0 Off |                  N/A |
| 30%   38C    P8    20W / 220W |      5MiB /  8192MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      1733      G   /usr/lib/xorg/Xorg                 56MiB |
|    0   N/A  N/A      1908      G   /usr/bin/gnome-shell                8MiB |
|    1   N/A  N/A      1733      G   /usr/lib/xorg/Xorg                  4MiB |
+-----------------------------------------------------------------------------+

tools$ nvcc -V
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2022 NVIDIA Corporation
Built on Wed_Jun__8_16:49:14_PDT_2022
Cuda compilation tools, release 11.7, V11.7.99
Build cuda_11.7.r11.7/compiler.31442593_0
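
One note on the numbers above: the failing log reports batch_size=2 per process even though --batch_size 4 was passed, which suggests the launcher splits the total batch size across the 2 GPUs; the OOM therefore happens at 2 samples per 8 GiB card, while the successful --batch_size 2 run keeps each GPU at 1 sample. As a hedged sketch (not something from this thread), the allocator hint in the OOM message can be tried by setting PYTORCH_CUDA_ALLOC_CONF before CUDA is initialized; the 128 MiB value below is only illustrative, not a value validated on this machine:

import os

# Hypothetical mitigation sketch for the fragmentation hint in the OOM message.
# The allocator option must be set before torch initializes CUDA; 128 MiB is an
# illustrative split size, not a tested recommendation for an 8 GiB GPU.
os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "max_split_size_mb:128")

import torch  # imported after the env var so the CUDA caching allocator sees it

if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0),
          "| allocator conf:", os.environ["PYTORCH_CUDA_ALLOC_CONF"])

Even with the allocator option, PV-RCNN at 2 samples per 8 GiB GPU may simply not fit, so keeping the total --batch_size at 2 (1 sample per GPU), which the log shows succeeding, is the safe fallback.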
github-actions[bot] commented 1 year ago

This issue is stale because it has been open for 30 days with no activity.