Closed NLCharles closed 3 years ago
Have you successfully trained with this repo before? I haven't seen other users report this error.
It confuses me too. I trained a model on a single GPU, and things looked pretty good. Did you see anything weird in my script?
No, I didn't see any problems in the script. What's the latest commit number of your OpenPCDet?
I pulled the latest commit, 7bc7e551f963be782aca0cda1e19f20e8b2ac93c, which was updated on Nov 8, 2020. I tested single-GPU training on a 2080 Ti, and both single-GPU and multi-GPU training on a Tesla V100 server with 8 GPUs, each with 32 GB of video memory. Only multi-GPU training failed. When the script reaches the lines above, all GPUs sit at 0% utilization with around 800 MB of video memory occupied, while the CPU cores are fully loaded. I have no idea what the script is doing and am still working on it. For now I am trying to wrap the model in DataParallel instead of DistributedDataParallel. I would appreciate it if you could share your device settings.
I tested on my DGX Station with 8 Tesla V100s: Ubuntu 16.04, CUDA=10.0, python=3.7, pytorch=1.3, NVIDIA driver 418.87.01.
I reinstalled the latest commit and tested your default data and network settings on KITTI, but training still did not start.
The logger stopped at this place:
2020-11-10 19:29:33,801 INFO cfg.EXP_GROUP_PATH: kitti_models
2020-11-10 19:29:33,952 INFO Database filter by min points Car: 14357 => 13532
2020-11-10 19:29:33,952 INFO Database filter by min points Pedestrian: 2207 => 2168
2020-11-10 19:29:33,953 INFO Database filter by min points Cyclist: 734 => 705
2020-11-10 19:29:33,970 INFO Database filter by difficulty Car: 13532 => 10759
2020-11-10 19:29:33,973 INFO Database filter by difficulty Pedestrian: 2168 => 2075
2020-11-10 19:29:33,974 INFO Database filter by difficulty Cyclist: 705 => 581
2020-11-10 19:29:34,072 INFO Loading KITTI dataset
2020-11-10 19:29:34,180 INFO Total samples for KITTI dataset: 3712
I have just tested the latest code and it trains fine with multiple GPUs. Maybe you could try creating a new Python environment. I have tested it with pytorch=1.1, CUDA=9.0, python=3.7, and with pytorch=1.5, CUDA=10.1, python=3.7.
OK, I suppose something is wrong with my hardware. I will reconfigure my environment and try updating my video driver too. Thank you for the information!
I used a KeyboardInterrupt to stop the program while it was hung. Here is the traceback:
File ".../software/Anaconda3/envs/cu10e2/lib/python3.7/runpy.py", line 193, in _run_module_as_main
"__main__", mod_spec)
File ".../software/Anaconda3/envs/cu10e2/lib/python3.7/runpy.py", line 85, in _run_code
exec(code, run_globals)
File ".../software/Anaconda3/envs/cu10e2/lib/python3.7/site-packages/torch/distributed/launch.py", line 263, in <module>
main()
File ".../software/Anaconda3/envs/cu10e2/lib/python3.7/site-packages/torch/distributed/launch.py", line 256, in main
process.wait()
File ".../software/Anaconda3/envs/cu10e2/lib/python3.7/subprocess.py", line 1019, in wait
return self._wait(timeout=timeout)
File ".../software/Anaconda3/envs/cu10e2/lib/python3.7/subprocess.py", line 1653, in _wait
(pid, sts) = self._try_wait(0)
File ".../software/Anaconda3/envs/cu10e2/lib/python3.7/subprocess.py", line 1611, in _try_wait
(pid, sts) = os.waitpid(self.pid, wait_flags)
KeyboardInterrupt
Apparently the script is waiting for something. It would wait indefinitely (more than a day) until I terminated it. Update: I am looking through the PyTorch issues for advice. Some people have run into the same issue.
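The traceback above comes from the launcher process, which only blocks in `process.wait()` on its children, so it does not show where the workers themselves are stuck. One possible way to see that, sketched here under the assumption that a few lines can be added near the top of the training script, is Python's standard `faulthandler` module, which can dump every thread's stack after a timeout. The helper names `enable_hang_dump` and `disable_hang_dump` and the timeout value are illustrative and not part of OpenPCDet.

```python
import faulthandler
import sys


def enable_hang_dump(timeout_s=600.0, file=sys.stderr, repeat=True):
    """Dump all Python thread stacks to `file` if the process is still
    running after `timeout_s` seconds (hypothetical helper)."""
    # The dump is written by a watchdog thread, so it can still fire while
    # the main thread is blocked, e.g. inside a collective call.
    faulthandler.dump_traceback_later(timeout_s, repeat=repeat, file=file)


def disable_hang_dump():
    """Cancel the pending dump, e.g. once training is clearly progressing."""
    faulthandler.cancel_dump_traceback_later()
```

Calling `enable_hang_dump()` in each worker before the process group is initialized should make a hung worker print its own stack, which is usually more informative than the launcher's.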
I have figured out what was going on with the problem. In my script,
python -m torch.distributed.launch --nproc_per_node=6 train.py --launcher pytorch --cfg_file cfgs/my_models/pv_rcnn.yaml
I assigned 6 of the 8 GPUs for training. However, the script automatically set CUDA_VISIBLE_DEVICES to all GPUs and waited for the 2 unassigned GPUs to load. Unfortunately, the script still hangs when it reaches this point:
2020-11-13 15:50:28,736 INFO **********************Start training my_models/pv_rcnn(default)**********************
epochs: 0%| | 0/50 [00:00<?, ?it/s]
Traceback (most recent call last):
File ".../software/Anaconda3/envs/cu10e2/lib/python3.7/runpy.py", line 193, in _run_module_as_main
"__main__", mod_spec)
File ".../software/Anaconda3/envs/cu10e2/lib/python3.7/runpy.py", line 85, in _run_code
exec(code, run_globals)
File ".../software/Anaconda3/envs/cu10e2/lib/python3.7/site-packages/torch/distributed/launch.py", line 263, in <module>
main()
File ".../software/Anaconda3/envs/cu10e2/lib/python3.7/site-packages/torch/distributed/launch.py", line 256, in main
process.wait()
File ".../software/Anaconda3/envs/cu10e2/lib/python3.7/subprocess.py", line 1019, in wait
return self._wait(timeout=timeout)
File ".../software/Anaconda3/envs/cu10e2/lib/python3.7/subprocess.py", line 1653, in _wait
(pid, sts) = self._try_wait(0)
File ".../software/Anaconda3/envs/cu10e2/lib/python3.7/subprocess.py", line 1611, in _try_wait
(pid, sts) = os.waitpid(self.pid, wait_flags)
It seems it is still waiting for something that will never happen.
Hello, Shaoshuai. I have tested your model for quite some time, and I believe the problem is due to my reimplementation of the dataset function, probably the batch generation part. Could you please share how you handle scenes with different numbers of gt_boxes? Unaligned batch sizes may cause serious problems in parallel training.
I just pad them with extra zeros, as here:
https://github.com/open-mmlab/OpenPCDet/blob/master/pcdet/datasets/dataset.py#L169-L174
And remove the zeros before using them, as here: https://github.com/open-mmlab/OpenPCDet/blob/master/pcdet/models/roi_heads/target_assigner/proposal_target_layer.py#L92-L95
Hope it is helpful.
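For readers who cannot open the links, the idea of padding with zeros and stripping them again can be sketched roughly as follows. This is a minimal NumPy illustration, not the code behind the links; the function names `pad_gt_boxes` and `strip_padding` are made up for this example, and it assumes a real box never consists entirely of zeros.

```python
import numpy as np


def pad_gt_boxes(gt_boxes_list):
    """Stack per-scene (N_i, C) box arrays into one (B, max_N, C) array,
    filling the unused rows with zeros."""
    max_gt = max(len(boxes) for boxes in gt_boxes_list)
    num_cols = gt_boxes_list[0].shape[-1]
    batch = np.zeros((len(gt_boxes_list), max_gt, num_cols), dtype=np.float32)
    for i, boxes in enumerate(gt_boxes_list):
        batch[i, :len(boxes)] = boxes
    return batch


def strip_padding(boxes):
    """Drop trailing all-zero rows from one scene's (max_N, C) array."""
    k = len(boxes) - 1
    # Walk back from the end while the row is pure padding.
    while k >= 0 and not boxes[k].any():
        k -= 1
    return boxes[:k + 1]
```

With this layout every scene in a batch has the same `gt_boxes` shape, and each consumer recovers the true boxes per scene by calling `strip_padding(batch[i])`.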
Thank you very much. I suppose you implemented this in DatasetTemplate.collate_fn(), so I just get it for free by inheriting your class.
I launched the distributed training without torch.distributed.launch, and instead initialized the process group in the main process. My launcher, in train_dist.py, looks like:
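import torch
from torch import distributed, nn

# Single-machine setup: the process group contains only this process.
distributed.init_process_group(
    backend='nccl',
    init_method='tcp://127.0.0.1:18888',
    world_size=1,
    rank=0,
)
device = torch.device('cuda')
model = Net()  # Net is my own model class, defined elsewhere
if distributed.is_initialized():
    model.to(device)
    model = nn.parallel.DistributedDataParallel(model)
# ... plus some further modifications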
I wrote a CNN and tested the launcher with it on 1, 4, 6, and 8 GPUs, using the MNIST dataset from torchvision, and it clearly worked (on a single machine, hence world_size=1). What troubles me is the following error during training, which occurs after the dataset is loaded and both the epoch and train-sample tqdm bars have appeared.
IndexError: Caught IndexError in replica 0 on device 0.
Original Traceback (most recent call last):
File ".../software/Anaconda3/envs/cu10e2/lib/python3.7/site-packages/torch/nn/parallel/parallel_apply.py", line 60, in _worker
output = module(*input, **kwargs)
File ".../software/Anaconda3/envs/cu10e2/lib/python3.7/site-packages/torch/nn/modules/module.py", line 532, in __call__
result = self.forward(*input, **kwargs)
File ".../models/OpenPCDet-master/pcdet/models/detectors/pv_rcnn.py", line 11, in forward
batch_dict = cur_module(batch_dict)
File ".../software/Anaconda3/envs/cu10e2/lib/python3.7/site-packages/torch/nn/modules/module.py", line 532, in __call__
result = self.forward(*input, **kwargs)
File ".../models/OpenPCDet-master/pcdet/models/backbones_3d/pfe/voxel_set_abstraction.py", line 177, in forward
keypoints = self.get_sampled_points(batch_dict)
File ".../models/OpenPCDet-master/pcdet/models/backbones_3d/pfe/voxel_set_abstraction.py", line 147, in get_sampled_points
keypoints = sampled_points[0][cur_pt_idxs[0]].unsqueeze(dim=0)
IndexError: index is out of bounds for dimension with size 0
I find this error mostly occurs with a dataset that lacks a proper collate_fn, so my conclusion is that my dataset function is flawed. Also, do not forget that even with your default KITTI settings, the script will hang if --nproc_per_node does not equal the number of visible GPUs.
Thank you again for your quick and helpful reply. I will try to fix the bugs as much as I can.
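The hang described above when `--nproc_per_node` differs from the number of visible GPUs could be turned into an early error with a check along these lines. This is only a sketch: `visible_gpu_count` and `check_nproc` are hypothetical helpers, and the parsing of `CUDA_VISIBLE_DEVICES` is deliberately simplified (it just counts comma-separated entries). In a real script the visible count would more likely come from `torch.cuda.device_count()`; plain parsing is used here to keep the example dependency-free.

```python
import os


def visible_gpu_count(total_gpus, env=None):
    """Number of GPUs a process can see, given the machine's GPU count and
    CUDA_VISIBLE_DEVICES (unset means every GPU is visible)."""
    env = os.environ if env is None else env
    value = env.get("CUDA_VISIBLE_DEVICES")
    if value is None:
        return total_gpus
    # Simplified parsing: count the non-empty comma-separated entries.
    return len([v for v in value.split(",") if v.strip()])


def check_nproc(nproc_per_node, total_gpus, env=None):
    """Raise early, instead of hanging later, if the requested process
    count and the visible GPU count disagree."""
    visible = visible_gpu_count(total_gpus, env)
    if nproc_per_node != visible:
        raise ValueError(
            f"--nproc_per_node={nproc_per_node} but {visible} GPUs are visible; "
            "set CUDA_VISIBLE_DEVICES to match"
        )
    return visible
```

On the shell side, the matching step for the 6-of-8 case would presumably be to prefix the launch command with `CUDA_VISIBLE_DEVICES=0,1,2,3,4,5`, so that exactly six devices are visible.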
Could you print out src_points.shape, sampled_points.shape, and cur_pt_idxs.shape before this error line?
It may be due to the data in your point cloud scenes; I don't check whether the data is valid here.
Hello, here is the printed output, in order:
torch.Size([25222, 3]) torch.Size([1, 25222, 3]) torch.Size([1, 2048])
torch.Size([25222, 3]) torch.Size([1, 0, 3]) torch.Size([1, 2048])
So should I discard some samples?
The second line means that one of your samples doesn't have any points. (bs_mask[:] == 0 for this sample). You need to check your data to see whether there is a scene that doesn't have any points. Maybe I should add an assertion here.
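The assertion mentioned above might look something like the following. It is a NumPy sketch rather than the actual OpenPCDet code, and it assumes that column 0 of the batched `points` array holds each point's integer batch index; the function names are illustrative.

```python
import numpy as np


def points_per_sample(points, batch_size):
    """Count points per sample, assuming column 0 of `points` is the
    batch index of each point (an assumption about the layout)."""
    batch_idx = points[:, 0].astype(np.int64)
    return np.bincount(batch_idx, minlength=batch_size)


def assert_no_empty_sample(points, batch_size):
    """Fail with a readable message if any sample in the batch is empty."""
    counts = points_per_sample(points, batch_size)
    empty = np.flatnonzero(counts == 0)
    assert empty.size == 0, f"samples {empty.tolist()} in this batch have no points"
    return counts
```

Running such a check before keypoint sampling would replace the opaque `IndexError` with a message that names the empty sample.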
However, I trained on these samples with a single GPU and one epoch finished. That is what I am most afraid of: every part works fine (supposedly), and yet a wrong output causes a crash. I resorted to nn.DataParallel, and in that mode the error messages above are thrown, as you can see.
I use a manually annotated dataset, which has 400k samples in 8 categories and has been tested with other models. Without distributed training it would be impossible to make use of this data.
Setting the visible device to 1 with nn.DataParallel wrapping the model produces a correct training sequence:
epochs: 0%| | 0/50 [00:00<?, ?it/s]torch.Size([213432, 3]) torch.Size([1, 126306, 3]) torch.Size([1, 2048]) | 0/18000 [00:00<?, ?it/s]
torch.Size([213432, 3]) torch.Size([1, 87126, 3]) torch.Size([1, 2048])
epochs: 0%| | 0/50 [00:16<?, ?it/s, loss=71.8, lr=0.001]torch.Size([217809, 3]) torch.Size([1, 154030, 3]) torch.Size([1, 2048])[00:12<64:26:42, 12.89s/it, total_it=1]
torch.Size([217809, 3]) torch.Size([1, 63779, 3]) torch.Size([1, 2048])
epochs: 0%| | 0/50 [00:18<?, ?it/s, loss=10.5, lr=0.001]torch.Size([233962, 3]) torch.Size([1, 125349, 3]) torch.Size([1, 2048])[00:15<48:20:26, 9.67s/it, total_it=2]
torch.Size([233962, 3]) torch.Size([1, 108613, 3]) torch.Size([1, 2048])
epochs: 0%| | 0/50 [00:21<?, ?it/s, loss=7.11, lr=0.001]torch.Size([233235, 3]) torch.Size([1, 137783, 3]) torch.Size([1, 2048])[00:17<37:03:39, 7.41s/it, total_it=3]
torch.Size([233235, 3]) torch.Size([1, 95452, 3]) torch.Size([1, 2048])
Maybe you could check whether the error occurs at the same sample each time, and plot that sample to see if it is valid, since I still think the problem is tied to specific data.
I have the same situation: the wait time is very long. But on my server, multi-GPU training did eventually start.
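To look for a problematic sample offline rather than waiting for the crash, a scan over the dataset could be sketched as below. It assumes that `dataset[i]` returns a dict with a `'points'` entry of shape (N, C), which may not match a custom dataset class; `find_empty_samples` is a hypothetical helper.

```python
def find_empty_samples(dataset, min_points=1):
    """Return the indices of samples whose point cloud has fewer than
    `min_points` points. Assumes dataset[i]['points'] supports len()."""
    bad = []
    for idx in range(len(dataset)):
        if len(dataset[idx]["points"]) < min_points:
            bad.append(idx)
    return bad
```

The returned indices can then be loaded and plotted individually to judge whether those scenes are valid.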
I tried this script
My training log showed that the run was stuck. It seemed to take forever to proceed after finishing the following line in train.py
No further lines are executed. Is this a bug?