open-mmlab / OpenPCDet

OpenPCDet Toolbox for LiDAR-based 3D Object Detection.
Apache License 2.0

Failed to train on multiple gpus! #346

Closed NLCharles closed 3 years ago

NLCharles commented 3 years ago

I tried this script:

python -m torch.distributed.launch --nproc_per_node=6 train.py --launcher pytorch --cfg_file cfgs/my_models/pv_rcnn.yaml

My training logger showed that the training sequence was stuck. It seems to take forever to proceed after it finishes the following lines in train.py:

    if args.ckpt is not None:
        it, start_epoch = model.load_params_with_optimizer(args.ckpt, to_cpu=dist, optimizer=optimizer, logger=logger)
        last_epoch = start_epoch + 1
    else:
        ckpt_list = glob.glob(str(ckpt_dir / '*checkpoint_epoch_*.pth'))
        if len(ckpt_list) > 0:
            ckpt_list.sort(key=os.path.getmtime)
            it, start_epoch = model.load_params_with_optimizer(
                ckpt_list[-1], to_cpu=dist, optimizer=optimizer, logger=logger
            )
            last_epoch = start_epoch + 1

No further lines are executed. Is this a bug?

sshaoshuai commented 3 years ago

Have you successfully trained this repo before? I haven't seen other users report this error.

NLCharles commented 3 years ago

It confuses me too. I trained a model with a single GPU and things looked pretty good. Do you see anything weird in my script?

sshaoshuai commented 3 years ago

No, I didn't see any problems in the script. What's the latest commit number of your OpenPCDet?

NLCharles commented 3 years ago

I pulled the latest commit, 7bc7e551f963be782aca0cda1e19f20e8b2ac93c, which was updated on Nov 8, 2020. I tested single-GPU training on a 2080 Ti, and both single-GPU and multi-GPU training on a Tesla V100 server with 8 GPUs, each with 32 GB of video memory. Only multi-GPU training failed. When the script reaches the lines above, all GPUs sit at 0% utilization with around 800 MB of video memory occupied, while the CPU cores are fully occupied. I have no idea what the script is doing; I am still working on it. For now I am trying to wrap the model as a DataParallel object instead of a DistributedDataParallel one. I would appreciate it if you could share your device settings.

NLCharles commented 3 years ago

I tested on my DGX Station with 8 Tesla V100s, running Ubuntu 16.04 with CUDA 10.0, Python 3.7, PyTorch 1.3, and NVIDIA driver 418.87.01.

I reinstalled the latest commit and tested your default data and network settings on KITTI, and it still didn't start training.

The log stops at this point:

2020-11-10 19:29:33,801   INFO  cfg.EXP_GROUP_PATH: kitti_models
2020-11-10 19:29:33,952   INFO  Database filter by min points Car: 14357 => 13532
2020-11-10 19:29:33,952   INFO  Database filter by min points Pedestrian: 2207 => 2168
2020-11-10 19:29:33,953   INFO  Database filter by min points Cyclist: 734 => 705
2020-11-10 19:29:33,970   INFO  Database filter by difficulty Car: 13532 => 10759
2020-11-10 19:29:33,973   INFO  Database filter by difficulty Pedestrian: 2168 => 2075
2020-11-10 19:29:33,974   INFO  Database filter by difficulty Cyclist: 705 => 581
2020-11-10 19:29:34,072   INFO  Loading KITTI dataset
2020-11-10 19:29:34,180   INFO  Total samples for KITTI dataset: 3712

sshaoshuai commented 3 years ago

I have just tested the latest code and it trains well with multiple GPUs. Maybe you could try creating a new Python environment. I have tested it with PyTorch 1.1 / CUDA 9.0 / Python 3.7, and with PyTorch 1.5 / CUDA 10.1 / Python 3.7.

NLCharles commented 3 years ago

> I have just tested the latest code and it trains well with multiple GPUs. Maybe you could try creating a new Python environment. I have tested it with PyTorch 1.1 / CUDA 9.0 / Python 3.7, and with PyTorch 1.5 / CUDA 10.1 / Python 3.7.

OK, I suppose something is wrong with my hardware. I will reconfigure my environment and try to update my video driver too. Thank you for the information!

NLCharles commented 3 years ago

I used a KeyboardInterrupt to stop the program while it was stuck. Here is the traceback:

File ".../software/Anaconda3/envs/cu10e2/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File ".../software/Anaconda3/envs/cu10e2/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File ".../software/Anaconda3/envs/cu10e2/lib/python3.7/site-packages/torch/distributed/launch.py", line 263, in <module>
    main()
  File ".../software/Anaconda3/envs/cu10e2/lib/python3.7/site-packages/torch/distributed/launch.py", line 256, in main
    process.wait()
  File ".../software/Anaconda3/envs/cu10e2/lib/python3.7/subprocess.py", line 1019, in wait
    return self._wait(timeout=timeout)
  File ".../software/Anaconda3/envs/cu10e2/lib/python3.7/subprocess.py", line 1653, in _wait
    (pid, sts) = self._try_wait(0)
  File ".../software/Anaconda3/envs/cu10e2/lib/python3.7/subprocess.py", line 1611, in _try_wait
    (pid, sts) = os.waitpid(self.pid, wait_flags)
KeyboardInterrupt

Apparently, the script is just waiting for something. It will wait for as long as it can (more than a day) until I terminate it. Update: I am looking through PyTorch issues for advice; some people have run into the same issue.
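
One way to get more visibility into what the launch is stuck on is to enable NCCL's debug output before launching (these are standard NCCL environment variables, not anything OpenPCDet-specific; just a diagnostic sketch, with no guarantee it reveals the cause here):

NCCL_DEBUG=INFO NCCL_DEBUG_SUBSYS=INIT python -m torch.distributed.launch --nproc_per_node=6 train.py --launcher pytorch --cfg_file cfgs/my_models/pv_rcnn.yaml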

NLCharles commented 3 years ago

I have figured out part of what was going on. In my script,

python -m torch.distributed.launch --nproc_per_node=6 train.py --launcher pytorch --cfg_file cfgs/my_models/pv_rcnn.yaml

I assigned 6 of the 8 GPUs for training. However, the script automatically set CUDA_VISIBLE_DEVICES to all of them and waited for the 2 unavailable GPUs to join (a workaround sketch is shown after the traceback below). Unfortunately, the script still hangs when it reaches this point:

2020-11-13 15:50:28,736   INFO  **********************Start training my_models/pv_rcnn(default)**********************
epochs:   0%|                                            | 0/50 [00:00<?, ?it/s]
Traceback (most recent call last):
  File ".../software/Anaconda3/envs/cu10e2/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File ".../software/Anaconda3/envs/cu10e2/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File ".../software/Anaconda3/envs/cu10e2/lib/python3.7/site-packages/torch/distributed/launch.py", line 263, in <module>
    main()
  File ".../software/Anaconda3/envs/cu10e2/lib/python3.7/site-packages/torch/distributed/launch.py", line 256, in main
    process.wait()
  File ".../software/Anaconda3/envs/cu10e2/lib/python3.7/subprocess.py", line 1019, in wait
    return self._wait(timeout=timeout)
  File ".../software/Anaconda3/envs/cu10e2/lib/python3.7/subprocess.py", line 1653, in _wait
    (pid, sts) = self._try_wait(0)
  File ".../software/Anaconda3/envs/cu10e2/lib/python3.7/subprocess.py", line 1611, in _try_wait
    (pid, sts) = os.waitpid(self.pid, wait_flags)

It seems it is still waiting for something that will never happen.
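
A workaround sketch for the device-count mismatch: restrict the visible GPUs so that their number matches --nproc_per_node before launching (the device IDs below are only an example):

CUDA_VISIBLE_DEVICES=0,1,2,3,4,5 python -m torch.distributed.launch --nproc_per_node=6 train.py --launcher pytorch --cfg_file cfgs/my_models/pv_rcnn.yaml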

NLCharles commented 3 years ago

Hello, Shaoshuai. I have tested your model for quite some time, and I believe the problem is due to my reimplementation of the dataset function, probably the batch generation part. Could you please share how you handle scenes with different numbers of gt_boxes? Unaligned batch sizes may cause serious problems in parallel training.

sshaoshuai commented 3 years ago

I just align them by padding with extra zeros, as here:

https://github.com/open-mmlab/OpenPCDet/blob/master/pcdet/datasets/dataset.py#L169-L174

And then use them by removing the zero rows, as here: https://github.com/open-mmlab/OpenPCDet/blob/master/pcdet/models/roi_heads/target_assigner/proposal_target_layer.py#L92-L95

Hope it is helpful.
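
For context, a minimal sketch of that pad-then-strip pattern (illustrative only, not the exact OpenPCDet code; the function names here are made up):

    import numpy as np

    def pad_gt_boxes(gt_boxes_list, box_dim=7):
        # Pad each sample's (N_i, box_dim) array with zero rows so the whole
        # batch stacks into a single (batch_size, N_max, box_dim) array.
        max_gt = max(len(boxes) for boxes in gt_boxes_list)
        batch = np.zeros((len(gt_boxes_list), max_gt, box_dim), dtype=np.float32)
        for i, boxes in enumerate(gt_boxes_list):
            batch[i, :len(boxes), :] = boxes
        return batch

    def strip_zero_padding(padded_boxes):
        # Recover the real boxes of one sample by dropping the all-zero padding rows.
        return padded_boxes[padded_boxes.sum(axis=-1) != 0]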

NLCharles commented 3 years ago

Thank you very much. I suppose this is handled in DatasetTemplate.collate_fn(), so I get it for free by inheriting from your class.

I launched distributed training without torch.distributed.launch; instead, I initialized the process group from a main process. My launcher, in train_dist.py, looks like this:

    import torch
    import torch.distributed as distributed
    import torch.nn as nn

    distributed.init_process_group(
        backend='nccl',
        init_method='tcp://127.0.0.1:18888',
        world_size=1,
        rank=0,
    )
    device = torch.device('cuda')
    model = Net()  # my own test CNN
    if distributed_is_initialized():
        model.to(device)
        model = nn.parallel.DistributedDataParallel(model)
    # ... plus some further modifications

I wrote a CNN and tested the launcher with 1, 4, 6, and 8 GPUs, using the MNIST dataset from torchvision, and it clearly worked (for a single machine, i.e. world_size=1). What troubles me is the following error during training, which occurs after the dataset is loaded and both the epoch and train-sample tqdm bars are shown.

IndexError: Caught IndexError in replica 0 on device 0.
Original Traceback (most recent call last):
  File ".../software/Anaconda3/envs/cu10e2/lib/python3.7/site-packages/torch/nn/parallel/parallel_apply.py", line 60, in _worker
    output = module(*input, **kwargs)
  File ".../software/Anaconda3/envs/cu10e2/lib/python3.7/site-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File ".../models/OpenPCDet-master/pcdet/models/detectors/pv_rcnn.py", line 11, in forward
    batch_dict = cur_module(batch_dict)
  File ".../software/Anaconda3/envs/cu10e2/lib/python3.7/site-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File ".../models/OpenPCDet-master/pcdet/models/backbones_3d/pfe/voxel_set_abstraction.py", line 177, in forward
    keypoints = self.get_sampled_points(batch_dict)
  File ".../models/OpenPCDet-master/pcdet/models/backbones_3d/pfe/voxel_set_abstraction.py", line 147, in get_sampled_points
    keypoints = sampled_points[0][cur_pt_idxs[0]].unsqueeze(dim=0)
IndexError: index is out of bounds for dimension with size 0

I find this error mostly occurs with a dataset that lacks a proper collate_fn, so my conclusion is that my dataset function is flawed. Also, do not forget that even with your default KITTI setting, the script will hang if --nproc_per_node is not equal to the number of visible GPUs.

Thank you again for your quick and helpful reply. I will try to fix the bugs as much as I can.

sshaoshuai commented 3 years ago

Could you print out src_points.shape, sampled_points.shape, and cur_pt_idxs.shape before this error line?
It may be due to the data in your point cloud scenes; I don't check whether the data is valid there.
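
For example, something like this just above the failing line in pcdet/models/backbones_3d/pfe/voxel_set_abstraction.py (a sketch; the exact placement is up to you):

    # debug print just before the line reported in the traceback
    print(src_points.shape, sampled_points.shape, cur_pt_idxs.shape)
    keypoints = sampled_points[0][cur_pt_idxs[0]].unsqueeze(dim=0)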

NLCharles commented 3 years ago

Hello, here is the printed output, in order:

torch.Size([25222, 3]) torch.Size([1, 25222, 3]) torch.Size([1, 2048])         
torch.Size([25222, 3]) torch.Size([1, 0, 3]) torch.Size([1, 2048])

So should I discard those samples?

sshaoshuai commented 3 years ago

The second line means that one of your samples doesn't have any points (bs_mask[:] == 0 for that sample). You need to check your data to see whether there is a scene without any points. Maybe I should add an assertion here.
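
Such an assertion might look roughly like this (the variable names follow the traceback above; the exact spot in get_sampled_points is an assumption):

    # sampled_points has shape (1, num_points_in_this_sample, 3) at this point
    assert sampled_points.shape[1] > 0, \
        'Empty point cloud in this sample: check the raw data and the collate_fn'
    keypoints = sampled_points[0][cur_pt_idxs[0]].unsqueeze(dim=0)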

NLCharles commented 3 years ago

However, I trained on these samples with a single GPU and one epoch finished. That is the thing I was most afraid of: all parts work fine (supposedly) and then a wrong output causes a crash. I resorted to nn.DataParallel, and in that manner the error messages above are thrown, as you can see.

I use a manually annotated dataset, which has 400k samples covering 8 categories and has been tested with other models. Without distributed training it would be impossible to make use of this data.

Setting the visible devices to a single GPU, with nn.DataParallel wrapping the model, produces a correct training sequence:

epochs:   0%|                                                                           | 0/50 [00:00<?, ?it/s]torch.Size([213432, 3]) torch.Size([1, 126306, 3]) torch.Size([1, 2048])             | 0/18000 [00:00<?, ?it/s]
torch.Size([213432, 3]) torch.Size([1, 87126, 3]) torch.Size([1, 2048])
epochs:   0%|                                                      | 0/50 [00:16<?, ?it/s, loss=71.8, lr=0.001]torch.Size([217809, 3]) torch.Size([1, 154030, 3]) torch.Size([1, 2048])[00:12<64:26:42, 12.89s/it, total_it=1]
torch.Size([217809, 3]) torch.Size([1, 63779, 3]) torch.Size([1, 2048])
epochs:   0%|                                                      | 0/50 [00:18<?, ?it/s, loss=10.5, lr=0.001]torch.Size([233962, 3]) torch.Size([1, 125349, 3]) torch.Size([1, 2048])[00:15<48:20:26,  9.67s/it, total_it=2]
torch.Size([233962, 3]) torch.Size([1, 108613, 3]) torch.Size([1, 2048])
epochs:   0%|                                                      | 0/50 [00:21<?, ?it/s, loss=7.11, lr=0.001]torch.Size([233235, 3]) torch.Size([1, 137783, 3]) torch.Size([1, 2048])[00:17<37:03:39,  7.41s/it, total_it=3]
torch.Size([233235, 3]) torch.Size([1, 95452, 3]) torch.Size([1, 2048])

sshaoshuai commented 3 years ago

Maybe you could check whether the error always occurs at the same sample and visualize that sample to see if it is valid, since I still think the problem is correlated with a specific piece of data.
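
If it does turn out to be one specific sample, a quick way to eyeball it could be something like this (a hypothetical snippet; the file path and point layout are assumptions about your own data):

    import numpy as np
    import matplotlib.pyplot as plt
    from mpl_toolkits.mplot3d import Axes3D  # noqa: F401, needed for 3D axes on older matplotlib

    # Load the suspect sample's raw points; replace the path/loader with your own.
    points = np.load('suspect_sample.npy')  # expected shape (N, 3) or (N, 4)
    print('number of points:', points.shape[0])

    fig = plt.figure()
    ax = fig.add_subplot(111, projection='3d')
    ax.scatter(points[:, 0], points[:, 1], points[:, 2], s=0.2)
    plt.show()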

passion3394 commented 3 years ago

I have the same situation; the wait time is very long. But on my server, the multi-GPU training did eventually start.