megvii-research / CVPR2023-UniDistill

CVPR2023 (highlight) - UniDistill: A Universal Cross-Modality Knowledge Distillation Framework for 3D Object Detection in Bird's-Eye View
Apache License 2.0

Multi-GPU training for LiDAR and fusion exp #9

Open · Song-Jingyu opened this issue 1 year ago

Song-Jingyu commented 1 year ago

Hi,

Thanks for open-sourcing this work. When trying to train the LiDAR and fusion teacher networks, I wasn't able to start training with multiple GPUs. Single-GPU training works, and multi-GPU training of the camera exp works. Here is the error log, which I did not find very informative.

I already changed num_workers to 0, but it did not help. Is there anything significantly different among the modalities? Would you mind providing any insight into why this happens? Thanks!
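For reference, one thing that could rule out a device-binding mismatch is a per-rank sanity check like the sketch below (hypothetical helper, not code from this repo): an "invalid argument" CUDA error from the spconv voxel generator can appear when a tensor and a CUDA op end up on different devices across DDP ranks.

# Hypothetical per-rank device sanity check, not part of UniDistill.
import os
import torch

local_rank = int(os.environ.get("LOCAL_RANK", "0"))  # set by the DDP launcher
torch.cuda.set_device(local_rank)                     # bind this rank to its own GPU
print(f"rank {local_rank}: current CUDA device = {torch.cuda.current_device()}")

Rerunning with CUDA_LAUNCH_BLOCKING=1 (as the error message suggests) makes kernel launches synchronous, so the traceback points at the actual failing call rather than a later one.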

RuntimeError: zero_ /tmp/pip-build-env-__cfq4tn/overlay/lib/python3.6/site-packages/cumm/include/tensorview/tensor.h 221
cuda failed with error 1 invalid argument. use CUDA_LAUNCH_BLOCKING=1 to get correct traceback.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/jingyuso/kd_project/test/CVPR2023-UniDistill/unidistill/exps/multisensor_fusion/nuscenes/BEVFusion/BEVFusion_nuscenes_centerhead_lidar_exp.py", line 35, in <module>
    run_cli(Exp, "BEVFusion_nuscenes_centerhead_lidar_exp")
  File "/home/jingyuso/kd_project/test/CVPR2023-UniDistill/unidistill/exps/base_cli.py", line 58, in run_cli
    trainer.fit(model, model.train_dataloader, model.val_dataloader)
  File "/home/jingyuso/miniconda3/envs/unidistill_test/lib/python3.6/site-packages/pytorch_lightning/trainer/trainer.py", line 741, in fit
    self._fit_impl, model, train_dataloaders, val_dataloaders, datamodule, ckpt_path
  File "/home/jingyuso/miniconda3/envs/unidistill_test/lib/python3.6/site-packages/pytorch_lightning/trainer/trainer.py", line 698, in _call_and_handle_interrupt
    self.training_type_plugin.reconciliate_processes(traceback.format_exc())
  File "/home/jingyuso/miniconda3/envs/unidistill_test/lib/python3.6/site-packages/pytorch_lightning/plugins/training_type/ddp.py", line 533, in reconciliate_processes
    raise DeadlockDetectedException(f"DeadLock detected from rank: {self.global_rank} \n {trace}")
pytorch_lightning.utilities.exceptions.DeadlockDetectedException: DeadLock detected from rank: 3
 Traceback (most recent call last):
  File "/home/jingyuso/miniconda3/envs/unidistill_test/lib/python3.6/site-packages/pytorch_lightning/trainer/trainer.py", line 685, in _call_and_handle_interrupt
    return trainer_fn(*args, **kwargs)
  File "/home/jingyuso/miniconda3/envs/unidistill_test/lib/python3.6/site-packages/pytorch_lightning/trainer/trainer.py", line 777, in _fit_impl
    self._run(model, ckpt_path=ckpt_path)
  File "/home/jingyuso/miniconda3/envs/unidistill_test/lib/python3.6/site-packages/pytorch_lightning/trainer/trainer.py", line 1199, in _run
    self._dispatch()
  File "/home/jingyuso/miniconda3/envs/unidistill_test/lib/python3.6/site-packages/pytorch_lightning/trainer/trainer.py", line 1279, in _dispatch
    self.training_type_plugin.start_training(self)
  File "/home/jingyuso/miniconda3/envs/unidistill_test/lib/python3.6/site-packages/pytorch_lightning/plugins/training_type/training_type_plugin.py", line 202, in start_training
    self._results = trainer.run_stage()
  File "/home/jingyuso/miniconda3/envs/unidistill_test/lib/python3.6/site-packages/pytorch_lightning/trainer/trainer.py", line 1289, in run_stage
    return self._run_train()
  File "/home/jingyuso/miniconda3/envs/unidistill_test/lib/python3.6/site-packages/pytorch_lightning/trainer/trainer.py", line 1319, in _run_train
    self.fit_loop.run()
  File "/home/jingyuso/miniconda3/envs/unidistill_test/lib/python3.6/site-packages/pytorch_lightning/loops/base.py", line 145, in run
    self.advance(*args, **kwargs)
  File "/home/jingyuso/miniconda3/envs/unidistill_test/lib/python3.6/site-packages/pytorch_lightning/loops/fit_loop.py", line 234, in advance
    self.epoch_loop.run(data_fetcher)
  File "/home/jingyuso/miniconda3/envs/unidistill_test/lib/python3.6/site-packages/pytorch_lightning/loops/base.py", line 145, in run
    self.advance(*args, **kwargs)
  File "/home/jingyuso/miniconda3/envs/unidistill_test/lib/python3.6/site-packages/pytorch_lightning/loops/epoch/training_epoch_loop.py", line 193, in advance
    batch_output = self.batch_loop.run(batch, batch_idx)
  File "/home/jingyuso/miniconda3/envs/unidistill_test/lib/python3.6/site-packages/pytorch_lightning/loops/base.py", line 145, in run
    self.advance(*args, **kwargs)
  File "/home/jingyuso/miniconda3/envs/unidistill_test/lib/python3.6/site-packages/pytorch_lightning/loops/batch/training_batch_loop.py", line 88, in advance
    outputs = self.optimizer_loop.run(split_batch, optimizers, batch_idx)
  File "/home/jingyuso/miniconda3/envs/unidistill_test/lib/python3.6/site-packages/pytorch_lightning/loops/base.py", line 145, in run
    self.advance(*args, **kwargs)
  File "/home/jingyuso/miniconda3/envs/unidistill_test/lib/python3.6/site-packages/pytorch_lightning/loops/optimization/optimizer_loop.py", line 219, in advance
    self.optimizer_idx,
  File "/home/jingyuso/miniconda3/envs/unidistill_test/lib/python3.6/site-packages/pytorch_lightning/loops/optimization/optimizer_loop.py", line 266, in _run_optimization
    self._optimizer_step(optimizer, opt_idx, batch_idx, closure)
  File "/home/jingyuso/miniconda3/envs/unidistill_test/lib/python3.6/site-packages/pytorch_lightning/loops/optimization/optimizer_loop.py", line 386, in _optimizer_step
    using_lbfgs=is_lbfgs,
  File "/home/jingyuso/miniconda3/envs/unidistill_test/lib/python3.6/site-packages/pytorch_lightning/core/lightning.py", line 1652, in optimizer_step
    optimizer.step(closure=optimizer_closure)
  File "/home/jingyuso/miniconda3/envs/unidistill_test/lib/python3.6/site-packages/pytorch_lightning/core/optimizer.py", line 164, in step
    trainer.accelerator.optimizer_step(self._optimizer, self._optimizer_idx, closure, **kwargs)
  File "/home/jingyuso/miniconda3/envs/unidistill_test/lib/python3.6/site-packages/pytorch_lightning/accelerators/accelerator.py", line 339, in optimizer_step
    self.precision_plugin.optimizer_step(model, optimizer, opt_idx, closure, **kwargs)
  File "/home/jingyuso/miniconda3/envs/unidistill_test/lib/python3.6/site-packages/pytorch_lightning/plugins/precision/precision_plugin.py", line 163, in optimizer_step
    optimizer.step(closure=closure, **kwargs)
  File "/home/jingyuso/miniconda3/envs/unidistill_test/lib/python3.6/site-packages/torch/optim/lr_scheduler.py", line 65, in wrapper
    return wrapped(*args, **kwargs)
  File "/home/jingyuso/miniconda3/envs/unidistill_test/lib/python3.6/site-packages/torch/optim/optimizer.py", line 88, in wrapper
    return func(*args, **kwargs)
  File "/home/jingyuso/miniconda3/envs/unidistill_test/lib/python3.6/site-packages/torch/autograd/grad_mode.py", line 28, in decorate_context
    return func(*args, **kwargs)
  File "/home/jingyuso/miniconda3/envs/unidistill_test/lib/python3.6/site-packages/torch/optim/adamw.py", line 65, in step
    loss = closure()
  File "/home/jingyuso/miniconda3/envs/unidistill_test/lib/python3.6/site-packages/pytorch_lightning/plugins/precision/precision_plugin.py", line 148, in _wrap_closure
    closure_result = closure()
  File "/home/jingyuso/miniconda3/envs/unidistill_test/lib/python3.6/site-packages/pytorch_lightning/loops/optimization/optimizer_loop.py", line 160, in __call__
    self._result = self.closure(*args, **kwargs)
  File "/home/jingyuso/miniconda3/envs/unidistill_test/lib/python3.6/site-packages/pytorch_lightning/loops/optimization/optimizer_loop.py", line 142, in closure
    step_output = self._step_fn()
  File "/home/jingyuso/miniconda3/envs/unidistill_test/lib/python3.6/site-packages/pytorch_lightning/loops/optimization/optimizer_loop.py", line 435, in _training_step
    training_step_output = self.trainer.accelerator.training_step(step_kwargs)
  File "/home/jingyuso/miniconda3/envs/unidistill_test/lib/python3.6/site-packages/pytorch_lightning/accelerators/accelerator.py", line 219, in training_step
    return self.training_type_plugin.training_step(*step_kwargs.values())
  File "/home/jingyuso/miniconda3/envs/unidistill_test/lib/python3.6/site-packages/pytorch_lightning/plugins/training_type/ddp.py", line 439, in training_step
    return self.model(*args, **kwargs)
  File "/home/jingyuso/miniconda3/envs/unidistill_test/lib/python3.6/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/jingyuso/miniconda3/envs/unidistill_test/lib/python3.6/site-packages/torch/nn/parallel/distributed.py", line 799, in forward
    output = self.module(*inputs[0], **kwargs[0])
  File "/home/jingyuso/miniconda3/envs/unidistill_test/lib/python3.6/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/jingyuso/miniconda3/envs/unidistill_test/lib/python3.6/site-packages/pytorch_lightning/overrides/base.py", line 81, in forward
    output = self.module.training_step(*inputs, **kwargs)
  File "/home/jingyuso/kd_project/test/CVPR2023-UniDistill/unidistill/exps/multisensor_fusion/nuscenes/BEVFusion/BEVFusion_nuscenes_base_exp.py", line 374, in training_step
    ret_dict, tf_dict, _, _, _, _ = self(points, imgs, metas, gt_boxes)
  File "/home/jingyuso/miniconda3/envs/unidistill_test/lib/python3.6/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/jingyuso/kd_project/test/CVPR2023-UniDistill/unidistill/exps/multisensor_fusion/nuscenes/BEVFusion/BEVFusion_nuscenes_base_exp.py", line 358, in forward
    return self.model(points, imgs, metas, gt_boxes)
  File "/home/jingyuso/miniconda3/envs/unidistill_test/lib/python3.6/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/jingyuso/kd_project/test/CVPR2023-UniDistill/unidistill/exps/multisensor_fusion/nuscenes/BEVFusion/BEVFusion_nuscenes_centerhead_fusion_exp.py", line 144, in forward
    lidar_output = self.lidar_encoder(lidar_points)
  File "/home/jingyuso/miniconda3/envs/unidistill_test/lib/python3.6/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/jingyuso/kd_project/test/CVPR2023-UniDistill/unidistill/exps/multisensor_fusion/nuscenes/BEVFusion/BEVFusion_nuscenes_base_exp.py", line 76, in forward
    voxels, voxel_coords, voxel_num_points = self.voxelizer(lidar_points)
  File "/home/jingyuso/miniconda3/envs/unidistill_test/lib/python3.6/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/jingyuso/kd_project/test/CVPR2023-UniDistill/unidistill/data/det3d/preprocess/voxelization.py", line 54, in forward
    voxel_output = self.voxel_generator(p)
  File "/home/jingyuso/miniconda3/envs/unidistill_test/lib/python3.6/site-packages/spconv/pytorch/utils.py", line 88, in __call__
    res = self.generate_voxel_with_id(pc, clear_voxels, empty_mean)
  File "/home/jingyuso/miniconda3/envs/unidistill_test/lib/python3.6/site-packages/spconv/pytorch/utils.py", line 139, in generate_voxel_with_id
    empty_mean, clear_voxels, stream)
RuntimeError: zero_ /tmp/pip-build-env-__cfq4tn/overlay/lib/python3.6/site-packages/cumm/include/tensorview/tensor.h 221
cuda failed with error 1 invalid argument. use CUDA_LAUNCH_BLOCKING=1 to get correct traceback.

Killed 
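
To isolate the failing step, a standalone reproduction of the voxelizer call could look roughly like the sketch below. Assumptions: spconv 2.x's PointToVoxel API, and illustrative voxel size, range, and feature count, not the repo's exact nuScenes configuration. Running it on a non-zero device, as a worker rank would, can show whether the voxelizer itself fails in isolation.

# Standalone reproduction sketch (illustrative parameters, not the repo's config).
import torch
from spconv.pytorch.utils import PointToVoxel

device = torch.device("cuda:1")  # deliberately a non-zero device, like a worker rank
voxel_gen = PointToVoxel(
    vsize_xyz=[0.075, 0.075, 0.2],
    coors_range_xyz=[-54.0, -54.0, -5.0, 54.0, 54.0, 3.0],
    num_point_features=5,
    max_num_voxels=120000,
    max_num_points_per_voxel=10,
    device=device,
)
points = torch.rand(20000, 5, device=device) * 20.0 - 10.0  # fake point cloud on the same device
voxels, coords, num_points = voxel_gen(points)
print(voxels.shape, coords.shape, num_points.shape)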
LutaoChu commented 5 months ago

I ran into the same problem. The only difference is that in the fusion modality with batch size 1, multi-GPU training runs normally. Do you know how to solve this problem? @Song-Jingyu

Song-Jingyu commented 5 months ago

I ran into the same problem. The only difference is that in the fusion modality with batch size 1, multi-GPU training runs normally. Do you know how to solve this problem? @Song-Jingyu

I think it turned out that my server had limited CPU/RAM. I only did a preliminary exploration of this repo :(

LutaoChu commented 5 months ago

Thanks for the response. From the error logs it looks like a GPU-memory-related issue; why do you say it's because of limited CPU/RAM?
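
(Worth noting: a bare "Killed" at the end of a log with no Python traceback, like the one above, is usually the Linux OOM killer terminating the process for exhausting host RAM, which would point at CPU/RAM after all.) One way to narrow it down would be to log both GPU and host memory while the run starts, along the lines of the sketch below; psutil is an extra dependency here, not something this repo requires.

# Sketch for separating GPU-memory pressure from host-RAM pressure during startup.
import psutil
import torch

def log_memory(tag: str) -> None:
    gib = 1024 ** 3
    gpu_alloc = torch.cuda.memory_allocated() / gib     # memory held by live tensors
    gpu_reserved = torch.cuda.memory_reserved() / gib   # memory held by the caching allocator
    host = psutil.virtual_memory()
    print(
        f"[{tag}] gpu_alloc={gpu_alloc:.2f} GiB, gpu_reserved={gpu_reserved:.2f} GiB, "
        f"host_used={host.used / gib:.2f} GiB ({host.percent}%)"
    )

log_memory("before_fit")  # e.g. call again inside training_step, or watch from another shell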

SivenCapo commented 5 months ago

I ran into the same problem. I tried every approach I know of, but it always failed.