Open · songyuc opened this issue 2 years ago
Hi, could you provide your training code for us to reproduce this bug? Besides, could you double-check your dataset settings?
I have tried our code with a simple change of model from resnet to shufflenet. It takes about 32521 MiB with BATCH_SIZE = 16384, and no OOM occurred.
Hi @songyuc, you can uninstall your current colossalai and install our latest version with:
git clone https://github.com/hpcaitech/ColossalAI.git
cd ColossalAI
# install dependency
pip install -r requirements/requirements.txt
# install colossalai
pip install .
There was a bug in a previous release that took up extra GPU memory. With our latest version, BATCH_SIZE=16384 only takes about 10605 MiB. Hope this solves your issue.
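As a rough sanity check on the two figures reported in this thread (about 32521 MiB before the fix and about 10605 MiB after, both at BATCH_SIZE = 16384), the relative memory saving can be computed directly:

```python
# Peak GPU memory figures reported in this thread for BATCH_SIZE = 16384
before_fix_mib = 32521  # older colossalai release (extra-memory bug)
after_fix_mib = 10605   # latest version installed from source

saved_mib = before_fix_mib - after_fix_mib
saving_ratio = saved_mib / before_fix_mib

print(f"Memory saved: {saved_mib} MiB ({saving_ratio:.0%})")
# → Memory saved: 21916 MiB (67%)
```

So the fix roughly cuts peak GPU memory to a third of what the buggy release used.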
Thank you for the guide! I will try it later.
🐛 Describe the bug
models.shufflenet_v2_x1_0 can be trained with BATCH_SIZE = 16384, but it cannot be run successfully with ColossalAI. The details are below:

Environment
CUDA: 11.4