
[BUG]: loading OPT 66B model - CPU runs out of memory #5855

Open PurvangL opened 5 days ago

PurvangL commented 5 days ago

Is there an existing issue for this bug?

🐛 Describe the bug

I am trying to reproduce OPT-66B training on 16x H100 (2 servers). Each server has 1000 GiB of CPU memory. When I run the OPT benchmark, the program crashes with the following error; watching CPU memory, it climbs to 924 GiB before the crash. How can I run the OPT-66B benchmark with the resources mentioned above?
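For reference, a rough back-of-the-envelope estimate of why host memory fills up, assuming each local rank builds a full fp32 copy of the model on CPU before the weights are sharded (I have not verified that this is exactly what the demo does):

```python
# Hypothetical estimate: every local rank materializes a full fp32 copy
# of OPT-66B in CPU memory before sharding.
n_params = 66e9        # OPT-66B parameter count
bytes_per_param = 4    # fp32
ranks_per_node = 8     # 8 x H100 per server

per_copy_gib = n_params * bytes_per_param / 2**30
print(f"one full copy : ~{per_copy_gib:.0f} GiB")                                 # ~246 GiB
print(f"{ranks_per_node} copies/node: ~{ranks_per_node * per_copy_gib:.0f} GiB")  # ~1967 GiB > 1000 GiB
```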

error

WARNING:torch.distributed.run:                                                                                                                                        
*****************************************                                                                                                                             
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal 
performance in your application as needed.                                                                                                                            
*****************************************                                                                                                                             
/usr/local/lib/python3.8/dist-packages/colossalai/initialize.py:48: UserWarning: `config` is deprecated and will be removed soon.                                     
  warnings.warn("`config` is deprecated and will be removed soon.")                                                                                                   
[06/25/24 19:04:54] INFO     colossalai - colossalai - INFO: /usr/local/lib/python3.8/dist-packages/colossalai/initialize.py:67 launch                                
[06/25/24 19:04:55] INFO     colossalai - colossalai - INFO: Distributed environment is initialized, world size: 16                                                   
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 51974 closing signal SIGTERM                                                                    
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 51975 closing signal SIGTERM                                                                    
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 51976 closing signal SIGTERM                                                                    
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 51977 closing signal SIGTERM                                                                    
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 51978 closing signal SIGTERM                                                                    
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 51980 closing signal SIGTERM                                                                    
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 51981 closing signal SIGTERM                                                                    
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -9) local_rank: 5 (pid: 51979) of binary: /usr/bin/python                                       
Traceback (most recent call last):                                                                                                                                    
  File "/usr/local/bin/torchrun", line 33, in <module>                                                                                                                
    sys.exit(load_entry_point('torch==1.14.0a0+44dac51', 'console_scripts', 'torchrun')())                                                                            
  File "/usr/local/lib/python3.8/dist-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper                                    
    return f(*args, **kwargs)                                                                                                                                         
  File "/usr/local/lib/python3.8/dist-packages/torch/distributed/run.py", line 762, in main                                                                           
    run(args)                                                                                                                                                         
  File "/usr/local/lib/python3.8/dist-packages/torch/distributed/run.py", line 753, in run                                                                            
    elastic_launch(                                                                                                                                                   
  File "/usr/local/lib/python3.8/dist-packages/torch/distributed/launcher/api.py", line 132, in __call__                                                              
    return launch_agent(self._config, self._entrypoint, list(args))                                                                                                   
  File "/usr/local/lib/python3.8/dist-packages/torch/distributed/launcher/api.py", line 246, in launch_agent                                                          
    raise ChildFailedError(                                                                                                                                           
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:                                                                                                    
======================================================                                                                                                                
opt/opt_train_demo.py FAILED
------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------
Root Cause (first observed failure):

Environment

Docker image: nvcr.io/nvidia/pytorch:23.02-py3
transformers: 4.33
colossalai: 0.3.6

Edenzzzz commented 4 days ago

You can try lazy init as done here, and file a PR if it works: https://github.com/hpcaitech/ColossalAI/blob/8e718a1421203e0f5607f477e1a998567c70d123/examples/language/llama/benchmark.py#L245
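Roughly, the pattern there looks like the sketch below (a minimal adaptation for OPT, not the exact opt_train_demo.py code; the model name, optimizer choice, and plugin settings are just placeholders):

```python
# Minimal sketch, adapted from the lazy-init pattern in the linked llama benchmark.
from contextlib import nullcontext

import colossalai
from colossalai.booster import Booster
from colossalai.booster.plugin import GeminiPlugin, HybridParallelPlugin
from colossalai.lazy import LazyInitContext
from colossalai.nn.optimizer import HybridAdam
from colossalai.utils import get_current_device
from transformers import AutoConfig, OPTForCausalLM

colossalai.launch_from_torch(config={})
plugin = GeminiPlugin()  # or HybridParallelPlugin(tp_size=..., pp_size=...)
booster = Booster(plugin=plugin)

config = AutoConfig.from_pretrained("facebook/opt-66b")

# Build the model under LazyInitContext so parameters stay as lazy meta tensors
# instead of each rank allocating a full fp32 copy of the weights in host memory.
init_ctx = (
    LazyInitContext(default_device=get_current_device())
    if isinstance(plugin, (GeminiPlugin, HybridParallelPlugin))
    else nullcontext()
)
with init_ctx:
    model = OPTForCausalLM(config)

optimizer = HybridAdam(model.parameters(), lr=2e-5)
# boost() shards and materializes the lazy parameters according to the plugin.
model, optimizer, *_ = booster.boost(model, optimizer)
# (Loading pretrained weights would then go through booster.load_model(...).)
```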

PurvangL commented 2 days ago

Thanks @Edenzzzz for the suggestion, I will try it. I also have one more question: during evaluation of OPT, the eval loss of a model trained with the hybrid_parallel plugin is about 5x larger than with the gemini plugin, and this holds for most of the OPT variants. Do you know why?