microsoft / DeepSpeedExamples

Example models using DeepSpeed
Apache License 2.0

Much more memory used in step 3 when using multiple GPUs compared to a single GPU #529

Open cokuehuang opened 1 year ago

cokuehuang commented 1 year ago

System Info: Memory: 500G, GPU: 8 * A100 80G

Question: Why does initializing DeepSpeedRLHFEngine use much more memory with multiple GPUs than with a single GPU?
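To clarify what I mean by "init": every rank launched by deepspeed runs the same construction path, so during init each of the 8 processes holds its own full copy of each model in host memory before anything is moved to the GPU. A simplified sketch of that per-rank pattern (illustrative config and call shape only, not the actual rlhf_engine.py code):

    # Rough per-rank init pattern (simplified; the real script builds separate
    # DeepSpeed configs for the actor/ref/critic/reward engines).
    import deepspeed
    import torch
    from transformers import AutoModelForCausalLM

    # Illustrative config only -- not the config used by the training script.
    ds_config = {
        "train_batch_size": 8,
        "train_micro_batch_size_per_gpu": 1,
        "fp16": {"enabled": True},
    }

    def init_engine(model_path):
        # Every rank loads the full fp16 checkpoint into CPU RAM first, so with
        # 8 ranks there are 8 copies of each model in host memory during init.
        model = AutoModelForCausalLM.from_pretrained(model_path, torch_dtype=torch.float16)
        engine, *_ = deepspeed.initialize(model=model, config=ds_config)
        return engine

    actor = init_engine("/models/actor_models/llama-13B-lora")  # likewise for ref/critic/reward

With a 13B actor/ref and a 7B critic/reward in fp16, 8 host-side copies of everything is already in the hundreds of GB, which seems consistent with the ~500G peak I see.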

Reproduce:

1. Copy model_load.py to DeepSpeedExamples/applications/DeepSpeed-Chat/training/step3_rlhf_finetuning
2. Copy test_model_load.sh to DeepSpeedExamples/applications/DeepSpeed-Chat/training/step3_rlhf_finetuning/training_scripts/single_node

Test with 8 GPUs:

    cd DeepSpeedExamples/applications/DeepSpeed-Chat/training/step3_rlhf_finetuning
    bash training_scripts/single_node/test_model_load.sh

Max memory used: 500G. Logs:

[2023-05-16 18:41:16,882] [WARNING] [runner.py:191:fetch_hostfile] Unable to find hostfile, will proceed with training with local resources only.
    [2023-05-16 18:41:17,031] [INFO] [runner.py:541:main] cmd = /opt/conda/envs/dschat/bin/python -u -m deepspeed.launcher.launch --world_info=eyJsb2NhbGhvc3QiOiBbMCwgMSwgMiwgMywgNCwgNSwgNiwgN119 --master_addr=127.0.0.1 --master_port=12346 --enable_each_rank_log=None model_load.py --data_path Dahoas/rm-static --data_split 2,4,4 --actor_model_name_or_path /models/actor_models/llama-13B-lora --critic_model_name_or_path /models/reward_models/llama-7B --num_padding_at_beginning 0 --per_device_train_batch_size 4 --per_device_mini_train_batch_size 4 --generation_batch_numbers 1 --ppo_epochs 1 --max_answer_seq_len 512 --max_prompt_seq_len 512 --actor_learning_rate 5e-4 --critic_learning_rate 5e-6 --num_train_epochs 1 --lr_scheduler_type cosine --gradient_accumulation_steps 1 --disable_actor_dropout --num_warmup_steps 100 --deepspeed --seed 1234 --actor_zero_stage 2 --critic_zero_stage 2 --actor_lora_dim 128 --critic_lora_dim 128 --critic_lora_module_name layers. --actor_lora_module_name layers. --only_optimize_lora --output_dir ./output
    [2023-05-16 18:41:19,234] [INFO] [launch.py:229:main] WORLD INFO DICT: {'localhost': [0, 1, 2, 3, 4, 5, 6, 7]}
    [2023-05-16 18:41:19,234] [INFO] [launch.py:235:main] nnodes=1, num_local_procs=8, node_rank=0
    [2023-05-16 18:41:19,234] [INFO] [launch.py:246:main] global_rank_mapping=defaultdict(<class 'list'>, {'localhost': [0, 1, 2, 3, 4, 5, 6, 7]})
    [2023-05-16 18:41:19,234] [INFO] [launch.py:247:main] dist_world_size=8
    [2023-05-16 18:41:19,235] [INFO] [launch.py:249:main] Setting CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
    [2023-05-16 18:41:23,339] [INFO] [comm.py:622:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
    ************************[start] Initializing Actor Model [start] *************************
    [2023-05-16 18:43:03,035] [INFO] [launch.py:428:sigkill_handler] Killing subprocess 93127
    [2023-05-16 18:43:06,403] [INFO] [launch.py:428:sigkill_handler] Killing subprocess 93128
    [2023-05-16 18:43:09,065] [INFO] [launch.py:428:sigkill_handler] Killing subprocess 93129
    [2023-05-16 18:43:09,066] [INFO] [launch.py:428:sigkill_handler] Killing subprocess 93130
    [2023-05-16 18:43:12,093] [INFO] [launch.py:428:sigkill_handler] Killing subprocess 93131
    [2023-05-16 18:43:14,519] [INFO] [launch.py:428:sigkill_handler] Killing subprocess 93132
    [2023-05-16 18:43:17,460] [INFO] [launch.py:428:sigkill_handler] Killing subprocess 93133
    [2023-05-16 18:43:20,163] [INFO] [launch.py:428:sigkill_handler] Killing subprocess 93134
    [2023-05-16 18:43:23,026] [ERROR] [launch.py:434:sigkill_handler] ['/opt/conda/envs/dschat/bin/python', '-u', 'model_load.py', '--local_rank=7', '--data_path', 'Dahoas/rm-static', '--data_split', '2,4,4', '--actor_model_name_or_path', '/models/actor_models/llama-13B-lora', '--critic_model_name_or_path', '/models/reward_models/llama-7B', '--num_padding_at_beginning', '0', '--per_device_train_batch_size', '4', '--per_device_mini_train_batch_size', '4', '--generation_batch_numbers', '1', '--ppo_epochs', '1', '--max_answer_seq_len', '512', '--max_prompt_seq_len', '512', '--actor_learning_rate', '5e-4', '--critic_learning_rate', '5e-6', '--num_train_epochs', '1', '--lr_scheduler_type', 'cosine', '--gradient_accumulation_steps', '1', '--disable_actor_dropout', '--num_warmup_steps', '100', '--deepspeed', '--seed', '1234', '--actor_zero_stage', '2', '--critic_zero_stage', '2', '--actor_lora_dim', '128', '--critic_lora_dim', '128', '--critic_lora_module_name', 'layers.', '--actor_lora_module_name', 'layers.', '--only_optimize_lora', '--output_dir', './output'] exits with return code = -9

Test with 1 GPU:

    cd DeepSpeedExamples/applications/DeepSpeed-Chat/training/step3_rlhf_finetuning
    CUDA_VISIBLE_DEVICES=0 bash training_scripts/single_node/test_model_load.sh

Max memory used: 80G. Logs:

[2023-05-16 19:29:44,923] [WARNING] [runner.py:191:fetch_hostfile] Unable to find hostfile, will proceed with training with local resources only.
    Detected CUDA_VISIBLE_DEVICES=1: setting --include=localhost:1
    [2023-05-16 19:29:45,592] [INFO] [runner.py:541:main] cmd = /opt/conda/envs/dschat/bin/python -u -m deepspeed.launcher.launch --world_info=eyJsb2NhbGhvc3QiOiBbMV19 --master_addr=127.0.0.1 --master_port=12346 --enable_each_rank_log=None model_load.py --data_path Dahoas/rm-static --data_split 2,4,4 --actor_model_name_or_path /models/actor_models/llama-13B-lora --critic_model_name_or_path /models/reward_models/llama-7B-new --num_padding_at_beginning 0 --per_device_train_batch_size 4 --per_device_mini_train_batch_size 4 --generation_batch_numbers 1 --ppo_epochs 1 --max_answer_seq_len 512 --max_prompt_seq_len 512 --actor_learning_rate 5e-4 --critic_learning_rate 5e-6 --num_train_epochs 1 --lr_scheduler_type cosine --gradient_accumulation_steps 1 --disable_actor_dropout --num_warmup_steps 100 --deepspeed --seed 1234 --actor_zero_stage 2 --critic_zero_stage 2 --actor_lora_dim 128 --critic_lora_dim 128 --critic_lora_module_name layers. --actor_lora_module_name layers. --only_optimize_lora --output_dir ./output
    [2023-05-16 19:29:47,689] [INFO] [launch.py:229:main] WORLD INFO DICT: {'localhost': [1]}
    [2023-05-16 19:29:47,689] [INFO] [launch.py:235:main] nnodes=1, num_local_procs=1, node_rank=0
    [2023-05-16 19:29:47,689] [INFO] [launch.py:246:main] global_rank_mapping=defaultdict(<class 'list'>, {'localhost': [0]})
    [2023-05-16 19:29:47,689] [INFO] [launch.py:247:main] dist_world_size=1
    [2023-05-16 19:29:47,689] [INFO] [launch.py:249:main] Setting CUDA_VISIBLE_DEVICES=1
    [2023-05-16 19:29:51,316] [INFO] [comm.py:622:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
    ************************[start] Initializing Actor Model [start] *************************
    ...
    *****************[end] Initialized Actor Model [end] (duration: 1162.76s)*****************
    *************************[start] Initializing Ref Model [start] **************************
    ...
    ******************[end] Initialized Ref Model [end] (duration: 100.52s)*******************
    ************************[start] Initializing Critic Model [start] ************************
    ...
    ...
        *****************[end] Initialized Critic Model [end] (duration: 344.25s)*****************
        ************************[start] Initializing Reward Model [start] ************************
        Traceback (most recent call last):
          File "model_load.py", line 352, in <module>
            main()
          File "model_load.py", line 336, in main
            rlhf_engine = DeepSpeedRLHFEngine(
          File "/DeepSpeedExamples/applications/DeepSpeed-Chat/training/step3_rlhf_finetuning/rlhf_engine.py", line 59, in __init__
            self.reward = self._init_reward(
          File "/DeepSpeedExamples/applications/DeepSpeed-Chat/training/step3_rlhf_finetuning/rlhf_engine.py", line 269, in _init_reward
            reward_engine, *_ = deepspeed.initialize(model=reward_model,
          File "/opt/conda/envs/dschat/lib/python3.8/site-packages/deepspeed/__init__.py", line 165, in initialize
            engine = DeepSpeedEngine(args=args,
          File "/opt/conda/envs/dschat/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 266, in __init__
            self._configure_distributed_model(model)
          File "/opt/conda/envs/dschat/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 1037, in _configure_distributed_model
            self.module.to(self.device)
          File "/opt/conda/envs/dschat/lib/python3.8/site-packages/torch/nn/modules/module.py", line 927, in to
            return self._apply(convert)
          File "/opt/conda/envs/dschat/lib/python3.8/site-packages/torch/nn/modules/module.py", line 579, in _apply
            module._apply(fn)
          File "/opt/conda/envs/dschat/lib/python3.8/site-packages/torch/nn/modules/module.py", line 579, in _apply
            module._apply(fn)
          File "/opt/conda/envs/dschat/lib/python3.8/site-packages/torch/nn/modules/module.py", line 579, in _apply
            module._apply(fn)
          [Previous line repeated 2 more times]
          File "/opt/conda/envs/dschat/lib/python3.8/site-packages/torch/nn/modules/module.py", line 602, in _apply
            param_applied = fn(param)
          File "/opt/conda/envs/dschat/lib/python3.8/site-packages/torch/nn/modules/module.py", line 925, in convert
            return t.to(device, dtype if t.is_floating_point() or t.is_complex() else None, non_blocking)
        RuntimeError: CUDA out of memory. Tried to allocate 86.00 MiB (GPU 0; 79.15 GiB total capacity; 78.03 GiB already allocated; 51.69 MiB free; 78.11 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
        [2023-05-16 20:31:26,346] [INFO] [launch.py:428:sigkill_handler] Killing subprocess 121057
        [2023-05-16 20:31:27,001] [ERROR] [launch.py:434:sigkill_handler] ['/opt/conda/envs/dschat/bin/python', '-u', 'model_load.py', '--local_rank=0', '--data_path', 'Dahoas/rm-static', '--data_split', '2,4,4', '--actor_model_name_or_path', '/models/actor_models/llama-13B-lora', '--critic_model_name_or_path', '/models/reward_models/llama-7B', '--num_padding_at_beginning', '0', '--per_device_train_batch_size', '4', '--per_device_mini_train_batch_size', '4', '--generation_batch_numbers', '1', '--ppo_epochs', '1', '--max_answer_seq_len', '512', '--max_prompt_seq_len', '512', '--actor_learning_rate', '5e-4', '--critic_learning_rate', '5e-6', '--num_train_epochs', '1', '--lr_scheduler_type', 'cosine', '--gradient_accumulation_steps', '1', '--disable_actor_dropout', '--num_warmup_steps', '100', '--deepspeed', '--seed', '1234', '--actor_zero_stage', '2', '--critic_zero_stage', '2', '--actor_lora_dim', '128', '--critic_lora_dim', '128', '--critic_lora_module_name', 'layers.', '--actor_lora_module_name', 'layers.', '--only_optimize_lora', '--output_dir', './output'] exits with return code = 1
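The single-GPU run makes it through the actor, ref, and critic models and only hits the allocator error above while initializing the reward model. The max_split_size_mb hint from that message can be tried by setting the allocator config before the process makes its first CUDA allocation, e.g. near the top of model_load.py (untested here; the 128 MiB value is just an example, not a recommendation):

    import os
    # Allocator hint from the OOM message above. PyTorch reads this when the
    # CUDA caching allocator initializes, so set it before any CUDA work.
    os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "max_split_size_mb:128")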

files.zip

jomayeri commented 1 year ago

Hi @cokuehuang thanks for passing the scripts along. I'm investigating a few issues showing similar behavior. I'll update once I have something concrete.

jomayeri commented 1 year ago

Hi @cokuehuang, the Hugging Face model paths in the scripts are incorrect and cause errors before any memory allocation happens.

ZJXNEFU commented 1 year ago

Same hardware environment, same problem. I only chose a 15B-parameter model as the actor and it still fails; how can the 30B OPT model work properly?

cokuehuang commented 1 year ago

@jomayeri The model paths are the step 1 (LLaMA 13B) and step 2 (LLaMA 7B) outputs saved on the local machine.

yadavpa1 commented 4 months ago

I am facing a similar issue.

    rank6: Traceback (most recent call last):
    rank6:   File "/DeepSpeedExamples/applications/DeepSpeed-Chat/training/step3_rlhf_finetuning/main.py", line 671, in <module>
    rank6:   File "/DeepSpeedExamples/applications/DeepSpeed-Chat/training/step3_rlhf_finetuning/main.py", line 478, in main
    rank6:     rlhf_engine = DeepSpeedRLHFEngine(
    rank6:   File "/DeepSpeedExamples/applications/DeepSpeed-Chat/dschat/rlhf/rlhf_engine.py", line 50, in __init__
    rank6:     self.ref = self._init_ref(
    rank6:   File "/DeepSpeedExamples/applications/DeepSpeed-Chat/dschat/rlhf/rlhf_engine.py", line 155, in _init_ref
    rank6:     ref_engine, *_ = deepspeed.initialize(model=ref_model,
    ...
    rank6: torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 86.00 MiB. GPU has a total capacity of 79.15 GiB of which 28.62 MiB is free. Process 3126749 has 57.98 GiB memory in use. Including non-PyTorch memory, this process has 21.12 GiB memory in use. Of the allocated memory 18.41 GiB is allocated by PyTorch, and 102.17 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
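The expandable_segments hint in this newer PyTorch message can be set the same way as the max_split_size_mb one earlier in the thread, before the process touches the GPU (whether it helps with an init-time OOM like this is untested):

    import os
    # Fragmentation hint from the error message; only takes effect if set
    # before the first CUDA allocation in the process.
    os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "expandable_segments:True")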