microsoft / DeepSpeedExamples

Example models using DeepSpeed
Apache License 2.0

[Deepspeed-Chat] OOM issue on opt-1.3B on a 8xV100 machine (8x16GB) #271

Closed kouroshHakha closed 1 year ago

kouroshHakha commented 1 year ago

Hello,

I am testing out the SFT stage of the example on a p3.16xlarge machine, but it OOMs. Is there anything I am missing from my configs?

Note: I commented out the data paths so that it picks up the small default dataset.

#!/bin/bash
# Copyright (c) Microsoft Corporation.
# SPDX-License-Identifier: Apache-2.0

# DeepSpeed Team

# Note that usually LoRA needs to use larger learning rate
OUTPUT_PATH=./output
mkdir -p $OUTPUT_PATH

   # --data_path Dahoas/rm-static Dahoas/full-hh-rlhf Dahoas/synthetic-instruct-gptj-pairwise yitingxie/rlhf-reward-datasets openai/webgpt_comparisons stanfordnlp/SHP \
   # --data_split 2,4,4 \

deepspeed main.py \
   --model_name_or_path facebook/opt-1.3b \
   --per_device_train_batch_size 8 \
   --per_device_eval_batch_size 8 \
   --max_seq_len 512 \
   --learning_rate 1e-3 \
   --weight_decay 0.1 \
   --num_train_epochs 2 \
   --gradient_accumulation_steps 1 \
   --lr_scheduler_type cosine \
   --num_warmup_steps 0 \
   --seed 1234 \
   --zero_stage 0 \
   --lora_dim 128 \
   --lora_module_name decoder.layers. \
   --only_optimize_lora \
   --deepspeed \
   --output_dir $OUTPUT_PATH \
   &> $OUTPUT_PATH/training.log
mrwyattii commented 1 year ago

Hi @kouroshHakha, and thanks for checking out DeepSpeed-Chat! It looks like the p3.16xlarge instances have 8x V100 GPUs with 16GB of VRAM each, is that correct?

I just tested this script locally on A6000 GPUs and I'm seeing more than 16GB used per GPU. Try lowering the --per_device_*_batch_size values (some quick tests on my machine showed that reducing to --per_device_train_batch_size 2 should consume less than 16GB per GPU). You can also increase the --zero_stage (to 1, 2, or 3) to reduce per-device GPU memory usage.
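
(For reference, a minimal sketch of those two changes applied to the launch flags from the script above; the batch size of 2 and ZeRO stage 3 here are illustrative starting points, not verified settings:)

deepspeed main.py \
   --model_name_or_path facebook/opt-1.3b \
   --per_device_train_batch_size 2 \
   --per_device_eval_batch_size 2 \
   --max_seq_len 512 \
   --zero_stage 3 \
   --deepspeed \
   --output_dir ./output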

mrwyattii commented 1 year ago

Enabling --gradient_checkpointing or --only_optimize_lora can also reduce GPU memory usage. However, these two settings cannot be enabled at the same time.

https://github.com/microsoft/DeepSpeedExamples/blob/e320e75a38e08ec8634c4787a01f50314ba42353/applications/DeepSpeed-Chat/training/step1_supervised_finetuning/main.py#L133 https://github.com/microsoft/DeepSpeedExamples/blob/e320e75a38e08ec8634c4787a01f50314ba42353/applications/DeepSpeed-Chat/training/step1_supervised_finetuning/main.py#L154
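
(For reference, a minimal sketch of the two mutually exclusive variants, added to the launch flags shown above; pick one or the other, not both:)

# Variant A: gradient/activation checkpointing, all parameters trained
   --gradient_checkpointing \

# Variant B: LoRA-only optimization (LoRA flags as in the original script)
   --lora_dim 128 \
   --lora_module_name decoder.layers. \
   --only_optimize_lora \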

kouroshHakha commented 1 year ago

Thanks for getting back. Yes, that's correct, and this script is now running:

#!/bin/bash
# Copyright (c) Microsoft Corporation.
# SPDX-License-Identifier: Apache-2.0

# DeepSpeed Team
OUTPUT=$1
ZERO_STAGE=$2
if [ "$OUTPUT" == "" ]; then
    OUTPUT=./output
fi
if [ "$ZERO_STAGE" == "" ]; then
    ZERO_STAGE=2
fi
mkdir -p $OUTPUT

#    --data_path Dahoas/rm-static Dahoas/full-hh-rlhf Dahoas/synthetic-instruct-gptj-pairwise yitingxie/rlhf-reward-datasets openai/webgpt_comparisons stanfordnlp/SHP \
#    --data_split 2,4,4 \
deepspeed main.py \
   --model_name_or_path facebook/opt-1.3b \
   --per_device_train_batch_size 2 \
   --per_device_eval_batch_size 2 \
   --max_seq_len 512 \
   --learning_rate 9.65e-6 \
   --weight_decay 0.1 \
   --num_train_epochs 2 \
   --gradient_accumulation_steps 1 \
   --lr_scheduler_type cosine \
   --num_warmup_steps 0 \
   --seed 1234 \
   --zero_stage $ZERO_STAGE \
   --deepspeed \
   --only_optimize_lora \
   --output_dir $OUTPUT #\
   #&> $OUTPUT/training.log
mrwyattii commented 1 year ago

Thank you for confirming the script is running. Closing this issue, but please reopen if you see an OOM again.

Vincent131499 commented 1 year ago

@kouroshHakha Can you share your Python environment configuration (Python, torch, deepspeed, transformers versions)? My training always errors out. For example: Python 3.8, DeepSpeed 0.9.0. Thanks.
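
(For reference, one way to collect the environment details being asked about, assuming a pip-based install:)

ds_report                                                     # DeepSpeed's built-in environment report
python --version
python -m pip list | grep -Ei "torch|deepspeed|transformers"  # package versions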

mrwyattii commented 1 year ago

Hi @Vincent131499, could you please share the error message you see? It will be located in output/actor-models/{model_size}/training.log

FlowInter commented 1 year ago

@mrwyattii Hello, I want to train on the AWS cloud, but it always errors even though I have already changed --per_device_eval_batch_size to 2 and enabled --only_optimize_lora and --gradient_checkpointing. Can you give me some suggestions? This is the error message in training.log:

torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 400.00 MiB (GPU 0; 14.62 GiB total capacity; 13.59 GiB already allocated; 293.94 MiB free; 14.09 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
[2023-04-17 08:49:35,240] [INFO] [launch.py:428:sigkill_handler] Killing subprocess 4656
[2023-04-17 08:49:35,240] [ERROR] [launch.py:434:sigkill_handler] ['/home/ec2-user/anaconda3/envs/pytorch_p39/bin/python3.9', '-u', 'main.py', '--local_rank=0', '--data_path', 'Dahoas/rm-static', 'Dahoas/full-hh-rlhf', 'Dahoas/synthetic-instruct-gptj-pairwise', 'yitingxie/rlhf-reward-datasets', 'openai/webgpt_comparisons', 'stanfordnlp/SHP', '--data_split', '2,4,4', '--model_name_or_path', 'facebook/opt-13b', '--per_device_train_batch_size', '2', '--per_device_eval_batch_size', '2', '--max_seq_len', '512', '--learning_rate', '1e-4', '--weight_decay', '0.1', '--num_train_epochs', '2', '--gradient_accumulation_steps', '1', '--lr_scheduler_type', 'cosine', '--num_warmup_steps', '0', '--seed', '1234', '--gradient_checkpointing', '--zero_stage', '3', '--lora_dim', '128', '--lora_module_name', 'decoder.layers.', '--deepspeed', '--output_dir', '/home/ec2-user/DeepSpeedExamples/applications/DeepSpeed-Chat/output/actor-models/13b'] exits with return code = 1

mrwyattii commented 1 year ago

@FlowInter the error message indicates you are trying to run the OPT-13b model: '--model_name_or_path', 'facebook/opt-13b'

You will not be able to run such a large model on a 16GB V100. Could you please try a smaller model (like OPT-1.3b)?
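
(For reference, assuming the top-level train.py launcher is used on a single GPU, the smaller actor model can be selected like this:)

python train.py --actor-model facebook/opt-1.3b --reward-model facebook/opt-350m --deployment-type single_gpu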

SH0AN commented 1 year ago

May I ask how to solve this problem? (The "pip install -r requirements.txt" command ran successfully.) Output logs:

---=== Running Step 1 ===---
Running: bash /home/sh0an/DeepSpeedExamples/applications/DeepSpeed-Chat/training/step1_supervised_finetuning/training_scripts/single_gpu/run_1.3b.sh /home/sh0an/DeepSpeedExamples/applications/DeepSpeed-Chat/output/actor-models/1.3b
Traceback (most recent call last):
  File "train.py", line 210, in <module>
    main(args)
  File "train.py", line 195, in main
    launch_cmd(args, step_num, cmd)
  File "train.py", line 175, in launch_cmd
    raise RuntimeError('\n\n'.join((
RuntimeError: Step 1 exited with non-zero status 1

Launch command: bash /home/sh0an/DeepSpeedExamples/applications/DeepSpeed-Chat/training/step1_supervised_finetuning/training_scripts/single_gpu/run_1.3b.sh /home/sh0an/DeepSpeedExamples/applications/DeepSpeed-Chat/output/actor-models/1.3b

Log output: /home/sh0an/DeepSpeedExamples/applications/DeepSpeed-Chat/output/actor-models/1.3b/training.log

Please see our tutorial at https://github.com/microsoft/DeepSpeedExamples/tree/master/applications/DeepSpeed-Chat/training/step1_supervised_finetuning

Please check that you have installed our requirements: pip install -r requirements.txt

If you are seeing an OOM error, try modifying /home/sh0an/DeepSpeedExamples/applications/DeepSpeed-Chat/training/step1_supervised_finetuning/training_scripts/single_gpu/run_1.3b.sh:

FlowInter commented 1 year ago

@mrwyattii Hello mrwyattii, thank you for your answer. I already changed the model to "python train.py --actor-model facebook/opt-1.3b --reward-model facebook/opt-350m --deployment-type single_gpu", but it still OOMs. The error message is:

torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 256.00 MiB (GPU 0; 14.62 GiB total capacity; 13.69 GiB already allocated; 103.94 MiB free; 14.02 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
[2023-04-18 08:41:48,113] [INFO] [launch.py:428:sigkill_handler] Killing subprocess 17965
[2023-04-18 08:41:48,113] [ERROR] [launch.py:434:sigkill_handler] ['/home/ec2-user/anaconda3/envs/pytorch_p39/bin/python3.9', '-u', 'main.py', '--local_rank=0', '--model_name_or_path', 'facebook/opt-1.3b', '--gradient_accumulation_steps', '2', '--lora_dim', '128', '--zero_stage', '0', '--deepspeed', '--output_dir', '/home/ec2-user/DeepSpeedExamples/applications/DeepSpeed-Chat/output/actor-models/1.3b'] exits with return code = 1

Would it be useful to use --offload? By the way, how can I change the batch size if I only use a single GPU? Thank you very much for your help.
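
(For reference, the fragmentation hint in that error message refers to PyTorch's caching-allocator configuration; a minimal sketch of how to try it is to export the variable before launching the script. The value is illustrative, not a verified setting:)

export PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:128   # illustrative value
bash training_scripts/single_gpu/run_1.3b.sh <output_dir>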

mrwyattii commented 1 year ago

@SH0AN can you please share the log output from this file? /home/sh0an/DeepSpeedExamples/applications/DeepSpeed-Chat/output/actor-models/1.3b/training.log

mrwyattii commented 1 year ago

> I already changed the model to "python train.py --actor-model facebook/opt-1.3b --reward-model facebook/opt-350m --deployment-type single_gpu", but it still OOMs. [...] Would it be useful to use --offload? By the way, how can I change the batch size if I only use a single GPU?

You will need to modify the params in the bash script that is being used. For you, that should be DeepSpeedExamples/applications/DeepSpeed-Chat/training/step1_supervised_finetuning/training_scripts/single_gpu/run_1.3b.sh

Please pull the latest changes from the DeepSpeedExamples repo. Recent updates provide a more informative error message on what to do when you get OOM errors.
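
(For reference, a minimal sketch of the kind of edit meant here, inside training_scripts/single_gpu/run_1.3b.sh, assuming it uses the deepspeed launcher like the multi-GPU script above; the batch-size and accumulation values are illustrative:)

deepspeed --num_gpus 1 main.py \
   --model_name_or_path facebook/opt-1.3b \
   --per_device_train_batch_size 1 \
   --per_device_eval_batch_size 1 \
   --gradient_accumulation_steps 8 \
   --zero_stage $ZERO_STAGE \
   --deepspeed \
   --output_dir $OUTPUT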

xlinsz commented 1 year ago

For me, reducing all the batch sizes to 1 works for steps 1 and 2, but not for step 3.

daminho commented 2 months ago

I fine-tuned OPT-1.3B with LoRA (rank 8) on an A100 40GB GPU and it's still not enough. Batch size = 1, gradient accumulation steps = 1, eval steps = 1 (the very lowest settings).
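
(For anyone hitting the same wall: a minimal sketch combining the memory-saving options discussed in this thread, on top of the step 1 script shown above. The --offload flag is only mentioned earlier in the thread and may not exist in every checkout, and --gradient_checkpointing cannot be combined with --only_optimize_lora, so treat this as a starting point to verify rather than a confirmed recipe:)

deepspeed main.py \
   --model_name_or_path facebook/opt-1.3b \
   --per_device_train_batch_size 1 \
   --per_device_eval_batch_size 1 \
   --max_seq_len 512 \
   --gradient_checkpointing \
   --zero_stage 3 \
   --offload \
   --deepspeed \
   --output_dir ./output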