Hi @kouroshHakha and thanks for checking out DeepSpeed-Chat! It looks like p3.16xlarge instances have 8x V100 GPUs with 16GB of VRAM each, is this correct?
I just tested this script locally on A6000 GPUs and I'm seeing more than 16GB used per GPU. Try lowering the --per_device_*_batch_size values (some quick tests on my machine showed that reducing to --per_device_train_batch_size 2 should consume less than 16GB per GPU). You can also increase the --zero_stage (to 1, 2, or 3) to improve per-device GPU memory usage.
Enabling --gradient_checkpointing or --only_optimize_lora can also reduce GPU memory usage; however, these two settings cannot be enabled at the same time. A sketch of how these options could be combined is included after the argument links below.
https://github.com/microsoft/DeepSpeedExamples/blob/e320e75a38e08ec8634c4787a01f50314ba42353/applications/DeepSpeed-Chat/training/step1_supervised_finetuning/main.py#L133 https://github.com/microsoft/DeepSpeedExamples/blob/e320e75a38e08ec8634c4787a01f50314ba42353/applications/DeepSpeed-Chat/training/step1_supervised_finetuning/main.py#L154
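For illustration only, here is a minimal sketch of a step-1 launch command with those memory-saving options applied (the values are examples, not settings confirmed anywhere in this thread):
# illustrative values only; adjust to your GPUs
deepspeed main.py \
   --model_name_or_path facebook/opt-1.3b \
   --per_device_train_batch_size 2 \
   --per_device_eval_batch_size 2 \
   --max_seq_len 512 \
   --zero_stage 3 \
   --gradient_checkpointing \
   --deepspeed \
   --output_dir ./output
Alternatively, --only_optimize_lora (together with a --lora_dim such as 128) could replace --gradient_checkpointing, since the two are mutually exclusive in this script.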
Thanks for getting back. Yes, this is the script that's running:
#!/bin/bash
# Copyright (c) Microsoft Corporation.
# SPDX-License-Identifier: Apache-2.0
# DeepSpeed Team
OUTPUT=$1
ZERO_STAGE=$2
if [ "$OUTPUT" == "" ]; then
OUTPUT=./output
fi
if [ "$ZERO_STAGE" == "" ]; then
ZERO_STAGE=2
fi
mkdir -p $OUTPUT
# --data_path Dahoas/rm-static Dahoas/full-hh-rlhf Dahoas/synthetic-instruct-gptj-pairwise yitingxie/rlhf-reward-datasets openai/webgpt_comparisons stanfordnlp/SHP \
# --data_split 2,4,4 \
deepspeed main.py \
--model_name_or_path facebook/opt-1.3b \
--per_device_train_batch_size 2 \
--per_device_eval_batch_size 2 \
--max_seq_len 512 \
--learning_rate 9.65e-6 \
--weight_decay 0.1 \
--num_train_epochs 2 \
--gradient_accumulation_steps 1 \
--lr_scheduler_type cosine \
--num_warmup_steps 0 \
--seed 1234 \
--zero_stage $ZERO_STAGE \
--deepspeed \
--only_optimize_lora \
--output_dir $OUTPUT #\
#&> $OUTPUT/training.log
Thank you for confirming the script is running. Closing this issue, but please reopen if you see an OOM again.
@kouroshHakha Can you share your Python environment configuration (Python, torch, deepspeed, transformers versions)? My training always errors out. For example, mine is Python 3.8 and DeepSpeed 0.9.0. Thanks.
Hi @Vincent131499, could you please share the error message you see? It will be located in output/actor-model/{model_size}/training.log
@mrwyattii Hello, I want to use the AWS cloud to train, but it always errors even though I have already changed --per_device_eval_batch_size to 2 and enabled --only_optimize_lora and --gradient_checkpointing. Can you give me some suggestions? This is the error message in training.log:
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 400.00 MiB (GPU 0; 14.62 GiB total capacity; 13.59 GiB already allocated; 293.94 MiB free; 14.09 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
[2023-04-17 08:49:35,240] [INFO] [launch.py:428:sigkill_handler] Killing subprocess 4656
[2023-04-17 08:49:35,240] [ERROR] [launch.py:434:sigkill_handler] ['/home/ec2-user/anaconda3/envs/pytorch_p39/bin/python3.9', '-u', 'main.py', '--local_rank=0', '--data_path', 'Dahoas/rm-static', 'Dahoas/full-hh-rlhf', 'Dahoas/synthetic-instruct-gptj-pairwise', 'yitingxie/rlhf-reward-datasets', 'openai/webgpt_comparisons', 'stanfordnlp/SHP', '--data_split', '2,4,4', '--model_name_or_path', 'facebook/opt-13b', '--per_device_train_batch_size', '2', '--per_device_eval_batch_size', '2', '--max_seq_len', '512', '--learning_rate', '1e-4', '--weight_decay', '0.1', '--num_train_epochs', '2', '--gradient_accumulation_steps', '1', '--lr_scheduler_type', 'cosine', '--num_warmup_steps', '0', '--seed', '1234', '--gradient_checkpointing', '--zero_stage', '3', '--lora_dim', '128', '--lora_module_name', 'decoder.layers.', '--deepspeed', '--output_dir', '/home/ec2-user/DeepSpeedExamples/applications/DeepSpeed-Chat/output/actor-models/13b'] exits with return code = 1
@FlowInter the error message indicates you are trying to run the OPT-13b model: '--model_name_or_path', 'facebook/opt-13b'. You will be unable to run such a large model on V100-16GB. Could you please try with a smaller model (like 1.3b)?
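As a side note, the allocator hint in the PyTorch error message can be set through an environment variable before launching. This is only a sketch (the 128 MiB value is arbitrary), and it merely mitigates fragmentation; it will not make a 13B model fit in 16GB of VRAM:
export PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:128   # illustrative value
bash training_scripts/single_gpu/run_1.3b.sh           # then launch the step-1 script as usual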
May I ask how to solve this problem? (The "pip install -r requirements.txt" command ran successfully.)
Output logs:
---=== Running Step 1 ===---
Running:
bash /home/sh0an/DeepSpeedExamples/applications/DeepSpeed-Chat/training/step1_supervised_finetuning/training_scripts/single_gpu/run_1.3b.sh /home/sh0an/DeepSpeedExamples/applications/DeepSpeed-Chat/output/actor-models/1.3b
Traceback (most recent call last):
File "train.py", line 210, in
Launch command: bash /home/sh0an/DeepSpeedExamples/applications/DeepSpeed-Chat/training/step1_supervised_finetuning/training_scripts/single_gpu/run_1.3b.sh /home/sh0an/DeepSpeedExamples/applications/DeepSpeed-Chat/output/actor-models/1.3b
Log output: /home/sh0an/DeepSpeedExamples/applications/DeepSpeed-Chat/output/actor-models/1.3b/training.log
Please see our tutorial at https://github.com/microsoft/DeepSpeedExamples/tree/master/applications/DeepSpeed-Chat/training/step1_supervised_finetuning
Please check that you have installed our requirements: pip install -r requirements.txt
If you are seeing an OOM error, try modifying /home/sh0an/DeepSpeedExamples/applications/DeepSpeed-Chat/training/step1_supervised_finetuning/training_scripts/single_gpu/run_1.3b.sh:
Reduce --per_device_*_batch_size
Increase --zero_stage {0,1,2,3} on multi-gpu setups
Enable --gradient_checkpointing or --only_optimize_lora
@mrwyattii Hello mrwyattii, thank you for your answer. I already changed the model and am now running "python train.py --actor-model facebook/opt-1.3b --reward-model facebook/opt-350m --deployment-type single_gpu", but it still OOMs. The error message is:
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 256.00 MiB (GPU 0; 14.62 GiB total capacity; 13.69 GiB already allocated; 103.94 MiB free; 14.02 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
[2023-04-18 08:41:48,113] [INFO] [launch.py:428:sigkill_handler] Killing subprocess 17965
[2023-04-18 08:41:48,113] [ERROR] [launch.py:434:sigkill_handler] ['/home/ec2-user/anaconda3/envs/pytorch_p39/bin/python3.9', '-u', 'main.py', '--local_rank=0', '--model_name_or_path', 'facebook/opt-1.3b', '--gradient_accumulation_steps', '2', '--lora_dim', '128', '--zero_stage', '0', '--deepspeed', '--output_dir', '/home/ec2-user/DeepSpeedExamples/applications/DeepSpeed-Chat/output/actor-models/1.3b'] exits with return code = 1
Would it be useful to use --offload? Also, how can I change the batch_size if I only use a single GPU? Thank you very much for your help.
@SH0AN can you please share the log output from this file? /home/sh0an/DeepSpeedExamples/applications/DeepSpeed-Chat/output/actor-models/1.3b/training.log
To change the batch size, you will need to modify the params in the bash script that is being used. For you, that should be DeepSpeedExamples/applications/DeepSpeed-Chat/training/step1_supervised_finetuning/training_scripts/single_gpu/run_1.3b.sh
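For illustration, the kind of edit that might be made inside that single-GPU script looks like this; the flag values are only examples and may not match the script's actual defaults:
# inside run_1.3b.sh: lower the per-device batch sizes on the deepspeed launch line (illustrative values)
deepspeed --num_gpus 1 main.py \
   --model_name_or_path facebook/opt-1.3b \
   --per_device_train_batch_size 1 \
   --per_device_eval_batch_size 1 \
   --gradient_accumulation_steps 8 \
   --max_seq_len 512 \
   --zero_stage $ZERO_STAGE \
   --lora_dim 128 \
   --only_optimize_lora \
   --deepspeed \
   --output_dir $OUTPUT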
Please pull the latest changes from the DeepSpeedExamples repo. Recent updates provide a more informative error message on what to do when you get OOM errors.
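For reference, updating an existing local clone is just:
cd DeepSpeedExamples   # your local checkout of the repo
git pull               # pick up the latest DeepSpeed-Chat changes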
For me, reducing all the batch sizes to 1 works for steps 1 and 2, but not for step 3.
I fine-tuned OPT-1.3B with LoRA (rank 8) on an A100 40GB GPU and it's still not enough. Batch size = 1, gradient accumulation steps = 1, eval steps = 1 (the very lowest settings).
Hello,
I am testing out the SFT stage of the example on a p3.16xlarge machine, but it OOMs. Is there anything that I am missing from my configs? Note: I commented out the data so that it picks the small default dataset.