microsoft / DeepSpeedExamples

Example models using DeepSpeed

training 12b model seems to require more memory than expected #447

Open ChaoChungWu-Johnson opened 1 year ago

ChaoChungWu-Johnson commented 1 year ago

Describe the bug Hi, I was trying to finetune the pythia-12b model via the following command using DeepSpeed-Chat's step 1 code. main.py is from DeepSpeed/DeepSpeedExamples/applications/DeepSpeed-Chat/training/step1_supervised_finetuning/main.py

deepspeed main.py \
   --sft_only_data_path {my_dataset} \
   --data_split 10,0,0 \
   --model_name_or_path EleutherAI/pythia-12b-deduped \
   --per_device_train_batch_size 1 \
   --per_device_eval_batch_size 1 \
   --max_seq_len 512 \
   --learning_rate 9.65e-6 \
   --weight_decay 0. \
   --num_train_epochs 16  \
   --gradient_accumulation_steps 8 \
   --lr_scheduler_type cosine \
   --num_warmup_steps 0 \
   --seed 1234 \
   --lora_dim 64 \
   --gradient_checkpointing \
   --zero_stage 3 \
   --deepspeed \
   --output_dir $OUTPUT_PATH \
   &> $OUTPUT_PATH/training.log

and according to ZeRO-3's estimation, finetuning the model should only require resources like:

Some weights of the model checkpoint at EleutherAI/pythia-12b-deduped were not used when initializing GPTNeoXModel: ['embed_out.weight']
Estimated memory needed for params, optim states and gradients for a:
HW: Setup with 1 node, 8 GPUs per node.
SW: Model with 11586M total params, 259M largest layer params.
  per CPU  |  per GPU |   Options
  291.35GB |   0.97GB | offload_param=cpu , offload_optimizer=cpu , zero_init=1
  517.96GB |   0.97GB | offload_param=cpu , offload_optimizer=cpu , zero_init=0
  258.98GB |   3.66GB | offload_param=none, offload_optimizer=cpu , zero_init=1
  517.96GB |   3.66GB | offload_param=none, offload_optimizer=cpu , zero_init=0
   11.60GB |  25.25GB | offload_param=none, offload_optimizer=none, zero_init=1
  517.96GB |  25.25GB | offload_param=none, offload_optimizer=none, zero_init=0
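
For reference, this table looks like the output of DeepSpeed's ZeRO-3 memory estimator. As a minimal sketch (not part of the original report), the same estimate can be reproduced offline from the parameter counts printed in the log via DeepSpeed's documented estimator helper:

# Sketch: reproduce the ZeRO-3 memory estimate above from the parameter
# counts in the log (11586M total params, 259M largest-layer params).
from deepspeed.runtime.zero.stage3 import estimate_zero3_model_states_mem_needs_all_cold

estimate_zero3_model_states_mem_needs_all_cold(
    total_params=11586e6,        # "11586M total params" from the log
    largest_layer_params=259e6,  # "259M largest layer params" from the log
    num_gpus_per_node=8,
    num_nodes=1,
)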

Since I have 720 GB of CPU RAM and 8× 32GB V100 GPUs in total, this spec looks sufficient to run even with such a small batch size (only 1 now), but I still got an OOM error, with memory usage close to 100% (30-31GB out of 32GB) on each GPU. Any idea why it consumes so much memory?
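
Note that the table only predicts roughly 1-4GB per GPU when parameters and/or optimizer states are offloaded to CPU; without any offload, ZeRO-3 already needs about 25GB per GPU for model states alone, before activations and temporary buffers, which is tight on a 32GB V100. As a hedged sketch (not the config the DeepSpeed-Chat helpers actually build), the "offload_param=cpu, offload_optimizer=cpu" rows of the table correspond to a zero_optimization section like this, assuming a config dict passed directly to deepspeed.initialize:

# Sketch only: ZeRO-3 with CPU offload for params and optimizer states,
# matching the ~0.97GB-per-GPU rows of the estimator table above.
import deepspeed

ds_config = {
    "train_micro_batch_size_per_gpu": 1,
    "gradient_accumulation_steps": 8,
    "fp16": {"enabled": True},
    "zero_optimization": {
        "stage": 3,
        "offload_param": {"device": "cpu", "pin_memory": True},
        "offload_optimizer": {"device": "cpu", "pin_memory": True},
    },
}

# model and optimizer are assumed to be created elsewhere (hypothetical names):
# engine, optimizer, _, _ = deepspeed.initialize(model=model,
#                                                optimizer=optimizer,
#                                                config=ds_config)

If the DeepSpeed-Chat script exposes its own CPU-offload option, that is the simpler route; the dict above only shows which config knobs the estimator rows refer to.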

Another alternative I tried to deal with the OOM: replacing --gradient_checkpointing with --only_optimize_lora, but this resulted in an index error, which I guess is another bug. The error message is quite long, so I'll paste it in the additional context.

To Reproduce Steps to reproduce the behavior: run the command above with the environment settings described.

Expected behavior Training should complete successfully.

ds_report output

--------------------------------------------------
DeepSpeed C++/CUDA extension op report
--------------------------------------------------
NOTE: Ops not installed will be just-in-time (JIT) compiled at
      runtime if needed. Op compatibility means that your system
      meet the required dependencies to JIT install the op.
--------------------------------------------------
JIT compiled ops requires ninja
ninja .................. [OKAY]
--------------------------------------------------
op name ................ installed .. compatible
--------------------------------------------------
 [WARNING]  async_io requires the dev libaio .so object and headers but these were not found.
 [WARNING]  async_io: please install the libaio-dev package with apt
 [WARNING]  If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
async_io ............... [NO] ....... [NO]
cpu_adagrad ............ [NO] ....... [OKAY]
cpu_adam ............... [NO] ....... [OKAY]
fused_adam ............. [NO] ....... [OKAY]
fused_lamb ............. [NO] ....... [OKAY]
quantizer .............. [NO] ....... [OKAY]
random_ltd ............. [NO] ....... [OKAY]
 [WARNING]  using untested triton version (2.0.0), only 1.0.0 is known to be compatible
sparse_attn ............ [NO] ....... [NO]
spatial_inference ...... [NO] ....... [OKAY]
transformer ............ [NO] ....... [OKAY]
stochastic_transformer . [NO] ....... [OKAY]
transformer_inference .. [NO] ....... [OKAY]
utils .................. [NO] ....... [OKAY]
--------------------------------------------------
DeepSpeed general environment info:
torch install path ............... ['/usr/local/lib/python3.8/dist-packages/torch']
torch version .................... 1.14.0a0+44dac51
deepspeed install path ........... ['/home/twsfphn198/.local/lib/python3.8/site-packages/deepspeed']
deepspeed info ................... 0.9.1, unknown, unknown
torch cuda version ............... 12.0
torch hip version ................ None
nvcc version ..................... 12.0
deepspeed wheel compiled w. ...... torch 1.14, cuda 12.0

Screenshots If applicable, add screenshots to help explain your problem.

System info (please complete the following information):

Launcher context Are you launching your experiment with the deepspeed launcher, MPI, or something else? Yes, with the deepspeed launcher.

Docker context Are you using a specific docker image that you can share? No.

Additional context Index error when using --only_optimize_lora instead of --gradient_checkpointing: the message is long, so I paste only the latter part and remove the traceback lines repeated from the other ranks; if you need the whole message, please tell me!

[2023-04-26 15:22:51,673] [INFO] [utils.py:785:see_memory_usage] Stage 3 initialize beginning
[2023-04-26 15:22:51,673] [INFO] [utils.py:786:see_memory_usage] MA 3.86 GB         Max_MA 4.83 GB         CA 11.34 GB         Max_CA 11 GB
[2023-04-26 15:22:51,674] [INFO] [utils.py:793:see_memory_usage] CPU Virtual Memory:  used = 110.99 GB, percent = 14.7%
[2023-04-26 15:22:51,676] [INFO] [stage3.py:113:__init__] Reduce bucket size 500,000,000
[2023-04-26 15:22:51,676] [INFO] [stage3.py:114:__init__] Prefetch bucket size 30000000
Using /home/twsfphn198/.cache/torch_extensions/py38_cu120 as PyTorch extensions root...
ninja: no work to do.
Loading extension module utils...
Time to load utils op: 0.5340464115142822 seconds
Traceback (most recent call last):
  File "main.py", line 345, in <module>
    main()
  File "main.py", line 290, in main
    model, optimizer, _, lr_scheduler = deepspeed.initialize(
  File "/home/twsfphn198/.local/lib/python3.8/site-packages/deepspeed/__init__.py", line 165, in initialize
    engine = DeepSpeedEngine(args=args,
  File "/home/twsfphn198/.local/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 308, in __init__
Loading extension module utils...
    self._configure_optimizer(optimizer, model_parameters)
  File "/home/twsfphn198/.local/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 1167, in _configure_optimizer
    self.optimizer = self._configure_zero_optimizer(basic_optimizer)
  File "/home/twsfphn198/.local/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 1445, in _configure_zero_optimizer
Time to load utils op: 0.10356998443603516 seconds
Traceback (most recent call last):
      File "main.py", line 345, in <module>
optimizer = DeepSpeedZeroOptimizer_Stage3(
  File "/home/twsfphn198/.local/lib/python3.8/site-packages/deepspeed/runtime/zero/stage3.py", line 133, in __init__
    self.dtype = self.optimizer.param_groups[0]['params'][0].dtype
IndexError: list index out of range
    main()
  File "main.py", line 290, in main
    model, optimizer, _, lr_scheduler = deepspeed.initialize(
  File "/home/twsfphn198/.local/lib/python3.8/site-packages/deepspeed/__init__.py", line 165, in initialize
    engine = DeepSpeedEngine(args=args,
  File "/home/twsfphn198/.local/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 308, in __init__
    self._configure_optimizer(optimizer, model_parameters)
  File "/home/twsfphn198/.local/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 1167, in _configure_optimizer
    self.optimizer = self._configure_zero_optimizer(basic_optimizer)
  File "/home/twsfphn198/.local/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 1445, in _configure_zero_optimizer
    optimizer = DeepSpeedZeroOptimizer_Stage3(
  File "/home/twsfphn198/.local/lib/python3.8/site-packages/deepspeed/runtime/zero/stage3.py", line 133, in __init__
    self.dtype = self.optimizer.param_groups[0]['params'][0].dtype
IndexError: list index out of range

[2023-04-26 15:22:59,098] [INFO] [launch.py:428:sigkill_handler] Killing subprocess 9668
[2023-04-26 15:22:59,102] [INFO] [launch.py:428:sigkill_handler] Killing subprocess 9669
[2023-04-26 15:22:59,104] [INFO] [launch.py:428:sigkill_handler] Killing subprocess 9670
[2023-04-26 15:22:59,318] [INFO] [launch.py:428:sigkill_handler] Killing subprocess 9671
[2023-04-26 15:22:59,320] [INFO] [launch.py:428:sigkill_handler] Killing subprocess 9673
[2023-04-26 15:22:59,321] [INFO] [launch.py:428:sigkill_handler] Killing subprocess 9675
[2023-04-26 15:22:59,323] [INFO] [launch.py:428:sigkill_handler] Killing subprocess 9677
[2023-04-26 15:22:59,323] [INFO] [launch.py:428:sigkill_handler] Killing subprocess 9679
[2023-04-26 15:22:59,617] [ERROR] [launch.py:434:sigkill_handler] ['/usr/bin/python', '-u', 'main.py', '--local_rank=7', '--sft_only_data_path', 'appier/martechQA', 'sharegpt', '--data_split', '2,4,4', '--model_name_or_path', 'EleutherAI/pythia-12b-deduped', '--per_device_train_batch_size', '1', '--per_device_eval_batch_size', '1', '--max_seq_len', '512', '--learning_rate', '1e-4', '--weight_decay', '0.', '--num_train_epochs', '16', '--gradient_accumulation_steps', '1', '--lr_scheduler_type', 'cosine', '--num_warmup_steps', '0', '--seed', '1234', '--zero_stage', '3', '--only_optimize_lora', '--lora_dim', '128', '--lora_module_name', 'decoder.layers.', '--deepspeed', '--output_dir', '/home/twsfphn198/DeepSpeedExamples/applications/DeepSpeed-Chat/output/actor-models/pythia-12b-deduped'] exits with return code = 1
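
The traceback shows param_groups[0]['params'] is empty, i.e. the optimizer was constructed with nothing to optimize. One possible cause, given that the failing command passes --lora_module_name 'decoder.layers.': that pattern may not match any module in a GPTNeoX-based model such as Pythia, so no LoRA layers get injected and --only_optimize_lora leaves the param group empty. A quick diagnostic sketch (using a small Pythia checkpoint, which is assumed to share the 12B model's module naming, just to inspect names):

# Diagnostic sketch (not from the original post): check whether the
# --lora_module_name pattern matches any module names in a Pythia model.
# If nothing matches, no LoRA parameters are created and --only_optimize_lora
# yields an empty first param group, reproducing the IndexError above.
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("EleutherAI/pythia-70m")

pattern = "decoder.layers."  # value passed in the failing run above
matches = [name for name, _ in model.named_modules() if pattern in name]
print(len(matches), "modules match", repr(pattern))

# GPTNeoX layers are named e.g. "gpt_neox.layers.0.attention.query_key_value",
# so printing a few real names shows what the pattern would need to look like:
print([name for name, _ in model.named_modules() if "layers.0." in name][:5])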
ChaoChungWu-Johnson commented 1 year ago

Hi @mrwyattii, any idea about the index error when using LoRA? Or what would be the best practice for training 12B models? I also failed to run the example run_13b.sh successfully.

laoda513 commented 1 year ago

Maybe I'm wrong, but I think the real resource requirements are much bigger than stated in the docs.

I used a 4× 2080 Ti 22GB node to train the 1.3B model; stage 1 took 32 hours and the VRAM of the GPUs was almost full (see attached screenshot). And this is with a batch size of 4, while in the example it was set to 8. I also tried Colab with an A100 40GB, and even with per_device_train_batch_size 1 and gradient_checkpointing it threw an OOM (see attached screenshots).

But according to the doc, the 1.3B training can be done on a single 48GB A6000 within 2 hours. Is that actually possible, or did I just mess up some important settings?
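
As a back-of-envelope sanity check (a rough sketch, not an exact DeepSpeed accounting): with mixed-precision Adam, model states take roughly 16 bytes per parameter (fp16 params and grads plus fp32 master weights, momentum, and variance), and ZeRO-3 partitions that across the data-parallel GPUs; activations, temporary buffers, and fragmentation come on top and are usually what fills the rest of the card.

# Rough ZeRO-3 model-state estimate for a 1.3B-parameter model with
# mixed-precision Adam; activations and buffers are deliberately ignored.
params = 1.3e9
bytes_per_param = 2 + 2 + 12  # fp16 params + fp16 grads + fp32 Adam states
for n_gpus in (1, 4, 8):
    per_gpu_gb = params * bytes_per_param / n_gpus / 2**30
    print(f"{n_gpus} GPU(s): ~{per_gpu_gb:.1f} GB of model states per GPU")
# ~19.4 GB on 1 GPU, ~4.8 GB on 4 GPUs, ~2.4 GB on 8 GPUs, so a nearly full
# 22 GB or 40 GB card is dominated by activations and buffers rather than by
# the partitioned model states themselves.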