Closed Dominic789654 closed 1 year ago
same question
Have you solved this problem?
Can you please share a stack trace? Also, please try setting all pin_memory to false.
I think the OOM issue on the RAM is happening because DeepSpeed is trying to load eight models at the same time, which is causing the CPU memory to not have enough space for offloading. Is there a way in DeepSpeed to set arguments to load the models one by one?
@Dominic789654, what you suggest is theoretically possible. However, without seeing the code, it is unclear to me whether DeepSpeed is actually loading the checkpoints, as opposed to HF for example. So, a stack trace at the minimum would be helpful to understand what is actually going on. Thanks!
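For a sense of scale, here is a back-of-the-envelope estimate of the hypothesis above. It is only a sketch: it assumes each of the 8 processes materializes a full copy of the weights in CPU RAM before any partitioning happens, which the thread has not confirmed, and it ignores optimizer state, gradients and pinned buffers. The function name is made up for this illustration.

```python
def replicated_load_ram_gb(n_params: float, bytes_per_param: int, n_ranks: int) -> float:
    """Host RAM in GB (1 GB = 1e9 bytes) if every rank holds a full copy of the weights."""
    return n_params * bytes_per_param * n_ranks / 1e9


N_PARAMS = 33e9  # "33B" taken at face value (assumption)
N_RANKS = 8      # one process per GPU on an 8-GPU node

fp16_total = replicated_load_ram_gb(N_PARAMS, 2, N_RANKS)  # 2 bytes per fp16 weight
fp32_total = replicated_load_ram_gb(N_PARAMS, 4, N_RANKS)  # 4 bytes per fp32 weight

print(f"fp16 copies: {fp16_total:.0f} GB, fp32 copies: {fp32_total:.0f} GB")
```

Under these assumptions, eight fp16 copies alone come to 528 GB and eight fp32 copies to 1056 GB, which would be enough to exhaust a 600 GB host by itself.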
@Dominic789654 you may try my latest PR https://github.com/microsoft/DeepSpeed/pull/3629. This patch allows loading the checkpoint serially, so that resuming training from a checkpoint does not cause a memory peak.
@tjruwase Almost the same setting (finetuning llama 33b on 8*A100 40G, 670G RAM). It looks like it reports CUDA OOM while moving the model to GPUs (33B requires at least 66GB memory).
Neither stage3_max_live_parameters nor offloading (to cpu or nvme) makes a difference.
For some reason, is_zero3_model at engine.py L1048 is False even though I set it to True in the config.
Initializing deepspeed took 18.02s
Traceback (most recent call last):
File "train_deepspeed.py", line 323, in train
model_engine, optimizer, _, scheduler = deepspeed.initialize(config=args.deepspeed_config, model=model,
File "/export/home/project/llm/DeepSpeed/deepspeed/__init__.py", line 165, in initialize
engine = DeepSpeedEngine(args=args,
File "/export/home/project/llm/DeepSpeed/deepspeed/runtime/engine.py", line 267, in __init__
self._configure_distributed_model(model)
File "/export/home/project/llm/DeepSpeed/deepspeed/runtime/engine.py", line 1049, in _configure_distributed_model
self.module.to(self.device)
File "/export/share/ruimeng/env/anaconda/envs/codegen/lib/python3.8/site-packages/transformers/modeling_utils.py", line 1878, in to
return super().to(*args, **kwargs)
File "/export/share/ruimeng/env/anaconda/envs/codegen/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1145, in to
return self._apply(convert)
File "/export/share/ruimeng/env/anaconda/envs/codegen/lib/python3.8/site-packages/torch/nn/modules/module.py", line 797, in _apply
module._apply(fn)
File "/export/share/ruimeng/env/anaconda/envs/codegen/lib/python3.8/site-packages/torch/nn/modules/module.py", line 797, in _apply
module._apply(fn)
File "/export/share/ruimeng/env/anaconda/envs/codegen/lib/python3.8/site-packages/torch/nn/modules/module.py", line 797, in _apply
module._apply(fn)
[Previous line repeated 2 more times]
File "/export/share/ruimeng/env/anaconda/envs/codegen/lib/python3.8/site-packages/torch/nn/modules/module.py", line 820, in _apply
param_applied = fn(param)
File "/export/share/ruimeng/env/anaconda/envs/codegen/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1143, in convert
return t.to(device, dtype if t.is_floating_point() or t.is_complex() else None, non_blocking)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 228.00 MiB (GPU 6; 39.59 GiB total capacity; 38.14 GiB already allocated; 226.12 MiB free; 38.36 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
My config file:
{
  "fp16": {
    "enabled": true,
    "auto_cast": false,
    "loss_scale": 0,
    "loss_scale_window": 1000,
    "initial_scale_power": 16,
    "hysteresis": 2,
    "min_loss_scale": 1
  },
  "optimizer": {
    "type": "AdamW",
    "params": {
      "betas": [
        0.9,
        0.999
      ],
      "eps": 1e-8,
      "weight_decay": 0.1
    }
  },
  "scheduler": {
    "type": "WarmupDecayLR",
    "params": {
      "warmup_min_lr": 0,
      "warmup_max_lr": 0.00004,
      "warmup_num_steps": 300,
      "warmup_type": "linear",
      "total_num_steps": 3000
    }
  },
  "zero_optimization": {
    "stage": 3,
    "contiguous_gradients": true,
    "overlap_comm": true,
    "reduce_scatter": true,
    "allgather_bucket_size": 5e8,
    "stage3_max_live_parameters": 1e9,
    "stage3_max_reuse_distance": 1e9,
    "reduce_bucket_size": "auto",
    "stage3_prefetch_bucket_size": "auto",
    "stage3_param_persistence_threshold": "auto",
    "sub_group_size": 1e11,
    "stage3_gather_16bit_weights_on_model_save": true,
    "offload_param": {
      "device": "cpu",
      "pin_memory": false
    },
    "offload_optimizer": {
      "device": "cpu",
      "pin_memory": false
    }
  },
  "gradient_clipping": 1,
  "steps_per_print": 10,
  "wall_clock_breakdown": false,
  "compression_training": {
    "weight_quantization": {
      "shared_parameters": {},
      "different_groups": {}
    },
    "activation_quantization": {
      "shared_parameters": {},
      "different_groups": {}
    },
    "sparse_pruning": {
      "shared_parameters": {},
      "different_groups": {}
    },
    "row_pruning": {
      "shared_parameters": {},
      "different_groups": {}
    },
    "head_pruning": {
      "shared_parameters": {},
      "different_groups": {}
    },
    "channel_pruning": {
      "shared_parameters": {},
      "different_groups": {}
    }
  },
  "train_batch_size": 128,
  "train_micro_batch_size_per_gpu": 1,
  "gradient_accumulation_steps": 16
}
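As a quick sanity check on the last three keys of this config: DeepSpeed expects train_batch_size to equal the micro batch size per GPU times the gradient accumulation steps times the number of GPUs. The sketch below verifies that, assuming a world size of 8 (taken from the "8*A100" setup described above, not from the config itself); the helper name is made up for this example.

```python
def expected_train_batch_size(micro_batch_per_gpu: int, grad_accum_steps: int, world_size: int) -> int:
    """Global batch size implied by the per-GPU batch, accumulation steps and GPU count."""
    return micro_batch_per_gpu * grad_accum_steps * world_size


# Values copied from the config above.
config = {
    "train_batch_size": 128,
    "train_micro_batch_size_per_gpu": 1,
    "gradient_accumulation_steps": 16,
}
WORLD_SIZE = 8  # assumption: one process per GPU on a single 8-GPU node

expected = expected_train_batch_size(
    config["train_micro_batch_size_per_gpu"],
    config["gradient_accumulation_steps"],
    WORLD_SIZE,
)
is_consistent = expected == config["train_batch_size"]
print(f"expected={expected}, configured={config['train_batch_size']}, consistent={is_consistent}")
```

With 8 GPUs the three values agree, so the batch settings are unlikely to be the source of the failure. Launching the same config on a different number of GPUs would require adjusting one of them.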
Wow, this started in May and still hasn't closed, Deepspeed folks are really slow!
Has anyone found a solution? I am trying to fine-tune 33B on 8 * A100 40G with 900G RAM. It consumes 680GB during training, and it cannot save at the end because saving leads to OOM even with 900G RAM...
I can't even train 3b model with the same config posted here
@djaym7 I can train 3b, 7b and 13b in the same environment. These 3 models consume a normal amount of RAM, e.g. 100G ~ 200G. The 33B model, however, consumes dramatically more CPU RAM, over 600G. I think this is because the 33B model is larger than a single A100 (40G), which leads to unknown errors.
Before the llama impl was merged into mega-ds, we implemented another llama in our private repo. We found that you can train at most 13B llama w/o offloading on 8 40GB A100s. So I guess you just can't.
@nrailgun Have you tried w/ offload? In my case, I offload the optimizer to RAM for 33B, and it does train smoothly. The issue occurs when saving.
I am likely doing something wrong, @LuJunru do you have your training code on git?
@djaym7 Not yet. I recommend following alpaca: https://github.com/tatsu-lab/stanford_alpaca. Most of the settings are similar.
Thanks, I was trying stage 1 and 2 deepspeed, will tryout fsdp in trainer too. Thanks
@djaym7 OK, I found that another process was blocking my saving. I can briefly report here that I used 750G ~ 800G RAM for training and saving (the seq len is 2048). It can be fine-tuned on a single node with 8 * A100 40G. If you don't have that much RAM, try using multiple nodes; deepspeed can split the RAM consumption across nodes.
@LuJunru how do you make it work on 8*A100 40G? Do you use just the same config as this?
@memray Exactly. I used deepspeed zero3 offloads + flash attention.
@LuJunru I get a CUDA OOM error every time, even on 16-GPU nodes. It moves the model to the GPUs during initialization, even though I use stage 3. I will try flash attention.
│ /export/home/project/llm/DeepSpeed/deepspeed/runtime/engine.py:268 in │
│ __init__ │
│ │
│ 265 │ │ self.pipeline_parallelism = isinstance(model, PipelineModule) │
│ 266 │ │ │
│ 267 │ │ # Configure distributed model │
│ ❱ 268 │ │ self._configure_distributed_model(model) │
│ 269 │ │ │
│ 270 │ │ self._get_model_parameters() │
│ 271 │
│ │
│ /export/home/project/llm/DeepSpeed/deepspeed/runtime/engine.py:1069 in │
│ _configure_distributed_model │
│ │
│ 1066 │ │ │
│ 1067 │ │ # zero.Init() handles device placement of model │
│ 1068 │ │ if not self.dont_change_device: │
│ ❱ 1069 │ │ │ self.module.to(self.device) │
│ 1070 │ │ │
│ 1071 │ │ # MoE related initialization │
│ 1072 │ │ for _, module in self.module.named_modules(): │
│
@memray You may probably test following official strategies, here's one from HF https://huggingface.co/docs/transformers/main_classes/deepspeed#how-to-choose-which-zero-stage-and-offloads-to-use-for-best-performance:
First of all set batch size to 1 (you can always use gradient accumulation for any desired effective batch size).
1. Enable --gradient_checkpointing 1 (HF Trainer) or directly model.gradient_checkpointing_enable(). If OOM then
2. Try ZeRO stage 2 first. If OOM then
3. Try ZeRO stage 2 + offload_optimizer. If OOM then
4. Switch to ZeRO stage 3. If OOM then
5. Enable offload_param to cpu. If OOM then
6. Enable offload_optimizer to cpu. If OOM then
7. If you still can’t fit a batch size of 1 first check various default values and lower them if you can. For example, if you use generate and you don’t use a wide search beam make it narrower as it’d take a lot of memory.
8. Definitely use mixed half-precision over fp32, so bf16 on Ampere and higher GPUs and fp16 on older gpu architectures.
9. If you still OOM you could add more hardware or enable ZeRO-Infinity, that is switch offloads offload_param and offload_optimizer to nvme. You need to make sure it’s a very fast nvme. As an anecdote I was able to infer BLOOM-176B on a tiny GPU using ZeRO-Infinity except it was extremely slow. But it worked!
From my experience, it works at 6.
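Steps 2 to 6 of the ladder quoted from the HF docs map onto the "zero_optimization" section of a DeepSpeed config roughly as follows. This is a minimal sketch, not an official recipe: the function name is made up, CPU is assumed as the offload target for step 3, and only the stage and offload keys are set (bucket sizes, pin_memory and the like are left to defaults).

```python
def zero_section(step: int) -> dict:
    """Build a 'zero_optimization' dict for steps 2-6 of the escalation ladder."""
    if step not in (2, 3, 4, 5, 6):
        raise ValueError("only steps 2-6 map directly onto ZeRO settings")

    # Steps 2-3 stay on ZeRO stage 2; steps 4-6 use stage 3.
    section = {"stage": 2 if step <= 3 else 3}

    # Step 3 (stage 2) and step 6 (stage 3) offload the optimizer state.
    if step in (3, 6):
        section["offload_optimizer"] = {"device": "cpu"}

    # Steps 5 and 6 offload the parameters, which requires stage 3.
    if step >= 5:
        section["offload_param"] = {"device": "cpu"}

    return section


for step in range(2, 7):
    print(step, zero_section(step))
```

Step 6 corresponds to the zero_optimization settings in the config posted earlier in this thread (stage 3 with both offloads on CPU).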
@LuJunru Hi, does this mean you have successfully fine-tuned a 33B-parameter model using zero stage3 + offload optimizer & param on A100 40G * 8 + 600G CPU RAM? I used A100 80G * 8 + 1T RAM, but still encountered CPU RAM OOM (exitcode: -9). Would you mind sharing your environment configuration, such as the version of deepspeed, flash-attn, and cuda? Also, did you use bf16? Thank you very much!
@s1ghhh Sure. Here's some configs:
deepspeed: 0.9.2
torch: 2.0.1 (flash attention is in it)
cuda: V11.3.109
I used 800G CPU RAM with batch 8 and accumulation 2, and received a memory pressure warning. Reducing the batch size will help. I guess you could run with batch 8 under 1T RAM.
@LuJunru Many Thanks! Would you mind sharing your Deepspeed script, please? I have tried other scripts from this issue and Deepspeed's official default script, but I am hoping to rule out any issues related to the Deepspeed configuration script. Thank you again for your willingness to share. In any case, I will make an effort to try it out and publish the results.
@s1ghhh I'm afraid I can't right now. We hope to release it next month.
@LuJunru I understand your situation. Thanks again.
@LuJunru thanks for sharing the information. My code got stuck here (as shown below), since it moves the whole model to GPU during initialization, training hasn't even started. I don't really understand why it behaves this way... By the way, can you let me know which Huggingface checkpoint you are using? Is it huggyllama/llama-30b?
    # zero.Init() handles device placement of model
    if not self.dont_change_device:
        self.module.to(self.device)
@memray I ran into similar issues before. In my situation, it was caused by the environment variable CUDA_LAUNCH_BLOCKING=1; not sure about your case. I fine-tuned Vicuna 33B.
@LuJunru Thanks! But it didn't work out for me :( One last thing to confirm, are you doing full-model tuning or LoRA?
@memray
Full-model tuning
@LuJunru really appreciated! Do you mind sharing which codebase you work on, so I can refer to it for details? Also are you loading Vicuna 33B using Huggingface from_pretrained(), like lmsys/vicuna-33b-v1.3? I'm using the code below to load the model
dschf = HfDeepSpeedConfig(args.deepspeed_config) # keep this object alive
model = AutoModelForCausalLM.from_pretrained('huggyllama/llama-30b')
But I run into strange errors like RuntimeError: NCCL Error 1: unhandled cuda error. I wonder whether the error stems from the integration of HF and DeepSpeed. So your successful experience is greatly appreciated.
Best, Rui
@memray
Hi Rui,
You can refer to https://github.com/tatsu-lab/stanford_alpaca. I used the Trainer class from HF to load models, and just used --deepspeed to add the DP plugin. Hope this helps!
Junru
Can we configure DeepSpeed to load only 2-3 models onto 8 GPUs, rather than loading 8 models onto 8 GPUs?
Hi @s1ghhh and @memray, you can check my general scripts here: https://github.com/LuJunru/LLM_SFT/tree/main if you still need.
@Dominic789654 Actually it's not DeepSpeed's problem:
I faced exactly the same problem when using ZeRO stage 2: loading Llama-2-7b takes 220+ GB of RAM, and Llama-2-13b runs into OOM (my device only has 250 GB of RAM).
The problem is that LlamaForCausalLM.from_pretrained(model_name_or_path) loads the shards on the CPU by default; adding the parameter device_map=auto will resolve it.
Moreover, the weights are loaded in fp32 by default, so you need to set torch_dtype=torch.float16; see https://discuss.huggingface.co/t/llama-7b-gpu-memory-requirement/34323
Additionally, I also set low_cpu_mem_usage=True.
Last, device_map=auto and low_cpu_mem_usage=True are not compatible with ZeRO stage 3.
Hope it works for you😛
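The loading options from the comment above can be collected into a small helper. This is a sketch that only mirrors that comment's suggestions: the function name is made up, the dtype is passed in by the caller so the snippet runs without torch, and whether device_map="auto" is appropriate for multi-process data-parallel training is not confirmed anywhere in this thread.

```python
def loader_kwargs(zero_stage: int, half_dtype) -> dict:
    """Keyword arguments for from_pretrained(), following the suggestions in the comment above."""
    # Load weights in half precision instead of the fp32 default.
    kwargs = {"torch_dtype": half_dtype}

    # Per the comment, these two options do not work together with ZeRO stage 3,
    # so they are only added for the other stages.
    if zero_stage != 3:
        kwargs["low_cpu_mem_usage"] = True
        kwargs["device_map"] = "auto"
    return kwargs


# Usage (not executed here; requires torch and transformers):
#   import torch
#   from transformers import LlamaForCausalLM
#   model = LlamaForCausalLM.from_pretrained(
#       model_name_or_path, **loader_kwargs(2, torch.float16)
#   )
print(loader_kwargs(2, "float16-placeholder"))
print(loader_kwargs(3, "float16-placeholder"))
```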
@LuJunru Hi, we hit the same issue: we cannot save at the end when fine-tuning 33B. Could you share more about how you solved it? Thanks!
@Zui-C Hi, here is my saving function: https://github.com/LuJunru/LLM_SFT/blob/main/code/codes/train/train.py#L175.
I am fine-tuning the Llama 33B model on a server with 8*A100 40G GPUs and 600GB RAM, but I keep running into OOM on RAM. I am mainly using the default zero3.config template.
I've tried modifying this config by not offloading parameters and only offloading the optimizer to the CPU, or not offloading parameters and only offloading the optimizer to NVMe. However, none of these attempts have been successful; they all result in RAM OOM. Do you have any suggestions for my situation?