microsoft / DeepSpeed

DeepSpeed is a deep learning optimization library that makes distributed training and inference easy, efficient, and effective.
https://www.deepspeed.ai/
Apache License 2.0

[BUG] try to finetune a llama 33b on 8*A100 40G, 600G RAM. But always OOM on RAM. #3448

Closed Dominic789654 closed 1 year ago

Dominic789654 commented 1 year ago

I am fine-tuning the 33B Llama model on a server with 8*A100 40G GPUs and 600GB RAM, but I keep running into OOM on RAM. I am mainly using the default ZeRO-3 config template.

{
    "fp16": {
        "enabled": "auto",
        "loss_scale": 0,
        "loss_scale_window": 1000,
        "initial_scale_power": 16,
        "hysteresis": 2,
        "min_loss_scale": 1
    },

    "bf16": {
        "enabled": "auto"
    },

    "optimizer": {
        "type": "AdamW",
        "params": {
            "lr": "auto",
            "betas": "auto",
            "eps": "auto",
            "weight_decay": "auto"
        }
    },

    "zero_optimization": {
        "stage": 3,
        "offload_optimizer": {
            "device": "cpu",
            "pin_memory": true
        },
        "offload_param": {
            "device": "cpu",
            "pin_memory": true
        },
        "overlap_comm": true,
        "contiguous_gradients": true,
        "sub_group_size": 1e9,
        "reduce_bucket_size": "auto",
        "stage3_prefetch_bucket_size": "auto",
        "stage3_param_persistence_threshold": "auto",
        "stage3_max_live_parameters": 1e9,
        "stage3_max_reuse_distance": 1e9,
        "stage3_gather_16bit_weights_on_model_save": true
    },

    "gradient_accumulation_steps": "auto",
    "gradient_clipping": "auto",
    "steps_per_print": 2000,
    "train_batch_size": "auto",
    "train_micro_batch_size_per_gpu": "auto",
    "wall_clock_breakdown": false
}

I've tried modifying this config by not offloading parameters and offloading only the optimizer to the CPU, or by offloading only the optimizer to NVMe. However, none of these attempts has been successful; they all result in RAM OOM. Do you have any suggestions for my situation?
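
For reference, an optimizer-only NVMe offload looks roughly like this in the zero_optimization section (the nvme_path below is a placeholder, not a value from this issue):

"zero_optimization": {
    "stage": 3,
    "offload_optimizer": {
        "device": "nvme",
        "nvme_path": "/local_nvme",
        "pin_memory": true
    }
}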

hujunchao commented 1 year ago

same question

KelleyYin commented 1 year ago

Have you solved this problem?

tjruwase commented 1 year ago

Can you please share a stack trace?

Also, please try setting all pin_memory to false.
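
Applied to the config above, that suggestion amounts to:

"offload_optimizer": {
    "device": "cpu",
    "pin_memory": false
},
"offload_param": {
    "device": "cpu",
    "pin_memory": false
}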

Dominic789654 commented 1 year ago

> Can you please share a stack trace?
>
> Also, please try setting all pin_memory to false.

I think the RAM OOM is happening because DeepSpeed is trying to load eight copies of the model (one per GPU) at the same time, so the CPU memory does not have enough space left for offloading. Is there a way in DeepSpeed to set arguments so that the models are loaded one by one?

tjruwase commented 1 year ago

@Dominic789654, what you suggest is theoretically possible. However, without seeing the code, it is unclear to me whether DeepSpeed is actually loading the checkpoints, as opposed to HF for example. So, a stack trace at the minimum would be helpful to understand what is actually going on. Thanks!
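
One way to avoid every rank materializing a full copy on the CPU is the HF ZeRO-3 integration: keeping an HfDeepSpeedConfig object alive before from_pretrained enables zero.Init, so parameters are partitioned across ranks as they are created, which can reduce the per-rank CPU memory needed at load time. A minimal sketch (the config path and model name are placeholders):

import deepspeed
from transformers import AutoModelForCausalLM
from transformers.deepspeed import HfDeepSpeedConfig

ds_config = "ds_config_zero3.json"     # placeholder path to a ZeRO-3 config
dschf = HfDeepSpeedConfig(ds_config)   # must stay alive: from_pretrained then builds the model under zero.Init
model = AutoModelForCausalLM.from_pretrained("huggyllama/llama-30b")
engine, optimizer, _, scheduler = deepspeed.initialize(
    model=model, model_parameters=model.parameters(), config=ds_config)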

leiwen83 commented 1 year ago

@Dominic789654 you may try my latest PR https://github.com/microsoft/DeepSpeed/pull/3629. This patch allows checkpoints to be loaded serially, so resuming training from a checkpoint does not produce a memory peak.

memray commented 1 year ago

@tjruwase Almost the same setting (finetuning llama 33b on 8*A100 40G, 670G RAM). It reports CUDA OOM while moving the model to the GPUs (33B requires at least 66GB of memory in fp16). Neither stage3_max_live_parameters nor offloading (to cpu or nvme) matters. For some reason, is_zero3_model at engine.py L1048 is False even though I enable ZeRO stage 3 in the config.

Initializing deepspeed took 18.02s
Traceback (most recent call last):
  File "train_deepspeed.py", line 323, in train
    model_engine, optimizer, _, scheduler = deepspeed.initialize(config=args.deepspeed_config, model=model,
  File "/export/home/project/llm/DeepSpeed/deepspeed/__init__.py", line 165, in initialize
    engine = DeepSpeedEngine(args=args,
  File "/export/home/project/llm/DeepSpeed/deepspeed/runtime/engine.py", line 267, in __init__
    self._configure_distributed_model(model)
  File "/export/home/project/llm/DeepSpeed/deepspeed/runtime/engine.py", line 1049, in _configure_distributed_model
    self.module.to(self.device)
  File "/export/share/ruimeng/env/anaconda/envs/codegen/lib/python3.8/site-packages/transformers/modeling_utils.py", line 1878, in to
    return super().to(*args, **kwargs)
  File "/export/share/ruimeng/env/anaconda/envs/codegen/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1145, in to
    return self._apply(convert)
  File "/export/share/ruimeng/env/anaconda/envs/codegen/lib/python3.8/site-packages/torch/nn/modules/module.py", line 797, in _apply
    module._apply(fn)
  File "/export/share/ruimeng/env/anaconda/envs/codegen/lib/python3.8/site-packages/torch/nn/modules/module.py", line 797, in _apply
    module._apply(fn)
  File "/export/share/ruimeng/env/anaconda/envs/codegen/lib/python3.8/site-packages/torch/nn/modules/module.py", line 797, in _apply
    module._apply(fn)
  [Previous line repeated 2 more times]
  File "/export/share/ruimeng/env/anaconda/envs/codegen/lib/python3.8/site-packages/torch/nn/modules/module.py", line 820, in _apply
    param_applied = fn(param)
  File "/export/share/ruimeng/env/anaconda/envs/codegen/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1143, in convert
    return t.to(device, dtype if t.is_floating_point() or t.is_complex() else None, non_blocking)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 228.00 MiB (GPU 6; 39.59 GiB total capacity; 38.14 GiB already allocated; 226.12 MiB free; 38.36 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

memray commented 1 year ago

My config file:

{
  "fp16": {
    "enabled": true,
    "auto_cast": false,
    "loss_scale": 0,
    "loss_scale_window": 1000,
    "initial_scale_power": 16,
    "hysteresis": 2,
    "min_loss_scale": 1
  },
  "optimizer": {
    "type": "AdamW",
    "params": {
      "betas": [
        0.9,
        0.999
      ],
      "eps": 1e-8,
      "weight_decay": 0.1
    }
  },
  "scheduler": {
    "type": "WarmupDecayLR",
    "params": {
      "warmup_min_lr": 0,
      "warmup_max_lr": 0.00004,
      "warmup_num_steps": 300,
      "warmup_type": "linear",
      "total_num_steps": 3000
    }
  },
  "zero_optimization": {
    "stage": 3,
    "contiguous_gradients": true,
    "overlap_comm": true,
    "reduce_scatter": true,
    "allgather_bucket_size": 5e8,
    "stage3_max_live_parameters": 1e9,
    "stage3_max_reuse_distance": 1e9,
    "reduce_bucket_size": "auto",
    "stage3_prefetch_bucket_size": "auto",
    "stage3_param_persistence_threshold": "auto",
    "sub_group_size": 1e11,
    "stage3_gather_16bit_weights_on_model_save": true,
    "offload_param": {
      "device": "cpu",
      "pin_memory": false
    },
    "offload_optimizer": {
      "device": "cpu",
      "pin_memory": false
    }
  },
  "gradient_clipping": 1,
  "steps_per_print": 10,
  "wall_clock_breakdown": false,
  "compression_training": {
    "weight_quantization": {
      "shared_parameters": {},
      "different_groups": {}
    },
    "activation_quantization": {
      "shared_parameters": {},
      "different_groups": {}
    },
    "sparse_pruning": {
      "shared_parameters": {},
      "different_groups": {}
    },
    "row_pruning": {
      "shared_parameters": {},
      "different_groups": {}
    },
    "head_pruning": {
      "shared_parameters": {},
      "different_groups": {}
    },
    "channel_pruning": {
      "shared_parameters": {},
      "different_groups": {}
    }
  },
  "train_batch_size": 128,
  "train_micro_batch_size_per_gpu": 1,
  "gradient_accumulation_steps": 16
}


djaym7 commented 1 year ago

Wow, this started in May and still hasn't closed, Deepspeed folks are really slow!

LuJunru commented 1 year ago

Has anyone found a solution? I am trying to finetune 33B on 8 * A100 40G with 900G RAM. It consumes 680GB during training, and it cannot save at the end because saving leads to OOM even with 900G RAM...

djaym7 commented 1 year ago

I can't even train a 3B model with the same config posted here.

LuJunru commented 1 year ago

@djaym7 I can train 3B, 7B and 13B in the same environment. These three models consume a normal amount of RAM, e.g. 100G ~ 200G. However, the 33B model consumes a dramatic amount of CPU RAM, over 600G. I think this is because the 33B model is larger than a single A100 (40G), which leads to unknown errors.

nrailg commented 1 year ago

Before the Llama implementation was merged into mega-ds, we implemented our own Llama in a private repo, and we found that you can train at most a 13B Llama without offloading on 8 x 40GB A100s. So I guess you just can't.

LuJunru commented 1 year ago

@nrailgun Have you tried it with offload? In my case, I offload the optimizer to RAM for 33B, and it does train smoothly. The issue occurs when saving.

djaym7 commented 1 year ago

I am likely doing something wrong, @LuJunru do you have your training code on git?

LuJunru commented 1 year ago

@djaym7 Not yet. I recommend following Alpaca: https://github.com/tatsu-lab/stanford_alpaca. Most of the settings are similar.

djaym7 commented 1 year ago

Thanks, I was trying DeepSpeed stage 1 and 2; I will try out FSDP in the Trainer too.

LuJunru commented 1 year ago

> Has anyone found a solution? I am trying to finetune 33B on 8 * A100 40G with 900G RAM. It consumes 680GB during training, and it cannot save at the end because saving leads to OOM even with 900G RAM...

@djaym7 OK, I found that another process was blocking my saving. I can briefly report that I used 750G ~ 800G RAM for training and saving (seq len 2048). The model can be finetuned on a single node with 8 * A100 40G. If you don't have that much RAM, try multiple nodes; DeepSpeed can split the RAM consumption across nodes.
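
For the multi-node route, the DeepSpeed launcher can take a hostfile; a rough sketch (hostnames, script name and config path are placeholders):

# hostfile, one line per node:
#   worker-1 slots=8
#   worker-2 slots=8
deepspeed --hostfile=hostfile train.py --deepspeed ds_config_zero3_offload.json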

memray commented 1 year ago

@LuJunru how did you make it work on 8*A100 40G? Do you use the same config as the one above?

LuJunru commented 1 year ago

@memray Exactly. I used deepspeed zero3 offloads + flash attention.

memray commented 1 year ago

@LuJunru I get a CUDA OOM error every time, even on 16-GPU nodes. It moves the model to the GPUs during initialization, even though I use stage 3. I will try flash attention.

│ /export/home/project/llm/DeepSpeed/deepspeed/runtime/engine.py:268 in        │
│ __init__                                                                     │
│                                                                              │
│    265 │   │   self.pipeline_parallelism = isinstance(model, PipelineModule) │
│    266 │   │                                                                 │
│    267 │   │   # Configure distributed model                                 │
│ ❱  268 │   │   self._configure_distributed_model(model)                      │
│    269 │   │                                                                 │
│    270 │   │   self._get_model_parameters()                                  │
│    271                                                                       │
│                                                                              │
│ /export/home/project/llm/DeepSpeed/deepspeed/runtime/engine.py:1069 in       │
│ _configure_distributed_model                                                 │
│                                                                              │
│   1066 │   │                                                                 │
│   1067 │   │   # zero.Init() handles device placement of model               │
│   1068 │   │   if not self.dont_change_device:                               │
│ ❱ 1069 │   │   │   self.module.to(self.device)                               │
│   1070 │   │                                                                 │
│   1071 │   │   # MoE related initialization                                  │
│   1072 │   │   for _, module in self.module.named_modules():                 │
│                     
LuJunru commented 1 year ago

@memray You could try following the official strategy; here's the one from HF https://huggingface.co/docs/transformers/main_classes/deepspeed#how-to-choose-which-zero-stage-and-offloads-to-use-for-best-performance:

First of all set batch size to 1 (you can always use gradient accumulation for any desired effective batch size).

1. Enable --gradient_checkpointing 1 (HF Trainer) or directly model.gradient_checkpointing_enable() - if OOM then
2. Try ZeRO stage 2 first - if OOM then
3. Try ZeRO stage 2 + offload_optimizer - if OOM then
4. Switch to ZeRO stage 3 - if OOM then
5. Enable offload_param to cpu - if OOM then
6. Enable offload_optimizer to cpu - if OOM then
7. If you still can't fit a batch size of 1, first check various default values and lower them if you can. For example, if you use generate and you don't use a wide search beam, make it narrower as it'd take a lot of memory.
8. Definitely use mixed half-precision over fp32 - so bf16 on Ampere and higher GPUs and fp16 on older GPU architectures.
9. If you still OOM you could add more hardware or enable ZeRO-Infinity - that is, switch the offloads offload_param and offload_optimizer to nvme. You need to make sure it's a very fast nvme. As an anecdote, I was able to infer BLOOM-176B on a tiny GPU using ZeRO-Infinity, except it was extremely slow. But it worked!

From my experience, it works at step 6.
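
For what it's worth, steps 1 and 6 with the HF Trainer look roughly like this (values are placeholders, not the exact settings used in this thread):

from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="out",
    per_device_train_batch_size=1,        # start from batch size 1
    gradient_accumulation_steps=16,       # scale the effective batch size here
    gradient_checkpointing=True,          # step 1
    fp16=True,                            # or bf16=True on Ampere and newer
    deepspeed="ds_config_zero3_offload.json",  # ZeRO-3 with cpu offloads (steps 4-6)
)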

s1ghhh commented 1 year ago

@LuJunru Hi, does this mean you have successfully finetuned a 33B-parameter model using ZeRO stage 3 + offloading the optimizer & params on 8 * A100 40G + 600G CPU RAM? I used 8 * A100 80G + 1T RAM, but still encountered CPU RAM OOM (exitcode: -9). Would you mind sharing your environment configuration, such as the versions of deepspeed, flash-attn, and cuda? Also, did you use bf16? Thank you very much!

LuJunru commented 1 year ago

@s1ghhh Sure. Here's some configs:

deepspeed: 0.9.2
torch: 2.0.1 (flash attention is included in it)
cuda: V11.3.109

I used about 800G CPU RAM with batch size 8 and gradient accumulation 2, and received a memory pressure warning. Reducing the batch size will help. I guess you could run with batch size 8 under 1T RAM.

s1ghhh commented 1 year ago

@LuJunru Many thanks! Would you mind sharing your DeepSpeed script, please? I have tried other scripts from this issue and DeepSpeed's official default script, but I am hoping to rule out any issues related to the DeepSpeed configuration. Thank you again for your willingness to share. In any case, I will make an effort to try it out and publish the results.

LuJunru commented 1 year ago

@s1ghhh I'm afraid I can't right now. We hope to release it next month.

s1ghhh commented 1 year ago

@LuJunru I understand your situation. Thanks again.

memray commented 1 year ago

@LuJunru thanks for sharing the information. My code gets stuck here (as shown below): it moves the whole model to GPU during initialization, so training hasn't even started. I don't really understand why it behaves this way... By the way, can you let me know which Huggingface checkpoint you are using? Is it huggyllama/llama-30b?

# zero.Init() handles device placement of model
if not self.dont_change_device:
    self.module.to(self.device)

LuJunru commented 1 year ago

@memray I have run into similar issues before. In my case, it was caused by the environment variable CUDA_LAUNCH_BLOCKING=1; not sure about yours. I fine-tuned Vicuna 33B.

memray commented 1 year ago

@LuJunru Thanks! But it didn't work out for me :( One last thing to confirm, are you doing full-model tuning or LoRA?

LuJunru commented 1 year ago

@memray

Full-model tuning

memray commented 1 year ago

@LuJunru really appreciate it! Do you mind sharing which codebase you work on, so I can refer to it for details? Also, are you loading Vicuna 33B using Hugging Face from_pretrained(), like lmsys/vicuna-33b-v1.3? I'm using the code below to load the model:

from transformers import AutoModelForCausalLM
from transformers.deepspeed import HfDeepSpeedConfig
dschf = HfDeepSpeedConfig(args.deepspeed_config)  # keep this object alive
model = AutoModelForCausalLM.from_pretrained('huggyllama/llama-30b')

But I run into strange errors like RuntimeError: NCCL Error 1: unhandled cuda error. I'm wondering whether the error stems from the integration of HF and DeepSpeed, so your successful experience is greatly appreciated.

Best, Rui

LuJunru commented 1 year ago

@memray

Hi Rui,

You can refer to: https://github.com/tatsu-lab/stanford_alpaca. I used the Trainer class from HF to load models, and just pass --deepspeed to add the DeepSpeed plugin (see the launch sketch below). Hope this can help you!

Junru
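
For example, with an Alpaca-style script the launch command is roughly (script name, model id and config path are placeholders):

torchrun --nproc_per_node=8 train.py \
    --model_name_or_path huggyllama/llama-30b \
    --output_dir ./output \
    --fp16 True \
    --gradient_checkpointing True \
    --deepspeed ds_config_zero3_offload.json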

cwzhao commented 1 year ago

Can we configure DeepSpeed to load only 2-3 copies of the model onto 8 GPUs, rather than loading 8 copies onto 8 GPUs?

LuJunru commented 1 year ago

Hi @s1ghhh and @memray, you can check my general scripts here: https://github.com/LuJunru/LLM_SFT/tree/main if you still need.

Honesty-of-the-Cavernous-Tissue commented 1 year ago

@Dominic789654 Actually it's not DeepSpeed's problem: I faced exactly the same issue when using ZeRO stage 2. Loading Llama-2-7b took 220+ GB RAM, and Llama-2-13b went OOM (my machine only has 250 GB RAM). The issue is that LlamaForCausalLM.from_pretrained(model_name_or_path) loads the shards on CPU by default; adding the parameter device_map="auto" resolves it. Moreover, the weights are loaded in fp32 by default, so you need to set torch_dtype=torch.float16 (see https://discuss.huggingface.co/t/llama-7b-gpu-memory-requirement/34323). Additionally, I also set low_cpu_mem_usage=True. Last, note that device_map="auto" and low_cpu_mem_usage=True do not work with ZeRO stage 3. Hope it works for you 😛
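
A minimal sketch of that loading path (the model id is a placeholder; as noted above, don't combine device_map="auto" or low_cpu_mem_usage=True with ZeRO stage 3):

import torch
from transformers import LlamaForCausalLM

# Load weights in fp16 and avoid materializing an extra fp32 copy in CPU RAM.
model = LlamaForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",   # placeholder model id
    torch_dtype=torch.float16,
    low_cpu_mem_usage=True,
    device_map="auto",
)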

Zui-C commented 11 months ago

> Has anyone found a solution? I am trying to finetune 33B on 8 * A100 40G with 900G RAM. It consumes 680GB during training, and it cannot save at the end because saving leads to OOM even with 900G RAM...
>
> @djaym7 OK, I found that another process was blocking my saving. I can briefly report that I used 750G ~ 800G RAM for training and saving (seq len 2048). The model can be finetuned on a single node with 8 * A100 40G. If you don't have that much RAM, try multiple nodes; DeepSpeed can split the RAM consumption across nodes.

@LuJunru Hi, we met the same issue that we cannot save at the end when we finetune 33B. Could you share more about how you solved it? Thanks!

LuJunru commented 11 months ago

> Has anyone found a solution? I am trying to finetune 33B on 8 * A100 40G with 900G RAM. It consumes 680GB during training, and it cannot save at the end because saving leads to OOM even with 900G RAM...
>
> @djaym7 OK, I found that another process was blocking my saving. I can briefly report that I used 750G ~ 800G RAM for training and saving (seq len 2048). The model can be finetuned on a single node with 8 * A100 40G. If you don't have that much RAM, try multiple nodes; DeepSpeed can split the RAM consumption across nodes.
>
> @LuJunru Hi, we met the same issue that we cannot save at the end when we finetune 33B. Could you share more about how you solved it? Thanks!

@Zui-C Hi, here's my saving function: https://github.com/LuJunru/LLM_SFT/blob/main/code/codes/train/train.py#L175.
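
For anyone else hitting the save-time RAM spike: one common workaround (a sketch, not necessarily what the linked function does) is to skip the in-memory 16-bit gather and consolidate the sharded ZeRO-3 checkpoint offline:

# Each rank writes only its own shard; no full-model gather in RAM.
model_engine.save_checkpoint("ckpt_dir")

# Later, offline, consolidate to a single fp32 state dict.
# DeepSpeed copies zero_to_fp32.py into the checkpoint directory:
#   python ckpt_dir/zero_to_fp32.py ckpt_dir pytorch_model.bin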