huggingface / accelerate

🚀 A simple way to launch, train, and use PyTorch models on almost any device and distributed configuration, automatic mixed precision (including fp8), and easy-to-configure FSDP and DeepSpeed support
https://huggingface.co/docs/accelerate
Apache License 2.0

GPU Memory Imbalance and OOM Errors During Training #2789

Closed DONGRYEOLLEE1 closed 1 month ago

DONGRYEOLLEE1 commented 3 months ago

System Info

- `Accelerate` version: 0.30.0
- Platform: Linux-6.5.0-27-generic-x86_64-with-glibc2.35
- `accelerate` bash location: /data/envs/tt/bin/accelerate
- Python version: 3.10.12
- Numpy version: 1.26.4
- PyTorch version (GPU?): 2.2.2+cu118 (True)
- PyTorch XPU available: False
- PyTorch NPU available: False
- PyTorch MLU available: False
- System RAM: 125.62 GB
- GPU type: NVIDIA RTX A6000
- `Accelerate` default config:
        - compute_environment: LOCAL_MACHINE
        - distributed_type: DEEPSPEED
        - use_cpu: False
        - debug: True
        - num_processes: 2
        - machine_rank: 0
        - num_machines: 1
        - rdzv_backend: static
        - same_network: True
        - main_training_function: main
        - enable_cpu_affinity: False
        - deepspeed_config: {'deepspeed_config_file': '/data/dev/', 'zero3_init_flag': True}
        - downcast_bf16: no
        - tpu_use_cluster: False
        - tpu_use_sudo: False
        - tpu_env: []
        - dynamo_config: {'dynamo_backend': 'EAGER', 'dynamo_mode': 'default', 'dynamo_use_dynamic': False, 'dynamo_use_fullgraph': False}

Information

Tasks

Reproduction

I was training a Llama3-8B-Instruct model with QLoRA. Training started successfully, but GPU memory was not allocated evenly, and I hit an OOM error before completing even 100 steps. Checking GPU memory during training, the imbalance appeared to grow even more severe; in my case, GPU 1 used far more memory than GPU 0.

I have previously trained with evenly balanced memory on an 8×A100 server, so I don't know why it is a problem here.

Below is the nvidia-smi output captured during training. The memory imbalance is severe!

+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.171.04             Driver Version: 535.171.04   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA RTX A6000               Off | 00000000:1F:00.0 Off |                  Off |
| 30%   58C    P2             145W / 300W |  13224MiB / 49140MiB |     40%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   1  NVIDIA RTX A6000               Off | 00000000:8B:00.0 Off |                  Off |
| 47%   71C    P2             221W / 300W |  32908MiB / 49140MiB |     73%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    0   N/A  N/A    207219      C   /data/envs/tt/bin/python                  13090MiB |
|    1   N/A  N/A    207219      C   /data/envs/tt/bin/python                  32774MiB |
+---------------------------------------------------------------------------------------+

This is my script:

import torch
from datasets import load_dataset
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig, TrainingArguments
from trl import SFTTrainer

quantization_config = BitsAndBytesConfig(
    load_in_4bit = True, 
    bnb_4bit_compute_dtype = torch.bfloat16, 
    bnb_4bit_quant_type = "nf4", 
    bnb_4bit_use_double_quant = True
)

MODEL_ID = "meta-llama/Meta-Llama-3-8B-Instruct"

tok = AutoTokenizer.from_pretrained(MODEL_ID)
tok.pad_token_id = tok.eos_token_id
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    quantization_config = quantization_config,
    device_map = 'auto'
)

data = load_dataset("...")  # dataset name omitted in the original report

# `process` builds the `text` field; its definition was omitted in the report.
proc_data = data.map(process, remove_columns = data['train'].column_names)

tokenized_proc_data = proc_data.map(lambda x: tok(x['text'], truncation = True, max_length = 2048), batched = True)
tokenized_proc_data = tokenized_proc_data.remove_columns("text")

lora_config = LoraConfig(
    r = 32,
    lora_alpha = 32,
    lora_dropout = 0.01,
    target_modules = "all-linear"
)

model = get_peft_model(model, lora_config)

args = TrainingArguments(
    num_train_epochs = 1,
    per_device_train_batch_size = 2,
    gradient_accumulation_steps = 4,
    learning_rate = 2e-8,
    logging_steps = 100,
    warmup_steps = 100,
    save_steps = 100,
    save_total_limit = 3,
    output_dir = "llama3-Ko-test",
    optim = "paged_adamw_8bit",
    bf16 = True,
    report_to = "wandb",
    run_name = "llama3-Ko-test",
)

def formatting_func(x):
    return [x]

model.is_parallelizable = True
model.model_parallel = True

trainer = SFTTrainer(
    model = model,
    args = args,
    train_dataset = tokenized_proc_data['train'],
    formatting_func = formatting_func,
)

trainer.train()

Expected behavior

How can I resolve the GPU memory imbalance issue?

DONGRYEOLLEE1 commented 3 months ago

I only changed the model to Llama-2; although the memory imbalance issue still exists, the training works under the conditions below.

What is different about the Llama3 series models?

How on earth can I fix this issue?


MODEL_ID = "meta-llama/Llama-2-7b-chat-hf"  # change a model
...
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.171.04             Driver Version: 535.171.04   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA RTX A6000               Off | 00000000:1F:00.0 Off |                  Off |
| 34%   60C    P2              93W / 300W |  35556MiB / 49140MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   1  NVIDIA RTX A6000               Off | 00000000:8B:00.0 Off |                  Off |
| 46%   74C    P2             289W / 300W |  46250MiB / 49140MiB |    100%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    0   N/A  N/A    278782      C   python                                    35422MiB |
|    1   N/A  N/A    278782      C   python                                    45988MiB |
+---------------------------------------------------------------------------------------+
muellerzr commented 3 months ago

I believe this is directly related to PEFT/LoRA, as when I did llama-3-FFT w/o that I did not get a CUDA OOM on 2x4090's, and usage was balanced. (Using FSDP). cc @SunMarc @Titus-von-Koeller @BenjaminBossan

SunMarc commented 3 months ago

Hi @DONGRYEOLLEE1, this is most probably a PEFT issue. After loading the model, is the model distributed evenly across the 2 GPUs?
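
For reference, a minimal sketch of one way to check this, assuming model is the quantized model loaded with device_map='auto' as in the reproduction script above:

# Sketch: inspect where Transformers/Accelerate placed the quantized model
# and how much memory each GPU is actually using. `model` comes from the
# reproduction script above.
import torch

print(model.hf_device_map)  # maps module names to devices ("cpu", 0, 1, ...)

for i in range(torch.cuda.device_count()):
    allocated = torch.cuda.memory_allocated(i) / 2**30
    reserved = torch.cuda.memory_reserved(i) / 2**30
    print(f"GPU {i}: allocated {allocated:.2f} GiB, reserved {reserved:.2f} GiB")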

DONGRYEOLLEE1 commented 3 months ago

@SunMarc

Hi @DONGRYEOLLEE1, this is most probably a PEFT issue. After loading the model, is the model distributed evenly across the 2 GPUs?

First of all, thank you very much for your reply.

The following shows the GPU memory status right after loading the Llama3 model.

| NVIDIA-SMI 535.171.04             Driver Version: 535.171.04   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA RTX A6000               Off | 00000000:1F:00.0 Off |                  Off |
| 30%   39C    P8              27W / 300W |   2212MiB / 49140MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   1  NVIDIA RTX A6000               Off | 00000000:8B:00.0 Off |                  Off |
| 30%   42C    P8              33W / 300W |   3990MiB / 49140MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    0   N/A  N/A    314290      C   /data/envs/llm_t/bin/python                2206MiB |
|    1   N/A  N/A    314290      C   /data/envs/llm_t/bin/python                3984MiB |
+---------------------------------------------------------------------------------------+
DONGRYEOLLEE1 commented 3 months ago

@muellerzr

I believe this is directly related to PEFT/LoRA, as when I did llama-3-FFT w/o that I did not get a CUDA OOM on 2x4090's, and usage was balanced. (Using FSDP). cc @SunMarc @Titus-von-Koeller @BenjaminBossan

Could you let me know the version of peft you used for fine-tuning?

In my case, I used a peft==0.10.0.

muellerzr commented 3 months ago

@DONGRYEOLLEE1 I did not use PEFT, hence what I meant by full fine-tuning with FSDP.

BenjaminBossan commented 3 months ago

I tried to reproduce but still have very little experience with DeepSpeed, so I may be doing something wrong. When I try to start the script with accelerate launch, I get:

ValueError: You can't train a model that has been loaded with device_map='auto' in any distributed mode

So @DONGRYEOLLEE1 did you just launch with python ...? If I do that, I also get imbalanced memory, but I'm not sure if this is using DS correctly.

when I did llama-3-FFT w/o that I did not get a CUDA OOM on 2x4090's, and usage was balanced.

Did you change anything else? As the model is bnb quantized, full fine-tuning should not work, right?
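
For context, that ValueError is raised because device_map='auto' splits a single model copy across both GPUs, which conflicts with a distributed launch. A commonly suggested alternative for multi-GPU QLoRA under accelerate launch (DDP) is to give each process its own full copy on its own GPU. A minimal sketch, reusing names from the reproduction script and assuming an accelerate launch run:

# Sketch only (not from this thread): place the whole 4-bit model on the GPU
# owned by the current process instead of sharding it with device_map="auto".
import torch
from accelerate import PartialState
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

quantization_config = BitsAndBytesConfig(
    load_in_4bit = True,
    bnb_4bit_compute_dtype = torch.bfloat16,
    bnb_4bit_quant_type = "nf4",
    bnb_4bit_use_double_quant = True
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-8B-Instruct",
    quantization_config = quantization_config,
    device_map = {"": PartialState().process_index},  # one full copy per process/GPU
)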

muellerzr commented 3 months ago

@BenjaminBossan I needed CPU offloading to get it working, so quite slow but no bnb/quantization was used.

DONGRYEOLLEE1 commented 3 months ago

@BenjaminBossan

I just launched it in a Jupyter notebook instead of running the script with python ....

In the end, I solved the issue using DeepSpeed + QLoRA for example.

And I tried actions such as changing the versions of PEFT and Accelerate, but the memory imbalance issue still exists when conducting training in Jupyter Notebook.

The following shows the GPU memory status when using the DS+QLoRA method. (batch_size = 2)

+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.171.04             Driver Version: 535.171.04   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA RTX A6000               Off | 00000000:1F:00.0 Off |                  Off |
| 43%   67C    P2             212W / 300W |  13532MiB / 49140MiB |     84%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   1  NVIDIA RTX A6000               Off | 00000000:8B:00.0 Off |                  Off |
| 47%   71C    P2             203W / 300W |  12328MiB / 49140MiB |    100%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    0   N/A  N/A      4026      C   /data/envs/llm_test/bin/python            13526MiB |
|    1   N/A  N/A      4027      C   /data/envs/llm_test/bin/python            12322MiB |
+---------------------------------------------------------------------------------------+
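
For readers trying to reproduce the working DeepSpeed + QLoRA setup described above, a rough sketch of the usual pattern: the quantized model is loaded without any device_map and the script is started with accelerate launch using a DeepSpeed config. The file name and flags below are assumptions, not taken from this thread.

# Sketch of the DeepSpeed + QLoRA pattern (assumptions, not the author's exact code).
# Started from a shell rather than a notebook, e.g.:
#   accelerate launch --config_file ds_config.yaml train.py
# where ds_config.yaml is an `accelerate config` file with DeepSpeed enabled.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

quantization_config = BitsAndBytesConfig(
    load_in_4bit = True,
    bnb_4bit_compute_dtype = torch.bfloat16,
    bnb_4bit_quant_type = "nf4",
    bnb_4bit_use_double_quant = True
)

# No device_map here: placement is handled per process by the Trainer/DeepSpeed
# integration rather than by splitting one copy across GPUs.
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-8B-Instruct",
    quantization_config = quantization_config,
    torch_dtype = torch.bfloat16,
)
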
BenjaminBossan commented 3 months ago

In the end, I solved the issue using DeepSpeed + QLoRA for example.

And I tried actions such as changing the versions of PEFT and Accelerate, but the memory imbalance issue still exists when conducting training in Jupyter Notebook.

Hmm, I'm confused, is the issue solved or not? :)

DONGRYEOLLEE1 commented 3 months ago

In the end, I solved the issue using DeepSpeed + QLoRA for example. And I tried actions such as changing the versions of PEFT and Accelerate, but the memory imbalance issue still exists when conducting training in Jupyter Notebook.

Hmm, I'm confused, is the issue solved or not? :)

Oh, this issue wasn't solved for my original script.

BenjaminBossan commented 3 months ago

Could you show us how you launch the script? Also, from the last nvidia-smi output you posted, memory usage is 13532MiB and 12328MiB. This looks rather fine to me, I wouldn't expect usage to be 100% identical. Or is that referring to something else?

DONGRYEOLLEE1 commented 3 months ago

Could you show us how you launch the script? Also, from the last nvidia-smi output you posted, memory usage is 13532MiB and 12328MiB. This looks rather fine to me, I wouldn't expect usage to be 100% identical. Or is that referring to something else?

My training script is provided in the reproduction section above.

  1. This is the state of the GPU shortly after the start of training.
|=========================================+======================+======================|
|   0  NVIDIA RTX A6000               Off | 00000000:1F:00.0 Off |                  Off |
| 30%   58C    P2             145W / 300W |  13224MiB / 49140MiB |     40%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   1  NVIDIA RTX A6000               Off | 00000000:8B:00.0 Off |                  Off |
| 47%   71C    P2             221W / 300W |  32908MiB / 49140MiB |     73%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
  2. This is the state of the GPU just before the OOM.
|=========================================+======================+======================|
|   0  NVIDIA RTX A6000               Off | 00000000:1F:00.0 Off |                  Off |
| 34%   60C    P2              93W / 300W |  35556MiB / 49140MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   1  NVIDIA RTX A6000               Off | 00000000:8B:00.0 Off |                  Off |
| 46%   74C    P2             289W / 300W |  46250MiB / 49140MiB |    100%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
  3. The 13.3 GB / 12.4 GB figures reflect the GPU status during training with the DeepSpeed + QLoRA methodology (it works well, without any imbalance!). While training with deepspeed did not show memory imbalance issues, using my script in Jupyter Notebook does result in such imbalance.
BenjaminBossan commented 3 months ago

My training script is provided in the reproduction section above.

Yes, I mean how do you launch the training script exactly?

3. While training with deepspeed did not show memory imbalance issues, using my script in Jupyter Notebook does result in such imbalance.

Thanks for clarifying. In that case, I don't think it's PEFT related. @muellerzr any idea why this could be? Is some setting not being passed correctly?

muellerzr commented 2 months ago

I'd need to see the entire notebook / a full reproducer / how you are launching it with notebook_launcher. There could be some odd interaction with torch; I can try to look into this a little.
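
For reference, a minimal sketch of how a notebook run is typically distributed across GPUs with Accelerate's notebook_launcher; training_loop is a placeholder for the setup in the notebook cells, not code from this thread:

from accelerate import notebook_launcher

def training_loop():
    # Placeholder: build the tokenizer, quantized model, and trainer here
    # (everything that currently runs at the top level of the notebook), then:
    # trainer.train()
    pass

# Spawns one process per GPU from inside the notebook; num_processes=2 is an assumption.
notebook_launcher(training_loop, args=(), num_processes=2)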

Paul-Richmond commented 2 months ago

I am also encountering this behaviour whilst trying to fine-tune Llama3-8B using QLoRA. However, in my case I'm not using DeepSpeed (at least there's no deepspeed_config entry in my accelerate config file). My script is launched with python3.

Here's the output from nvidia-smi during training: [nvidia-smi screenshot]

SunMarc commented 2 months ago

Hi @Paul-Richmond, could you print model.hf_device_map? The imbalance is quite strange since it only uses the second and third GPUs. Could you also share a minimal reproducer? Thanks!

Paul-Richmond commented 2 months ago

Hi @SunMarc, thanks for the quick reply! I'm running my script on an HPC cluster where I only request 2 GPUs from a node comprising 4 GPUs in total.

Here is a minimal reproducer script:

import os
from dotenv import load_dotenv
import wandb
import huggingface_hub
from datasets import load_dataset
from transformers import (AutoTokenizer,
                          DataCollatorForLanguageModeling,
                          AutoModelForCausalLM,
                          BitsAndBytesConfig,
                          TrainingArguments,
                          Trainer,
                          )
from transformers.optimization import get_cosine_with_min_lr_schedule_with_warmup
import torch
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

def create_labels(examples):
    examples["labels"] = examples["input_ids"].copy()
    return examples

def main():
    load_dotenv()
    HF_TOKEN = os.getenv("HUGGINGFACE_API_KEY")
    WANDB_TOKEN = os.getenv("WANDB_API_KEY")

    huggingface_hub.login(token=HF_TOKEN)
    wandb.login(key=WANDB_TOKEN)

    ds = load_dataset("yelp_review_full", split="train[:73047]")

    tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B")
    tokenizer.pad_token = tokenizer.eos_token
    tokenised_ds = ds.map(lambda examples: tokenizer(examples["text"],
                                                     padding="max_length",
                                                     max_length=720,
                                                     truncation=True),
                          batched=True,
                          remove_columns=ds.column_names)

    lm_dataset = tokenised_ds.map(create_labels, batched=True)
    train_dataset = lm_dataset

    data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)

    training_args = TrainingArguments(output_dir="hf",
                                      evaluation_strategy="no",
                                      per_device_train_batch_size=24,
                                      per_device_eval_batch_size=24,
                                      max_grad_norm=1.0,
                                      report_to="wandb",
                                      run_name="GPU_memory_imbalance",
                                      push_to_hub=False)

    quant_config = BitsAndBytesConfig(load_in_4bit=True,
                                      bnb_4bit_quant_type="nf4",
                                      bnb_4bit_quant_storage=None,
                                      bnb_4bit_compute_dtype="bfloat16",
                                      bnb_4bit_use_double_quant=True)

    lora_config = LoraConfig(r=8,
                             lora_alpha=32,
                             lora_dropout=0.05,
                             bias="none",
                             task_type="CAUSAL_LM",
                             target_modules=["up_proj",
                                             "down_proj",
                                             "gate_proj",
                                             "k_proj",
                                             "q_proj",
                                             "v_proj",
                                             "o_proj"])

    foundation_model = AutoModelForCausalLM.from_pretrained("meta-llama/Meta-Llama-3-8B",
                                                            device_map="auto",
                                                            trust_remote_code=True,
                                                            attn_implementation="flash_attention_2",
                                                            quantization_config=quant_config
                                                            )
    print(f"foundation_model hf_device_map: {foundation_model.hf_device_map}")
    model = prepare_model_for_kbit_training(foundation_model)
    print(f"prepare_model_for_kbit_training hf_device_map: {model.hf_device_map}")
    model = get_peft_model(model, lora_config)
    print(f"get_peft_model hf_device_map: {model.hf_device_map}")

    optimizer = torch.optim.AdamW(model.parameters(),
                                  lr=0.0003,
                                  weight_decay=0.1,
                                  betas=(0.9, 0.95),
                                  eps=1.0e-05)

    lr_scheduler = get_cosine_with_min_lr_schedule_with_warmup(optimizer,
                                                               num_training_steps=9132,
                                                               num_warmup_steps=91,
                                                               num_cycles=0.5,
                                                               last_epoch=-1,
                                                               min_lr=0.1)

    trainer = Trainer(model=model,
                      args=training_args,
                      train_dataset=train_dataset,
                      eval_dataset=None,
                      data_collator=data_collator,
                      optimizers=(optimizer, lr_scheduler)
                      )
    print(f"trainer hf_device_map: {trainer.model.hf_device_map}")
    trainer.train()
    huggingface_hub.logout()
    wandb.finish()

if __name__ == "__main__":
    main()

The result from model.hf_device_map is as follows: [screenshot of the device map] There does seem to be an imbalance, with 9 entries mapped to GPU 0 and 26 to GPU 1.

The nvidia-smi output is as before, only now GPUs 0 and 1 are being used:

[nvidia-smi screenshot]
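
One knob that is sometimes used to even out an automatic device map is max_memory, which caps what the placement algorithm may assign to each GPU. A hedged sketch, reusing names from the reproducer above; the 20GiB limits are placeholders, not tested values:

# Sketch only: cap per-GPU memory so device_map="auto" spreads the layers more
# evenly. `quant_config` is the BitsAndBytesConfig defined in the reproducer.
from transformers import AutoModelForCausalLM

foundation_model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-8B",
    device_map="auto",
    max_memory={0: "20GiB", 1: "20GiB"},  # placeholder limits
    quantization_config=quant_config,
    attn_implementation="flash_attention_2",
    trust_remote_code=True,
)
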
SunMarc commented 2 months ago

Thanks for the reproducer @Paul-Richmond ! I'll keep you updated !

github-actions[bot] commented 1 month ago

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.