huggingface / peft

đŸ¤— PEFT: State-of-the-art Parameter-Efficient Fine-Tuning.
https://huggingface.co/docs/peft
Apache License 2.0

GPU Allocation Issue (QLoRa + Llama3-8B-IT) #1716

Closed DONGRYEOLLEE1 closed 5 months ago

DONGRYEOLLEE1 commented 6 months ago

System Info

peft: 0.10.1.dev0
accelerate: 0.30.0
bitsandbytes: 0.43.1
transformers: 4.39.3
GPU: A6000 * 2 (96GB)
nvidia-driver version: 535.171.04
cuda: 11.8

Who can help?

No response


Reproduction

I was training a Llama3-8B-IT model with QLoRA. The training itself succeeded, but GPU memory wasn't evenly allocated across the two GPUs. Is this a version issue with peft or transformers? Or a version issue with the graphics driver? On a previous A100*8 server the memory was allocated evenly, so I don't know what is causing the imbalance in this case.

This is my script.

import torch
from datasets import load_dataset
from peft import LoraConfig, get_peft_model
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

# 4-bit NF4 quantization with bf16 compute (QLoRA)
quantization_config = BitsAndBytesConfig(
    load_in_4bit = True,
    bnb_4bit_compute_dtype = torch.bfloat16,
    bnb_4bit_quant_type = "nf4",
    bnb_4bit_use_double_quant = True
)

MODEL_ID = "meta-llama/Meta-Llama-3-8B-Instruct"

tok = AutoTokenizer.from_pretrained(MODEL_ID)
tok.pad_token_id = tok.eos_token_id
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    quantization_config = quantization_config,
    device_map = 'auto'
)

data = load_dataset("...")

proc_data = data.map(process, remove_columns = data['train'].column_names)

tokenized_proc_data = proc_data.map(lambda x: tok(x['text'], truncation = True, max_length = 2048), batched = True)
tokenized_proc_data = tokenized_proc_data.remove_columns("text")

lora_config = LoraConfig(
    r = 16,
    lora_alpha = 32,
    lora_dropout = 0.01,
    target_modules = "all-linear"
)

model = get_peft_model(model, lora_config)

train_args_trainer = TrainingArguments(
    num_train_epochs = 3,
    per_device_train_batch_size = 1,
    gradient_accumulation_steps = 2,
    learning_rate = 2e-8,
    logging_steps = 100,
    warmup_steps = 100,
    save_total_limit = 3,
    output_dir = "llama3-7b-4bit-lora-test2",
    optim = "paged_adamw_32bit",
    bf16 = True,
    report_to = "wandb",
    run_name = "llama3-7b-4bit-lora-test2",
    remove_unused_columns=False
)

model.is_parallelizable = True
model.model_parallel = True

trainer = Trainer(
    model = model,
    tokenizer = tok,
    args = train_args_trainer,
    train_dataset = tokenized_proc_data['train'],
    data_collator = DataCollatorForLanguageModeling(tok, mlm = False)
)

trainer.train()

nvidia-smi output during training:

Wed May  8 06:54:12 2024
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.171.04             Driver Version: 535.171.04   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA RTX A6000               Off | 00000000:1F:00.0 Off |                  Off |
| 30%   58C    P2             145W / 300W |  13224MiB / 49140MiB |     40%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   1  NVIDIA RTX A6000               Off | 00000000:8B:00.0 Off |                  Off |
| 47%   71C    P2             221W / 300W |  32908MiB / 49140MiB |     73%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    0   N/A  N/A    207219      C   /data/envs/tt/bin/python                  13090MiB |
|    1   N/A  N/A    207219      C   /data/envs/tt/bin/python                  32774MiB |
+---------------------------------------------------------------------------------------+
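
For reference, a minimal sketch of how the placement could be inspected right after from_pretrained (hf_device_map is the per-module map that accelerate records when device_map = 'auto' is used; I have not included its output here):

import torch

# how accelerate split the model across the two GPUs
print(model.hf_device_map)

# PyTorch-allocated memory per GPU (in MiB), to compare against nvidia-smi
for i in range(torch.cuda.device_count()):
    print(i, torch.cuda.memory_allocated(i) // 2**20, "MiB")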

Expected behavior

I want the GPUs to be evenly allocated.
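
One workaround I have been considering is to pass an explicit per-GPU budget via max_memory so that device_map = 'auto' is forced to spread the weights more evenly. A minimal sketch (not verified on this setup; the limits are placeholders, not tuned values):

model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    quantization_config = quantization_config,
    device_map = 'auto',
    # placeholder caps: leave headroom on each 48GB card for activations and optimizer state
    max_memory = {0: "22GiB", 1: "22GiB"}
)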

BenjaminBossan commented 6 months ago

Hmm, hard to say, and I can't easily reproduce this. Do you already see strange behavior right after loading the model, before training starts? If you try without PEFT, do you see the same issue? (If the full model doesn't fit into memory without PEFT, you could e.g. turn off autograd on most of the layers to "simulate" parameter-efficient fine-tuning, as in the sketch below.) If you do, this could be an accelerate issue.
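
Something along these lines would be a rough sketch (assuming the model is loaded in bf16 without quantization; the module names below are those of LlamaForCausalLM):

import torch
from transformers import AutoModelForCausalLM

MODEL_ID = "meta-llama/Meta-Llama-3-8B-Instruct"

# full model in bf16, no PEFT, split across both GPUs
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    torch_dtype = torch.bfloat16,
    device_map = "auto"
)

# freeze everything ...
for param in model.parameters():
    param.requires_grad_(False)

# ... then unfreeze only a small subset (here: the last decoder block),
# so the trainable footprint is roughly comparable to a LoRA run
for param in model.model.layers[-1].parameters():
    param.requires_grad_(True)

The same Trainer setup as in the script above can then be reused to check whether the memory imbalance is still there.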