huggingface / trl

Train transformer language models with reinforcement learning.
http://hf.co/docs/trl
Apache License 2.0

multi-gpu training #2256

Closed · innat closed this issue 3 days ago

innat commented 1 week ago

System Info

python:3.12.4
transformers:4.45.2
trl:0.11.4
huggingface_hub: 0.25.2
accelerate:1.0.1

Reproduction

I am trying to fine-tune Llama on multiple GPUs using the trl library. While training, I noticed that gpu:0 is actively computing while the other GPUs sit idle, even though their VRAM is consumed. This feels unexpected; I assumed all GPUs would be busy during training. I searched for related issues but none fit my case.

Here is the relevant code.

from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
    HfArgumentParser,
    TrainingArguments,
    pipeline,
    logging,
)
from peft import (
    LoraConfig,
    PeftModel,
    prepare_model_for_kbit_training,
    get_peft_model,
)
import os, torch, wandb
from datasets import load_dataset
from trl import SFTTrainer, setup_chat_format

Checking whether torch sees all the devices, and it does.

torch.cuda.device_count()
3

Checking with accelerate (this looks suspicious).

from accelerate import Accelerator, PartialState

accelerator = Accelerator()
print(f"Using {accelerator.num_processes} processes.")
print(f"Process index: {accelerator.process_index}")
Using 1 processes.
Process index: 0
device_string = PartialState().process_index
device_string
0
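
For reference, the single process above is a direct consequence of launching the script with plain python: Accelerate then runs everything in one process. A minimal sketch of the same check under a hypothetical distributed launch such as accelerate launch --num_processes 3 train.py, where each spawned process reports its own index:

# Sketch only: same check, but run under `accelerate launch --num_processes 3 train.py`.
# Each of the three processes would print its own index (0, 1, 2) instead of a single 0.
from accelerate import PartialState

state = PartialState()
print(f"{state.num_processes} process(es), this process index: {state.process_index}")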

Defining the model and setting device_map="auto".

model_id = "meta-llama/Meta-Llama-3-8B-Instruct"

torch_dtype = torch.float16
attn_implementation = "eager"

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch_dtype,
    bnb_4bit_use_double_quant=True,
)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",
    attn_implementation=attn_implementation
)
tokenizer = AutoTokenizer.from_pretrained(model_id)
model, tokenizer = setup_chat_format(model, tokenizer)
peft_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
    target_modules=[
        'up_proj', 'down_proj', 'gate_proj', 
        'k_proj', 'q_proj', 'v_proj', 'o_proj'
    ]
)
model = get_peft_model(model, peft_config)
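
As a side note, if the goal were data parallelism (each GPU holding a full copy of the 4-bit model and training on its own shard of the data), a commonly suggested pattern for QLoRA is to pin the model to the current process's GPU instead of using device_map="auto". A minimal sketch, assuming the script would then be launched with accelerate launch rather than plain python:

# Sketch of a DDP-style alternative (assumes a multi-process `accelerate launch`).
# Every process loads the whole quantized model onto its own GPU instead of
# sharding the layers across GPUs.
from accelerate import PartialState

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map={"": PartialState().process_index},  # pin all weights to this process's GPU
    attn_implementation=attn_implementation,
)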

Defining the training arguments. I set gradient_checkpointing_kwargs={"use_reentrant": False} because I read in another issue that it is required.

training_arguments = TrainingArguments(
    output_dir='result',
    per_device_train_batch_size=3*3,
    per_device_eval_batch_size=1,
    gradient_accumulation_steps=2,
    optim="paged_adamw_32bit",
    num_train_epochs=1,
    evaluation_strategy="steps",
    eval_steps=0.2,
    logging_steps=1,
    warmup_steps=10,
    logging_strategy="steps",
    learning_rate=2e-4,
    fp16=True,
    bf16=False,
    group_by_length=True,
    report_to="wandb",
    gradient_checkpointing_kwargs={"use_reentrant": False}
)
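
Worth noting about these numbers: per_device_train_batch_size counts samples per training process, so the effective batch size per optimizer step depends on how the script is launched. A small illustrative calculation (the 3-process case is hypothetical):

# Illustrative arithmetic only.
per_device = 3 * 3   # per_device_train_batch_size above
grad_accum = 2       # gradient_accumulation_steps above
for num_processes in (1, 3):
    print(f"{num_processes} process(es): {per_device * grad_accum * num_processes} samples per optimizer step")
# 1 process (this setup):              9 * 2 * 1 = 18
# 3-process DDP launch (hypothetical): 9 * 2 * 3 = 54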

Checking the model's device map. It looks OK; the layers are placed on multiple GPUs (0, 1, 2).

for k,v in model.hf_device_map.items():
    print(k,v)
model.embed_tokens 0
model.layers.0 0
model.layers.1 0
model.layers.2 1
model.layers.3 1
model.layers.4 1
model.layers.5 1
model.layers.6 1
model.layers.7 1
model.layers.8 1
model.layers.9 1
model.layers.10 1
model.layers.11 1
model.layers.12 1
model.layers.13 1
model.layers.14 1
model.layers.15 1
model.layers.16 2
model.layers.17 2
model.layers.18 2
model.layers.19 2
model.layers.20 2
model.layers.21 2
model.layers.22 2
model.layers.23 2
model.layers.24 2
model.layers.25 2
model.layers.26 2
model.layers.27 2
model.layers.28 2
model.layers.29 2
model.layers.30 2
model.layers.31 2
model.norm 2
model.rotary_emb 2
lm_head 2
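
A compact way to summarize that map is to tally modules per GPU (the counts below are just the entries listed above, counted):

# Count how many modules device_map="auto" placed on each GPU.
from collections import Counter

print(Counter(model.hf_device_map.values()))
# Counter({2: 19, 1: 14, 0: 3}) for the map printed above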

Starting training.
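(The dataset used below is presumably created earlier with load_dataset; that step is not shown.)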

trainer = SFTTrainer(
    model=model,
    train_dataset=dataset["train"],
    eval_dataset=dataset["test"],
    peft_config=peft_config,
    max_seq_length=512,
    dataset_text_field="text",
    tokenizer=tokenizer,
    args=training_arguments,
    packing= False,
)
trainer.train()

GPU usage

(Two monitoring screenshots omitted: they show gpu:0 under active compute while gpu:1 and gpu:2 sit mostly idle, even though VRAM is allocated on all three.)
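
For a text-based check of the same thing from inside the training process, a minimal sketch (torch.cuda.utilization queries NVML, so it assumes pynvml is installed):

# Sketch only: print per-GPU compute utilization and memory allocated by this process.
import torch

for i in range(torch.cuda.device_count()):
    util = torch.cuda.utilization(i)              # percent, as reported by NVML
    mem = torch.cuda.memory_allocated(i) / 2**30  # GiB allocated by this process
    print(f"gpu:{i} utilization={util}% allocated={mem:.1f} GiB")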

Expected behavior

At this point, I'm not sure if all GPUs are working as expected or not.

innat commented 3 days ago

Here is what it means (src):

Multiple GPUs, or “model parallelism”, can be utilized but only one GPU will be active at any given moment. This forces the GPU to wait for the previous GPU to send it the output. You should launch your script normally with Python instead of other tools like torchrun and accelerate launch.

You may also be interested in pipeline parallelism which utilizes all available GPUs at once, instead of only having one GPU active at a time. This approach is less flexible though. For more details, refer to the Memory-efficient pipeline parallelism guide.
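
To make the quoted behavior concrete, a toy sketch (not the actual Llama model) of why this kind of naive model parallelism keeps only one GPU busy at a time: each stage can only run once the previous stage has produced its output, so the GPUs compute sequentially.

# Toy illustration only: layers split across GPUs the way device_map="auto" does.
# During each step of the loop, only the GPU holding the current stage computes;
# the others wait for the activations to reach them.
import torch
import torch.nn as nn

stages = [nn.Linear(1024, 1024).to(f"cuda:{i}") for i in range(torch.cuda.device_count())]

x = torch.randn(8, 1024, device="cuda:0")
for i, stage in enumerate(stages):
    x = stage(x.to(f"cuda:{i}"))  # only cuda:i is busy here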