huggingface / trl

Train transformer language models with reinforcement learning.
http://hf.co/docs/trl
Apache License 2.0

multi-gpu training #2256

Closed · innat closed this issue 3 days ago

innat commented 1 week ago

System Info

python:3.12.4
transformers:4.45.2
trl:0.11.4
huggingface_hub: 0.25.2
accelerate:1.0.1

Reproduction

I am trying to fine-tune Llama on multiple GPUs using the trl library. While training, I noticed that gpu:0 is actively computing while the other GPUs sit idle, even though their VRAM is consumed. This feels unexpected; I assumed all GPUs would be busy during training. I searched for related issues but none fit my case.

Here is the relevant code.

from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
    HfArgumentParser,
    TrainingArguments,
    pipeline,
    logging,
)
from peft import (
    LoraConfig,
    PeftModel,
    prepare_model_for_kbit_training,
    get_peft_model,
)
import os, torch, wandb
from datasets import load_dataset
from trl import SFTTrainer, setup_chat_format

Checking whether torch sees all the devices, and it does.

torch.cuda.device_count()
3

Checking with accelerate (this looks suspicious).

from accelerate import Accelerator, PartialState

accelerator = Accelerator()
print(f"Using {accelerator.num_processes} processes.")
print(f"Process index: {accelerator.process_index}")
Using 1 processes.
Process index: 0
device_string = PartialState().process_index
device_string
0
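
For reference, the single process above is a direct consequence of launching the script with plain python: Accelerate then runs everything in one process. A minimal sketch of the same check under a hypothetical distributed launch such as accelerate launch --num_processes 3 train.py, where each spawned process reports its own index:

# Sketch only: same check, but run under `accelerate launch --num_processes 3 train.py`.
# Each of the three processes would print its own index (0, 1, 2) instead of a single 0.
from accelerate import PartialState

state = PartialState()
print(f"{state.num_processes} process(es), this process index: {state.process_index}")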

Defining the model and setting device_map="auto".

model_id = "meta-llama/Meta-Llama-3-8B-Instruct"

torch_dtype = torch.float16
attn_implementation = "eager"

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch_dtype,
    bnb_4bit_use_double_quant=True,
)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",
    attn_implementation=attn_implementation
)
tokenizer = AutoTokenizer.from_pretrained(model_id)
model, tokenizer = setup_chat_format(model, tokenizer)
peft_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
    target_modules=[
        'up_proj', 'down_proj', 'gate_proj', 
        'k_proj', 'q_proj', 'v_proj', 'o_proj'
    ]
)
model = get_peft_model(model, peft_config)
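
As a side note, if the goal were data parallelism (each GPU holding a full copy of the 4-bit model and training on its own shard of the data), a commonly suggested pattern for QLoRA is to pin the model to the current process's GPU instead of using device_map="auto". A minimal sketch, assuming the script would then be launched with accelerate launch rather than plain python:

# Sketch of a DDP-style alternative (assumes a multi-process `accelerate launch`).
# Every process loads the whole quantized model onto its own GPU instead of
# sharding the layers across GPUs.
from accelerate import PartialState

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map={"": PartialState().process_index},  # pin all weights to this process's GPU
    attn_implementation=attn_implementation,
)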

Defining the training arguments. I set gradient_checkpointing_kwargs={"use_reentrant": False} because I read in another issue that it is required.

training_arguments = TrainingArguments(
    output_dir='result',
    per_device_train_batch_size=3*3,
    per_device_eval_batch_size=1,
    gradient_accumulation_steps=2,
    optim="paged_adamw_32bit",
    num_train_epochs=1,
    evaluation_strategy="steps",
    eval_steps=0.2,
    logging_steps=1,
    warmup_steps=10,
    logging_strategy="steps",
    learning_rate=2e-4,
    fp16=True,
    bf16=False,
    group_by_length=True,
    report_to="wandb",
    gradient_checkpointing_kwargs={"use_reentrant": False}
)
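
Worth noting about these numbers: per_device_train_batch_size counts samples per training process, so the effective batch size per optimizer step depends on how the script is launched. A small illustrative calculation (the 3-process case is hypothetical):

# Illustrative arithmetic only.
per_device = 3 * 3   # per_device_train_batch_size above
grad_accum = 2       # gradient_accumulation_steps above
for num_processes in (1, 3):
    print(f"{num_processes} process(es): {per_device * grad_accum * num_processes} samples per optimizer step")
# 1 process (this setup):              9 * 2 * 1 = 18
# 3-process DDP launch (hypothetical): 9 * 2 * 3 = 54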

Checking the model's device map. It looks OK; the layers are placed on multiple GPUs (0, 1, 2).

for k,v in model.hf_device_map.items():
    print(k,v)
model.embed_tokens 0
model.layers.0 0
model.layers.1 0
model.layers.2 1
model.layers.3 1
model.layers.4 1
model.layers.5 1
model.layers.6 1
model.layers.7 1
model.layers.8 1
model.layers.9 1
model.layers.10 1
model.layers.11 1
model.layers.12 1
model.layers.13 1
model.layers.14 1
model.layers.15 1
model.layers.16 2
model.layers.17 2
model.layers.18 2
model.layers.19 2
model.layers.20 2
model.layers.21 2
model.layers.22 2
model.layers.23 2
model.layers.24 2
model.layers.25 2
model.layers.26 2
model.layers.27 2
model.layers.28 2
model.layers.29 2
model.layers.30 2
model.layers.31 2
model.norm 2
model.rotary_emb 2
lm_head 2
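
A compact way to summarize that map is to tally modules per GPU (the counts below are just the entries listed above, counted):

# Count how many modules device_map="auto" placed on each GPU.
from collections import Counter

print(Counter(model.hf_device_map.values()))
# Counter({2: 19, 1: 14, 0: 3}) for the map printed above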

Starting training.
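(The dataset used below is presumably created earlier with load_dataset; that step is not shown.)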

trainer = SFTTrainer(
    model=model,
    train_dataset=dataset["train"],
    eval_dataset=dataset["test"],
    peft_config=peft_config,
    max_seq_length=512,
    dataset_text_field="text",
    tokenizer=tokenizer,
    args=training_arguments,
    packing= False,
)
trainer.train()

GPU usage

(Two monitoring screenshots omitted: they show gpu:0 under active compute while gpu:1 and gpu:2 sit mostly idle, even though VRAM is allocated on all three.)
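
For a text-based check of the same thing from inside the training process, a minimal sketch (torch.cuda.utilization queries NVML, so it assumes pynvml is installed):

# Sketch only: print per-GPU compute utilization and memory allocated by this process.
import torch

for i in range(torch.cuda.device_count()):
    util = torch.cuda.utilization(i)              # percent, as reported by NVML
    mem = torch.cuda.memory_allocated(i) / 2**30  # GiB allocated by this process
    print(f"gpu:{i} utilization={util}% allocated={mem:.1f} GiB")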

Expected behavior

At this point, I'm not sure if all GPUs are working as expected or not.

innat commented 3 days ago

Here is what it means (src):

Multiple GPUs, or “model parallelism”, can be utilized but only one GPU will be active at any given moment. This forces the GPU to wait for the previous GPU to send it the output. You should launch your script normally with Python instead of other tools like torchrun and accelerate launch.

You may also be interested in pipeline parallelism which utilizes all available GPUs at once, instead of only having one GPU active at a time. This approach is less flexible though. For more details, refer to the Memory-efficient pipeline parallelism guide.
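
To make the quoted behavior concrete, a toy sketch (not the actual Llama model) of why this kind of naive model parallelism keeps only one GPU busy at a time: each stage can only run once the previous stage has produced its output, so the GPUs compute sequentially.

# Toy illustration only: layers split across GPUs the way device_map="auto" does.
# During each step of the loop, only the GPU holding the current stage computes;
# the others wait for the activations to reach them.
import torch
import torch.nn as nn

stages = [nn.Linear(1024, 1024).to(f"cuda:{i}") for i in range(torch.cuda.device_count())]

x = torch.randn(8, 1024, device="cuda:0")
for i, stage in enumerate(stages):
    x = stage(x.to(f"cuda:{i}"))  # only cuda:i is busy here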