huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
https://huggingface.co/transformers

Falcon model training on multiple GPUs #34492

Open · BigDataMLexplorer opened this issue 2 weeks ago

BigDataMLexplorer commented 2 weeks ago

System Info

Who can help?

@ArthurZucker

Information

Tasks

Reproduction

Hello, I have a Jupyter notebook in which I train several models. When I load a model, I use device_map="auto" to split it across multiple (4) GPUs. After that, I use the Trainer, and it handles the parallel training automatically. This works for every model except Falcon (7B and 11B); for the other models, parallel training starts automatically and all 4 GPUs are used.

What should I do, please? I am posting part of my code and the error message:

from transformers import AutoModelForSequenceClassification, TrainingArguments, Trainer

num_labels = 11
# Split the model across the available GPUs at load time
model = AutoModelForSequenceClassification.from_pretrained(
    "Huggingface_models/Falcon2 11b", num_labels=num_labels, device_map="auto"
)
from peft import LoraConfig
lora_config = LoraConfig(
    r = 16,
    lora_alpha = 8, 
    target_modules = "all-linear",
    lora_dropout = 0.05, 
    bias = 'none', 
    task_type = 'SEQ_CLS'
)

from peft import prepare_model_for_kbit_training, get_peft_model
model = prepare_model_for_kbit_training(model)

model = get_peft_model(model, lora_config)
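
# Optional sanity check at this point (assumption: model is the PEFT-wrapped model from above);
# print_trainable_parameters() reports how many parameters the LoRA adapter actually trains.
model.print_trainable_parameters()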
training_args = TrainingArguments(
    output_dir=".....",
    save_strategy="steps",
    eval_strategy="steps",
    eval_steps=half_steps_per_epoch//2,
    save_steps=half_steps_per_epoch//2,
    learning_rate=1e-4,
    lr_scheduler_type="linear",
    #gradient_accumulation_steps=2,
    #gradient_checkpointing=True,
    per_device_train_batch_size=2,
    #per_device_eval_batch_size=16,
    num_train_epochs=2,            
    weight_decay=0.01,
    #dataloader_num_workers=4,
    #logging_steps=500,
    load_best_model_at_end=True, 
    fp16=True,
    #warmup_ratio=0.1,
    save_total_limit=4,
    #report_to="tensorboard"
)

from transformers import EarlyStoppingCallback

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=Train_tokenized,
    eval_dataset=Eval_tokenized,
    compute_metrics=compute_metrics,
    callbacks=[EarlyStoppingCallback(early_stopping_patience=1)],
)

train_result = trainer.train()
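
For reference, a quick way to confirm how device_map="auto" spread the layers across the 4 GPUs (a minimal sketch, assuming the model object loaded above; hf_device_map is populated by from_pretrained when a device map is used):

# Show which GPU each module was placed on
for module_name, device in model.hf_device_map.items():
    print(f"{module_name} -> {device}")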

ERROR:

RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:3 and cuda:0! (when checking argument for argument target in method wrapper_CUDA_nll_loss_forward)
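
A hedged workaround sketch, not a fix for the underlying issue: since the message says the labels sit on a different GPU than the tensor used for the loss, one option is to keep the labels out of the model call and compute the loss manually on the logits' device by overriding Trainer.compute_loss. DeviceSafeTrainer is a made-up name, and the snippet reuses the objects from the script above:

import torch
from transformers import Trainer

class DeviceSafeTrainer(Trainer):
    def compute_loss(self, model, inputs, return_outputs=False, **kwargs):
        labels = inputs.pop("labels")  # keep labels out of the model's own loss path
        outputs = model(**inputs)
        logits = outputs.logits
        # Move the labels onto whichever GPU the logits landed on before computing the loss
        loss = torch.nn.functional.cross_entropy(logits, labels.to(logits.device))
        return (loss, outputs) if return_outputs else loss

Swapping Trainer for DeviceSafeTrainer in the trainer construction above would exercise this path.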

Expected behavior

Training should run in parallel across all 4 GPUs, just as it does for the other models.

Rocketknight1 commented 2 weeks ago

Hi @BigDataMLexplorer, these questions are usually better suited to the forums or the Discord!

BigDataMLexplorer commented 2 weeks ago

No, this is a bug. Your page says this should work and it does not. Please do something about it.

Rocketknight1 commented 2 weeks ago

I'm sorry, I misread! I didn't realize the code worked with other models and only failed with Falcon. Can you give us a minimal reproducer (code that we can copy-paste to trigger the issue on our systems)? Also cc @muellerzr, since this seems like an Accelerate thing.

BigDataMLexplorer commented 1 week ago

@Rocketknight1 Hi, the code I use is in the message above; just tokenize some of your own data. The versions of the libraries I use are also listed there. Thanks
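
For illustration, an untested minimal-reproducer sketch along those lines (synthetic two-label data; the public tiiuae/falcon-7b checkpoint stands in for the local path, and the pad-token handling is an assumption):

from datasets import Dataset
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
)

model_name = "tiiuae/falcon-7b"  # stand-in for the local Falcon checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token  # Falcon ships without a pad token

# Tiny synthetic classification dataset
ds = Dataset.from_dict({"text": ["some example text"] * 32, "label": [0, 1] * 16})
ds = ds.map(lambda x: tokenizer(x["text"], truncation=True, padding="max_length", max_length=32))

model = AutoModelForSequenceClassification.from_pretrained(
    model_name, num_labels=2, device_map="auto"
)
model.config.pad_token_id = tokenizer.pad_token_id

args = TrainingArguments(output_dir="falcon_repro", per_device_train_batch_size=2,
                         num_train_epochs=1, fp16=True)
Trainer(model=model, args=args, train_dataset=ds).train()
# Expected to hit the cuda:x / cuda:0 device mismatch on a multi-GPU machine.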

BigDataMLexplorer commented 1 week ago

@Rocketknight1 @muellerzr Please, do you have any idea why this does not work for the Falcon model but works for other models like Llama3, Nemo, Mistral, Phi3, and so on?