huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
https://huggingface.co/transformers

Falcon model training on multiple GPUs #34492

Open · BigDataMLexplorer opened this issue 2 weeks ago

BigDataMLexplorer commented 2 weeks ago

System Info

Who can help?

@ArthurZucker

Information

Tasks

Reproduction

Hello, I have a Jupyter notebook in which I train several models. When I load a model, I use device_map="auto" to split it across multiple (4) GPUs. After that, I use the Trainer, and it handles the parallel training automatically. This works for every model except Falcon (7B and 11B); for the other models, parallel training starts automatically and all 4 GPUs are used.

What should I do, please? I am posting part of my code and the error message:

from transformers import AutoModelForSequenceClassification, TrainingArguments, Trainer

num_labels = 11
# Split the model across the available GPUs at load time
model = AutoModelForSequenceClassification.from_pretrained(
    "Huggingface_models/Falcon2 11b", num_labels=num_labels, device_map="auto"
)
from peft import LoraConfig
lora_config = LoraConfig(
    r = 16,
    lora_alpha = 8, 
    target_modules = "all-linear",
    lora_dropout = 0.05, 
    bias = 'none', 
    task_type = 'SEQ_CLS'
)

from peft import prepare_model_for_kbit_training, get_peft_model
model = prepare_model_for_kbit_training(model)

model = get_peft_model(model, lora_config)
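
# Optional sanity check at this point (assumption: model is the PEFT-wrapped model from above);
# print_trainable_parameters() reports how many parameters the LoRA adapter actually trains.
model.print_trainable_parameters()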
training_args = TrainingArguments(
    output_dir=".....",
    save_strategy="steps",
    eval_strategy="steps",
    eval_steps=half_steps_per_epoch//2,
    save_steps=half_steps_per_epoch//2,
    learning_rate=1e-4,
    lr_scheduler_type="linear",
    #gradient_accumulation_steps=2,
    #gradient_checkpointing=True,
    per_device_train_batch_size=2,
    #per_device_eval_batch_size=16,
    num_train_epochs=2,            
    weight_decay=0.01,
    #dataloader_num_workers=4,
    #logging_steps=500,
    load_best_model_at_end=True, 
    fp16=True,
    #warmup_ratio=0.1,
    save_total_limit=4,
    #report_to="tensorboard"
)

from transformers import EarlyStoppingCallback

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=Train_tokenized,
    eval_dataset=Eval_tokenized,
    compute_metrics=compute_metrics,
    callbacks=[EarlyStoppingCallback(early_stopping_patience=1)],
)

train_result = trainer.train()
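
For reference, a quick way to confirm how device_map="auto" spread the layers across the 4 GPUs (a minimal sketch, assuming the model object loaded above; hf_device_map is populated by from_pretrained when a device map is used):

# Show which GPU each module was placed on
for module_name, device in model.hf_device_map.items():
    print(f"{module_name} -> {device}")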

ERROR:

RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:3 and cuda:0! (when checking argument for argument target in method wrapper_CUDA_nll_loss_forward)
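
A hedged workaround sketch, not a fix for the underlying issue: since the message says the labels sit on a different GPU than the tensor used for the loss, one option is to keep the labels out of the model call and compute the loss manually on the logits' device by overriding Trainer.compute_loss. DeviceSafeTrainer is a made-up name, and the snippet reuses the objects from the script above:

import torch
from transformers import Trainer

class DeviceSafeTrainer(Trainer):
    def compute_loss(self, model, inputs, return_outputs=False, **kwargs):
        labels = inputs.pop("labels")  # keep labels out of the model's own loss path
        outputs = model(**inputs)
        logits = outputs.logits
        # Move the labels onto whichever GPU the logits landed on before computing the loss
        loss = torch.nn.functional.cross_entropy(logits, labels.to(logits.device))
        return (loss, outputs) if return_outputs else loss

Swapping Trainer for DeviceSafeTrainer in the trainer construction above would exercise this path.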

Expected behavior

Training should run in parallel across all 4 GPUs, just as it does for the other models.

Rocketknight1 commented 2 weeks ago

Hi @BigDataMLexplorer, these questions are usually better suited to the forums or the Discord!

BigDataMLexplorer commented 2 weeks ago

No, this is a bug. Your page says this should work and it does not. Please do something about it.

Rocketknight1 commented 2 weeks ago

I'm sorry, I misread! I didn't realize the code worked with other models and only failed with Falcon. Can you give us a minimal reproducer (code that we can copy-paste to trigger the issue on our systems)? Also cc @muellerzr, since this seems like an Accelerate thing.

BigDataMLexplorer commented 1 week ago

@Rocketknight1 Hi, the code I use is in the message above; just tokenize some of your own data. The versions of the libraries I use are also listed there. Thanks
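
For illustration, an untested minimal-reproducer sketch along those lines (synthetic two-label data; the public tiiuae/falcon-7b checkpoint stands in for the local path, and the pad-token handling is an assumption):

from datasets import Dataset
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
)

model_name = "tiiuae/falcon-7b"  # stand-in for the local Falcon checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token  # Falcon ships without a pad token

# Tiny synthetic classification dataset
ds = Dataset.from_dict({"text": ["some example text"] * 32, "label": [0, 1] * 16})
ds = ds.map(lambda x: tokenizer(x["text"], truncation=True, padding="max_length", max_length=32))

model = AutoModelForSequenceClassification.from_pretrained(
    model_name, num_labels=2, device_map="auto"
)
model.config.pad_token_id = tokenizer.pad_token_id

args = TrainingArguments(output_dir="falcon_repro", per_device_train_batch_size=2,
                         num_train_epochs=1, fp16=True)
Trainer(model=model, args=args, train_dataset=ds).train()
# Expected to hit the cuda:x / cuda:0 device mismatch on a multi-GPU machine.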

BigDataMLexplorer commented 1 week ago

@Rocketknight1 @muellerzr Please, do you have any idea why this does not work for the Falcon model but works for other models like Llama3, Nemo, Mistral, Phi3, and so on?