huggingface / peft

🤗 PEFT: State-of-the-art Parameter-Efficient Fine-Tuning.
https://huggingface.co/docs/peft
Apache License 2.0

Low performance on MPS backend #2041

Open yaroslavyaroslav opened 2 weeks ago

yaroslavyaroslav commented 2 weeks ago

System Info

accelerate         0.33.0
peft               0.12.0
Python 3.12.5
macOS: 15.0
MacBook pro M1 Pro 16 gb

Who can help?

Information

Tasks

Reproduction

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, Trainer, TrainingArguments
from datasets import load_dataset
from peft import get_peft_model, LoraConfig
from transformers import TrainerCallback
import logging

# Set up logging
logging.basicConfig(level=logging.INFO)
print(torch.backends.mps.is_available())   # Should return True if MPS is available
print(torch.backends.mps.is_built())

torch.mps.empty_cache()  # free cached MPS memory; torch.cuda.empty_cache() is a no-op without CUDA
# Load Gemma2-2B model and tokenizer
model_name = "mlx-community/gemma-2-2b-it"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, attn_implementation='eager')

# Set the device to MPS (Apple Silicon)
device = torch.device("mps")
model.to(device)
print(next(model.parameters()).device)  # Should show "mps:0"

# Prepare the model for PEFT using LoRA
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.1,
    target_modules=[
        "self_attn.k_proj",
        "self_attn.q_proj",
        "self_attn.v_proj",
        "self_attn.o_proj",
        "mlp.down_proj",
        "mlp.gate_proj",
        "mlp.up_proj"
    ],  # Specify your target modules here
)
model = get_peft_model(model, lora_config)

# Load your JSONL datasets

dataset = load_dataset('json', data_files={
    'train': '/path-to-dataset/user_data/train.jsonl',
    'test': '/path-to-dataset/user_data/test.jsonl',
    'validation': '/path-to-dataset/user_data/valid.jsonl'
})

# Ensure the columns are named 'input' for compatibility
dataset = dataset.rename_column("text", "input")

# Tokenization function
def tokenize_function(examples):
    tokens = tokenizer(examples["input"], truncation=True, padding="max_length", max_length=128)
    # For causal LM training, labels mirror input_ids; the model shifts them internally.
    tokens["labels"] = [list(ids) for ids in tokens["input_ids"]]
    return tokens

# Tokenize the datasets
tokenized_datasets = dataset.map(tokenize_function, batched=True)

# Remove the original input column if no longer needed
tokenized_datasets = tokenized_datasets.remove_columns(["input"])

# Set up training arguments
training_args = TrainingArguments(
    output_dir="./output",
    per_device_train_batch_size=2,
    num_train_epochs=3,
    learning_rate=5e-5,
    weight_decay=0.01,
    logging_dir="./logs",
    logging_steps=25,
    eval_strategy="steps",
    gradient_accumulation_steps=2,  # Simulate larger batch size
    eval_steps=500,
    save_total_limit=2,
    load_best_model_at_end=True,
    metric_for_best_model='loss',  # or choose your preferred metric
    report_to="none"
)

# Example of using logging in your custom Trainer or adding callbacks
class LoggingCallback(TrainerCallback):
    def on_epoch_end(self, args, state, control, **kwargs):
        logging.info(f"Epoch {state.epoch} completed.")

    def on_step_end(self, args, state, control, **kwargs):
        logging.info(f"Step {state.global_step} completed.")

# Initialize Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],  # Include validation dataset
    callbacks=[LoggingCallback()]
)

# Start fine-tuning
trainer.train()

Expected behavior

This pipeline utilises the GPU at 10-15% while the CPU is utilised at 30-50%.

The mlx framework with roughly the same LoRA training setup on the same model utilises the GPU two to three times more.

Such low utilisation leads to quite slow training progress compared to the mlx run.
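
To put a number on the slowdown beyond Activity Monitor readings, per-step wall-clock timing with an explicit MPS sync is probably the simplest check. This is just a rough sketch, not what I actually ran; model, optimizer, train_loader and device stand in for a plain DataLoader/AdamW setup over the same data:

import time
import torch

def average_step_time(model, train_loader, optimizer, device, n_steps=20):
    # MPS kernels are launched asynchronously, so synchronize before reading the clock.
    model.train()
    times = []
    for step, batch in enumerate(train_loader):
        if step >= n_steps:
            break
        batch = {k: v.to(device) for k, v in batch.items()}
        if device.type == "mps":
            torch.mps.synchronize()
        start = time.perf_counter()
        loss = model(**batch).loss
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
        if device.type == "mps":
            torch.mps.synchronize()
        times.append(time.perf_counter() - start)
    return sum(times) / len(times)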

BenjaminBossan commented 2 weeks ago

Thanks for reporting this issue.

I don't have a Mac to try to reproduce the issue, so I cannot really help you here. Honestly, I don't know much about MPS in general and how well it is supported by PyTorch. Still, maybe you could provide some further information, and perhaps other users who see this issue can give further advice:

  1. Is this observation specific to PEFT training? For instance, if you do full fine-tuning, do you see that the difference disappears?
  2. How much slower is the training on the same task?
  3. Could you provide the code for the MLX training?

PS: Please don't ping "saya", they're not related to this project.

yaroslavyaroslav commented 2 weeks ago

how well it is supported by PyTorch

Honestly, it's awful. I mean, it exists in some sense, but the list of missing primitives for MPS is huge and it doesn't seem to be getting any shorter. So if this lib leverages PyTorch as the MPS backend, that's bad luck for me.

But anyway, I raised this issue as a starting point because, as far as I understand, this lib leverages the accelerate lib, which is something like a back-end managing layer for all of the different GPU-related stuff, and if so it's quite clear that the pain point comes from there. Am I right about that?
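
For what it's worth, PyTorch can at least report the fallbacks itself: with PYTORCH_ENABLE_MPS_FALLBACK=1 set, unsupported ops run on the CPU and emit a UserWarning instead of raising. Something like this rough sketch (run_one_step is a placeholder for a single forward/backward pass) should list the offending operators, assuming the warnings surface on the Python side:

# Run as: PYTORCH_ENABLE_MPS_FALLBACK=1 python train.py
import warnings

def report_mps_fallbacks(run_one_step):
    # Collect the "will fall back to run on the CPU" warnings PyTorch emits for unsupported MPS ops.
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")
        run_one_step()
    for message in sorted({str(w.message) for w in caught if "MPS backend" in str(w.message)}):
        print(message)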

PS: sorry for the "saya" mention, it was a GitHub completion failure 😅

BenjaminBossan commented 2 weeks ago

So what you're saying is that the slowness stems from the lackluster support of MPS in PyTorch, and as PEFT is using PyTorch, slow MPS performance is expected. Is that right? If there are specific operations in PEFT that could be replaced with alternatives that are more efficient for MPS, let us know; apart from that, I don't think there is much we can do.

this lib leveraging accelerate lib which is smth like back-end managing layer for all of the different gpu related stuff

I'd say not quite. PEFT uses a few functions from accelerate, e.g. for moving tensors on and off devices if the base model requires it, but apart from that it is pretty much independent of accelerate. Also, accelerate is not so much a "back-end managing layer for all of the different GPU-related stuff" as a library that provides seamless integration of (mostly) training features, like parallelization, dealing with large models, mixed precision, etc. Managing devices is just a "side effect" of dealing with those.

If you use PEFT with Trainer from transformers, that will also use accelerate under the hood, but if you use something else for training, accelerate won't be involved. But in the end, I don't think accelerate really has any bearing on the performance of MPS.
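
To make that concrete: a PEFT model trains in a plain PyTorch loop with no accelerate in the picture at all. A minimal sketch, reusing model, tokenized_datasets and device from your script and assuming only input_ids/attention_mask/labels columns remain:

import torch
from torch.utils.data import DataLoader

train_loader = DataLoader(tokenized_datasets["train"].with_format("torch"), batch_size=2, shuffle=True)
# Only the LoRA parameters require gradients, so only they go to the optimizer.
optimizer = torch.optim.AdamW((p for p in model.parameters() if p.requires_grad), lr=5e-5)

model.train()
for batch in train_loader:
    batch = {k: v.to(device) for k, v in batch.items()}
    loss = model(**batch).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()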

yaroslavyaroslav commented 2 weeks ago

If there are specific operations in PEFT that could be replaced with alternatives that are more efficient for MPS

Yeah, I'm going to dig through this sooner rather than later to find out which operations fall back to the CPU in PEFT.
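
Something along these lines is my rough plan for the profiling side (untested sketch; torch.mps.profiler emits an OS Signpost trace that can be recorded with the Metal System Trace template in Xcode Instruments, and batch is a placeholder for one prepared batch):

import torch

# Capture an OS Signpost trace of one forward/backward pass to inspect per-op GPU vs CPU activity.
with torch.mps.profiler.profile(mode="interval", wait_until_completed=False):
    loss = model(**batch).loss
    loss.backward()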

Thank you for the thorough overview of the PEFT stack; it will help with the next step of debugging.

I'm not sure how long-open issues are treated here. I'd keep it open so it stays an eyesore for me, but if it's annoying for you, feel free to close it and I'll open a new one later.

BenjaminBossan commented 1 week ago

All right. We can keep this open for the time being, maybe it helps get some eyes on the topic.