huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

ValueError: No columns in the dataset match the model's forward method signature when using SFTTrainer and DataParallel. #33119

Open llCurious opened 1 month ago

llCurious commented 1 month ago

System Info

Who can help?

@muellerzr @SunMarc @ArthurZucker

Information

Tasks

Reproduction

MODEL = "google/gemma-2-2b-it"

model = AutoModelForCausalLM.from_pretrained(
      MODEL, torch_dtype="auto", device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(MODEL)

# if we do not use DataParallel, everything works fine.
# If DataParallel is used, ValueError: No columns in the dataset match the model's forward method signature.
if torch.cuda.device_count() > 1:
        model = torch.nn.DataParallel(model)
print(model)

alpaca_prompt = """Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.
### Instruction:
{}
### Input:
{}
### Response:
{}"""
EOS_TOKEN = tokenizer.eos_token  # Must add EOS_TOKEN

def formatting_prompts_func(examples):
    instructions = examples["instruction"]
    inputs = examples["input"]
    outputs = examples["output"]
    texts = []
    for instruction, input, output in zip(instructions, inputs, outputs):
        text = alpaca_prompt.format(instruction, input, output) + EOS_TOKEN
        texts.append(text)
    return {
        "text": texts,
    }

dataset = load_dataset("yahma/alpaca-cleaned", split="train")
dataset = dataset.map(formatting_prompts_func, batched=True, num_proc=4)

from trl import SFTTrainer
from transformers import TrainingArguments

train_args = TrainingArguments(
    per_device_train_batch_size=2,
    gradient_accumulation_steps=8,
    # Use num_train_epochs = 1, warmup_ratio for full training runs!
    # warmup_steps=20,
    max_steps=10,
    # num_train_epochs=2,
    learning_rate=5e-5,
    logging_steps=1,
    optim="adamw_8bit",
    weight_decay=0.01,
    lr_scheduler_type="linear",
    seed=3407,
)
trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    dataset_text_field="text",
    max_seq_length=MAX_SEQ_LENGTH,
    dataset_num_proc=2,
    args=train_args,
)

Expected behavior

Expected: the ValueError (No columns in the dataset match the model's forward method signature.) is not raised, and training proceeds the same way it does without DataParallel.

It seems to me the error occurs because DataParallel wraps the model.

However, I am not sure how the preprocessing logic in SFTTrainer is meant to handle a wrapped model.
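
For reference, the mismatch can be reproduced by inspecting the forward signature before and after wrapping. A minimal sketch (gpt2 is just a small stand-in checkpoint; the printed lists are illustrative):

import inspect

import torch
from transformers import AutoModelForCausalLM

# Before wrapping, the forward signature exposes the usual column names.
model = AutoModelForCausalLM.from_pretrained("gpt2")
print(list(inspect.signature(model.forward).parameters))
# e.g. ['input_ids', 'past_key_values', 'attention_mask', ...]

# After wrapping, DataParallel.forward only exposes *inputs, **kwargs,
# so the Trainer's column filtering finds no matching dataset columns.
wrapped = torch.nn.DataParallel(model)
print(list(inspect.signature(wrapped.forward).parameters))
# e.g. ['inputs', 'kwargs']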

muellerzr commented 1 month ago

The underlying Trainer wraps the model. There should be an arg to enable DP (will try and find it in a moment)
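
In the meantime, a sketch of a likely workaround (not verified on this exact setup): pass the unwrapped model to SFTTrainer and let Trainer apply nn.DataParallel itself when it detects several GPUs, reusing the variables from the reproduction above.

# Sketch: do not wrap the model manually; Trainer inspects the plain
# forward signature and wraps with nn.DataParallel internally when
# more than one GPU is available.
model = AutoModelForCausalLM.from_pretrained(MODEL, torch_dtype="auto")

trainer = SFTTrainer(
    model=model,  # plain model, no torch.nn.DataParallel(...)
    tokenizer=tokenizer,
    train_dataset=dataset,
    dataset_text_field="text",
    max_seq_length=MAX_SEQ_LENGTH,
    args=train_args,
)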

wbuchanan commented 4 weeks ago

It might be useful if the error message could provide the forward method's signature so users would know what columns need to exist in the dataset object.
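
Until then, a quick diagnostic sketch that compares the two sides by hand, using the model and dataset variables from the reproduction above:

import inspect

# Compare what the (possibly wrapped) model's forward accepts
# with what the dataset actually provides.
expected = set(inspect.signature(model.forward).parameters)
available = set(dataset.column_names)
print("expected by forward:", sorted(expected))
print("present in dataset :", sorted(available))
print("overlap            :", sorted(expected & available))  # empty when DataParallel-wrapped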

SunMarc commented 4 weeks ago

Thanks for the feedback @wbuchanan! Would you like to submit a PR to add this information?

wbuchanan commented 4 weeks ago

@SunMarc if I knew where to find the information programmatically I could try, but it isn't clear where the information would be located.

SunMarc commented 4 weeks ago

Right here: https://github.com/huggingface/transformers/blob/e259d6d1e0d2acfa3c2f84b11c9bfa97e64b984d/src/transformers/trainer.py#L840. You can just add the signature_columns variable to the error message!
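
For what it's worth, the change could look roughly like this inside Trainer._remove_unused_columns (the existing message is paraphrased here; only the signature_columns part would be new):

if len(columns) == 0:
    raise ValueError(
        "No columns in the dataset match the model's forward method signature. "
        f"The model's forward method expects: {signature_columns}. "
        f"The following columns have been ignored: {ignored_columns}. "
        "Please check the dataset and model, or set remove_unused_columns=False "
        "in TrainingArguments."
    )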

github-actions[bot] commented 1 day ago

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.