huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

ValueError: No columns in the dataset match the model's forward method signature when using SFTTrainer and DataParallel. #33119

Open llCurious opened 1 month ago

llCurious commented 1 month ago

System Info

Who can help?

@muellerzr @SunMarc @ArthurZucker

Information

Tasks

Reproduction

MODEL = "google/gemma-2-2b-it"

model = AutoModelForCausalLM.from_pretrained(
      MODEL, torch_dtype="auto", device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(MODEL)

# if we do not use DataParallel, everything works fine.
# If DataParallel is used, ValueError: No columns in the dataset match the model's forward method signature.
if torch.cuda.device_count() > 1:
        model = torch.nn.DataParallel(model)
print(model)

alpaca_prompt = """Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.
### Instruction:
{}
### Input:
{}
### Response:
{}"""
EOS_TOKEN = tokenizer.eos_token  # Must add EOS_TOKEN

def formatting_prompts_func(examples):
    instructions = examples["instruction"]
    inputs = examples["input"]
    outputs = examples["output"]
    texts = []
    for instruction, input, output in zip(instructions, inputs, outputs):
        text = alpaca_prompt.format(instruction, input, output) + EOS_TOKEN
        texts.append(text)
    return {
        "text": texts,
    }

dataset = load_dataset("yahma/alpaca-cleaned", split="train")
dataset = dataset.map(formatting_prompts_func, batched=True, num_proc=4)

from trl import SFTTrainer
from transformers import TrainingArguments

train_args = TrainingArguments(
    per_device_train_batch_size=2,
    gradient_accumulation_steps=8,
    # Use num_train_epochs = 1, warmup_ratio for full training runs!
    # warmup_steps=20,
    max_steps=10,
    # num_train_epochs=2,
    learning_rate=5e-5,
    logging_steps=1,
    optim="adamw_8bit",
    weight_decay=0.01,
    lr_scheduler_type="linear",
    seed=3407,
)
trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    dataset_text_field="text",
    max_seq_length=MAX_SEQ_LENGTH,
    dataset_num_proc=2,
    args=train_args,
)

Expected behavior

Expected: the ValueError (No columns in the dataset match the model's forward method signature.) is not raised, and training proceeds the same way it does without DataParallel.

It seems to me the error occurs because DataParallel wraps the model.

However, I am not sure how the preprocessing logic in SFTTrainer is meant to handle a wrapped model.
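
For reference, the mismatch can be reproduced by inspecting the forward signature before and after wrapping. A minimal sketch (gpt2 is just a small stand-in checkpoint; the printed lists are illustrative):

import inspect

import torch
from transformers import AutoModelForCausalLM

# Before wrapping, the forward signature exposes the usual column names.
model = AutoModelForCausalLM.from_pretrained("gpt2")
print(list(inspect.signature(model.forward).parameters))
# e.g. ['input_ids', 'past_key_values', 'attention_mask', ...]

# After wrapping, DataParallel.forward only exposes *inputs, **kwargs,
# so the Trainer's column filtering finds no matching dataset columns.
wrapped = torch.nn.DataParallel(model)
print(list(inspect.signature(wrapped.forward).parameters))
# e.g. ['inputs', 'kwargs']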

muellerzr commented 1 month ago

The underlying Trainer wraps the model. There should be an arg to enable DP (will try and find it in a moment)
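
In the meantime, a sketch of a likely workaround (not verified on this exact setup): pass the unwrapped model to SFTTrainer and let Trainer apply nn.DataParallel itself when it detects several GPUs, reusing the variables from the reproduction above.

# Sketch: do not wrap the model manually; Trainer inspects the plain
# forward signature and wraps with nn.DataParallel internally when
# more than one GPU is available.
model = AutoModelForCausalLM.from_pretrained(MODEL, torch_dtype="auto")

trainer = SFTTrainer(
    model=model,  # plain model, no torch.nn.DataParallel(...)
    tokenizer=tokenizer,
    train_dataset=dataset,
    dataset_text_field="text",
    max_seq_length=MAX_SEQ_LENGTH,
    args=train_args,
)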

wbuchanan commented 4 weeks ago

It might be useful if the error message could provide the forward method's signature so users would know what columns need to exist in the dataset object.
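
Until then, a quick diagnostic sketch that compares the two sides by hand, using the model and dataset variables from the reproduction above:

import inspect

# Compare what the (possibly wrapped) model's forward accepts
# with what the dataset actually provides.
expected = set(inspect.signature(model.forward).parameters)
available = set(dataset.column_names)
print("expected by forward:", sorted(expected))
print("present in dataset :", sorted(available))
print("overlap            :", sorted(expected & available))  # empty when DataParallel-wrapped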

SunMarc commented 4 weeks ago

Thanks for the feedback @wbuchanan! Would you like to submit a PR to add this information?

wbuchanan commented 4 weeks ago

@SunMarc if I knew where to find the information programmatically I could try, but it isn't clear where the information would be located.

SunMarc commented 4 weeks ago

Right here: https://github.com/huggingface/transformers/blob/e259d6d1e0d2acfa3c2f84b11c9bfa97e64b984d/src/transformers/trainer.py#L840. You can just add the signature_columns variable to the error message!
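
For what it's worth, the change could look roughly like this inside Trainer._remove_unused_columns (the existing message is paraphrased here; only the signature_columns part would be new):

if len(columns) == 0:
    raise ValueError(
        "No columns in the dataset match the model's forward method signature. "
        f"The model's forward method expects: {signature_columns}. "
        f"The following columns have been ignored: {ignored_columns}. "
        "Please check the dataset and model, or set remove_unused_columns=False "
        "in TrainingArguments."
    )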

github-actions[bot] commented 1 day ago

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.