huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

Trainer: To keep unused columns for `compute_metrics` #31570

Open sadra-barikbin opened 1 month ago

sadra-barikbin commented 1 month ago

Hi there!

Currently, columns not used by the model are removed in self.get_*_dataloader() when the dataloader is created, but one might want them available in compute_metrics (when include_inputs_for_metrics=True).

My use case is fine-tuning on prompt-completion pairs, where I use the tokenizer's token_type_ids as a mask to compute accuracy only on the completion tokens.

To this end, the best way I've come up with is to keep that column in the dataset and dataloader using remove_unused_columns=False, and then remove it in self._prepare_inputs() by overriding that method, as sketched below.

Is there a better way to achieve this? More generally, wouldn't it be better to move the column-removal logic into self._prepare_inputs() if it only serves as a gatekeeper for model(**inputs)?
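
A minimal sketch of that workaround (the subclass name is illustrative, not part of the library; it assumes remove_unused_columns=False so the column survives the dataloader, and strips it just before the model call):

from transformers import Trainer

class KeepColumnsTrainer(Trainer):
    # Keep token_type_ids in the dataset/dataloader (remove_unused_columns=False),
    # but drop it here so it never reaches model(**inputs).
    def _prepare_inputs(self, inputs):
        inputs = super()._prepare_inputs(inputs)
        inputs.pop("token_type_ids", None)
        return inputs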

amyeroberts commented 1 month ago

cc @muellerzr @SunMarc

not-lain commented 1 month ago

@sadra-barikbin For the Trainer API you should specify remove_unused_columns=False in the TrainingArguments. Check out https://huggingface.co/docs/transformers/main/en/main_classes/trainer#transformers.TrainingArguments.remove_unused_columns for more info.

sadra-barikbin commented 1 month ago

With only remove_unused_columns=False, model(**inputs) rejects those additional columns and raises an error.

sadra-barikbin commented 1 month ago

Minimal reproduction:

from typing import Dict, List
from transformers import Trainer, TrainingArguments, AutoTokenizer, AutoModelForCausalLM
from transformers import DataCollatorForLanguageModeling
from datasets import Dataset

model = AutoModelForCausalLM.from_pretrained("EleutherAI/pythia-14m")
tokenizer = AutoTokenizer.from_pretrained("EleutherAI/pythia-14m")
tokenizer.pad_token = tokenizer.eos_token

data = Dataset.from_dict({'prompt':['A', 'B'], 'completion': ['a', 'b']})

def tokenize(example: Dict[str, str]) -> Dict[str, List[int]]:
  # Tokenize prompt and completion as a pair; token_type_ids then marks
  # completion tokens with 1, which is the mask needed in compute_metrics.
  return tokenizer(example['prompt'], f"{example['completion']}.", return_token_type_ids=True)

dataset = data.map(tokenize, remove_columns=['prompt', 'completion'])

args = TrainingArguments(
    output_dir="test",
    report_to='none',
    remove_unused_columns=False,
    max_steps=3,
)

trainer = Trainer(
    model=model,
    tokenizer=tokenizer,
    args=args,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
    train_dataset=dataset,
)

trainer.train()

This fails with:

TypeError: GPTNeoXForCausalLM.forward() got an unexpected keyword argument 'token_type_ids'
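
For context, the completion-only accuracy described in the first comment could be computed roughly like this once token_type_ids is available alongside predictions and labels (a sketch only; the function name and the way the arrays would reach compute_metrics are not part of the current API):

import numpy as np

def completion_accuracy(logits: np.ndarray, labels: np.ndarray, token_type_ids: np.ndarray) -> float:
    # For a causal LM, logits at position i predict the token at position i + 1,
    # so shift predictions, targets, and the mask by one before comparing.
    predictions = logits.argmax(-1)[:, :-1]
    targets = labels[:, 1:]
    mask = token_type_ids[:, 1:].astype(bool)  # 1 marks completion tokens
    correct = (predictions == targets) & mask
    return float(correct.sum() / mask.sum())
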
github-actions[bot] commented 5 days ago

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.