Open · sadra-barikbin opened this issue 1 month ago
Hi there!

Currently, columns not used by the model are removed in `self.get_*_dataloader()` upon data loader creation, but one might want to have them in `compute_metrics` (when `include_inputs_for_metrics=True`). My case is fine-tuning on prompt-completion pairs, where I use the tokenizer's `token_type_ids` as a mask to compute accuracy only on the completion tokens.

To this end, the best way I've come up with is to keep that column in the dataset & data loader using `remove_unused_columns=False` and then remove it in `self._prepare_inputs()` by overriding it. Is there a better way to achieve this? Generally, isn't it better to move the column-removal logic to `self._prepare_inputs()` if it serves only as the gatekeeper for `model(**inputs)`?

cc @muellerzr @SunMarc
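For concreteness, the masking described above could look roughly like this inside `compute_metrics`, assuming token-level predictions, labels, and the kept `token_type_ids` arrive as equally shaped `(batch, seq)` NumPy arrays (the helper name is illustrative, not from the issue):

```python
import numpy as np

def completion_token_accuracy(predictions, labels, token_type_ids):
    # token_type_ids == 1 marks the second sequence passed to the
    # tokenizer, i.e. the completion tokens.
    completion_mask = token_type_ids[:, 1:] == 1
    # Causal-LM shift: the prediction at position t targets token t + 1.
    correct = (predictions[:, :-1] == labels[:, 1:]) & completion_mask
    return correct.sum() / completion_mask.sum()
```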
@sadra-barikbin For the `Trainer` API you should specify `remove_unused_columns=False` in the `TrainingArguments`. Check out https://huggingface.co/docs/transformers/main/en/main_classes/trainer#transformers.TrainingArguments.remove_unused_columns for more info.
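For background on what the default behaviour removes: `Trainer` inspects the model's forward signature and drops dataset columns whose names don't match one of its parameters. A simplified sketch of the idea, not the actual implementation:

```python
import inspect

def drop_unused_columns(dataset, model):
    # Simplified sketch: keep only dataset columns whose names match
    # parameters of model.forward; everything else is removed before
    # the DataLoader is built.
    signature_columns = set(inspect.signature(model.forward).parameters)
    unused = [c for c in dataset.column_names if c not in signature_columns]
    return dataset.remove_columns(unused)
```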
`model(**inputs)` rejects those additional columns and raises an error when only `remove_unused_columns=False` is used.
Minimal reproduction:

```python
from typing import Dict, List

from datasets import Dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

model = AutoModelForCausalLM.from_pretrained("EleutherAI/pythia-14m")
tokenizer = AutoTokenizer.from_pretrained("EleutherAI/pythia-14m")
tokenizer.pad_token = tokenizer.eos_token

data = Dataset.from_dict({"prompt": ["A", "B"], "completion": ["a", "b"]})

def tokenize(example: Dict[str, str]) -> Dict[str, List[int]]:
    # Tokenize prompt and completion as a sentence pair so that
    # token_type_ids marks the completion tokens.
    return tokenizer(example["prompt"], f"{example['completion']}.", return_token_type_ids=True)

dataset = data.map(tokenize, remove_columns=["prompt", "completion"])

args = TrainingArguments(
    output_dir="test",
    report_to="none",
    remove_unused_columns=False,  # keep token_type_ids in the batch
    max_steps=3,
)
trainer = Trainer(
    model=model,
    tokenizer=tokenizer,
    args=args,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
    train_dataset=dataset,
)
trainer.train()
```

which fails with:

```
TypeError: GPTNeoXForCausalLM.forward() got an unexpected keyword argument 'token_type_ids'
```
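For reference, a minimal sketch of the workaround described in this issue: keep the column with `remove_unused_columns=False`, then strip it in an overridden `_prepare_inputs()` before it reaches `model(**inputs)`. The subclass name is illustrative, not part of transformers:

```python
from transformers import Trainer

class PromptCompletionTrainer(Trainer):
    """Illustrative subclass: drops token_type_ids right before the
    forward pass so model(**inputs) never sees it."""

    def _prepare_inputs(self, inputs):
        inputs = super()._prepare_inputs(inputs)
        # Pop the mask here; it could also be stashed on self if
        # compute_metrics needs it later.
        inputs.pop("token_type_ids", None)
        return inputs
```

Using `PromptCompletionTrainer` in place of `Trainer` in the reproduction above lets training proceed.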
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.