huggingface / notebooks

Notebooks using the Hugging Face libraries 🤗
Apache License 2.0

tokenizer warning for Multiple choice #434

Open jaideep11061982 opened 1 year ago

jaideep11061982 commented 1 year ago

https://github.com/huggingface/notebooks/blob/main/examples/multiple_choice.ipynb

I think calling tokenizer.pad inside the collator is a slow operation: there is a warning suggesting that we pad when calling tokenizer( ) instead, by passing padding=True there. Padding inside the collator slows down training. Is there any way to use the tokenizer's padding option directly?

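# tokenizer, encoded_datasets and DataCollatorForMultipleChoice come from the notebook linked above
accepted_keys = ["input_ids", "attention_mask", "label"]
features = [{k: v for k, v in encoded_datasets["train"][i].items() if k in accepted_keys} for i in range(10)]
# the warning appears here, when the collator pads the batch with tokenizer.pad
batch = DataCollatorForMultipleChoice(tokenizer)(features)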
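
To make the question concrete, here is a minimal sketch of what "padding up front" could look like. It is not the notebook's code. It uses plain Python lists and two hypothetical helpers, `pad_choices` and `collate_prepadded`, with an assumed pad id of 0. With a real Hugging Face tokenizer, the equivalent step would presumably be passing `padding="max_length"`, `max_length=...` and `truncation=True` in the tokenizer call during preprocessing, so that every choice already has the same length and the collator only has to regroup the examples.

```python
PAD_ID = 0  # assumed pad token id, for illustration only


def pad_choices(example, max_length, pad_id=PAD_ID):
    """Pad (or truncate) every choice of one example to max_length.

    Hypothetical stand-in for padding at tokenization time.
    """
    padded = {"input_ids": [], "attention_mask": []}
    for ids in example["input_ids"]:
        ids = ids[:max_length]            # truncate choices that are too long
        n_pad = max_length - len(ids)     # number of pad tokens still needed
        padded["input_ids"].append(ids + [pad_id] * n_pad)
        padded["attention_mask"].append([1] * len(ids) + [0] * n_pad)
    padded["label"] = example["label"]
    return padded


def collate_prepadded(features):
    """Group already-padded examples into a batch; no padding happens here."""
    return {
        "input_ids": [f["input_ids"] for f in features],
        "attention_mask": [f["attention_mask"] for f in features],
        "labels": [f["label"] for f in features],
    }


# Toy data: 2 examples with 2 choices each, all of different lengths.
examples = [
    {"input_ids": [[101, 7, 8, 102], [101, 9, 102]], "label": 0},
    {"input_ids": [[101, 5, 102], [101, 5, 6, 7, 8, 102]], "label": 1},
]

# Pad once, up front (the analogue of padding inside the tokenizer call).
features = [pad_choices(ex, max_length=5) for ex in examples]
batch = collate_prepadded(features)
```

The trade-off is that every example is padded to the same fixed length, so batches made of short sequences carry more pad tokens than they would with per-batch padding in the collator.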