huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0
131.99k stars 26.29k forks source link

Misleading type annotations in data collator #32528

Open sedol1339 opened 1 month ago

sedol1339 commented 1 month ago

Describe the bug

In DataCollatorWithPadding, the method __call__ has the following sitnature:

def __call__(self, features: List[Dict[str, Any]]) -> Dict[str, Any]:

However, in the official tutorial on line batch = data_collator(samples) the collator actually accepts not List[Dict[str, Any]], but Dict[str, List[Any]], that is, the already batched sample.

This complicates understanding collators and implementing custom ones.

Steps to reproduce the bug

Run all steps in the tutorial: https://huggingface.co/learn/nlp-course/en/chapter3/2

Check the samples variable that is passed to data_collator.

Expected behavior

A more clear documentation and type annotation is desired.

Environment info

datasets 2.20.0 Python 3.10.14 Ubuntu 22

albertvillanova commented 1 month ago

This is the datastets repository, and the DataCollatorWithPadding belongs to the transformers repository.

amyeroberts commented 1 month ago

Hi @sedol1339, thanks for reporting!

Would you like to open a PR fixing the typo? This way you get the github contrib for spotting & suggesting the fix

sedol1339 commented 1 month ago

I am not sure this is a typo. DataCollatorWithPadding do accept List[Dict[str, Any]], as written in type annotation. But it also accept Dict[str, List[Any]] in some cases (see the example above).

amyeroberts commented 1 month ago

@sedol1339 If the collator accepts both, then we can indicate this with a Union type annotation

github-actions[bot] commented 23 hours ago

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.