Open sedol1339 opened 1 month ago
This is the datastets
repository, and the DataCollatorWithPadding
belongs to the transformers
repository.
Hi @sedol1339, thanks for reporting!
Would you like to open a PR fixing the typo? This way you get the github contrib for spotting & suggesting the fix
I am not sure this is a typo. DataCollatorWithPadding
do accept List[Dict[str, Any]]
, as written in type annotation. But it also accept Dict[str, List[Any]]
in some cases (see the example above).
@sedol1339 If the collator accepts both, then we can indicate this with a Union
type annotation
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.
Describe the bug
In
DataCollatorWithPadding
, the method__call__
has the following sitnature:However, in the official tutorial on line
batch = data_collator(samples)
the collator actually accepts notList[Dict[str, Any]]
, butDict[str, List[Any]]
, that is, the already batched sample.This complicates understanding collators and implementing custom ones.
Steps to reproduce the bug
Run all steps in the tutorial: https://huggingface.co/learn/nlp-course/en/chapter3/2
Check the
samples
variable that is passed todata_collator
.Expected behavior
A more clear documentation and type annotation is desired.
Environment info
datasets 2.20.0 Python 3.10.14 Ubuntu 22