pytorch / text

Models, data loaders and abstractions for language processing, powered by PyTorch
https://pytorch.org/text
BSD 3-Clause "New" or "Revised" License
3.51k stars, 810 forks

Add `LengthSetterIterDataPipe` to all torchtext datasets #1943

Open Nayef211 opened 2 years ago

Nayef211 commented 2 years ago

🚀 Feature

We want to append a `LengthSetterIterDataPipe` to the end of every torchtext dataset. This will allow callers to use `len()` on the datapipe object and will prevent errors like `TypeError: DataPipe instance doesn't have valid length`.
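To illustrate the idea, here is a minimal pure-Python sketch of a length-setting wrapper. The `LengthSetter` and `LineStream` classes below are hypothetical stand-ins written for this example, not the torchdata implementation; the real `LengthSetterIterDataPipe` lives in torchdata and is, as far as I know, also reachable through the functional form `set_length`.

```python
class LineStream:
    """Hypothetical streaming dataset with no __len__, like a file-backed datapipe."""

    def __init__(self, n):
        self.n = n

    def __iter__(self):
        # Items are generated on demand; the stream never holds them all at once.
        return (f"line {i}" for i in range(self.n))


class LengthSetter:
    """Hypothetical wrapper that reports a caller-supplied length for an iterable."""

    def __init__(self, source, length):
        if length < 0:
            raise ValueError("length must be non-negative")
        self.source = source
        self.length = length

    def __iter__(self):
        # Pass items through lazily; nothing is buffered.
        yield from self.source

    def __len__(self):
        # The declared value is trusted, not checked against the source.
        return self.length


stream = LineStream(5)

# Without a wrapper, len() fails because the stream defines no __len__.
try:
    len(stream)
    has_len = True
except TypeError:
    has_len = False

# With the wrapper, len() returns the declared value and iteration is unchanged.
sized = LengthSetter(stream, 5)
```

Because the wrapper only stores a number, the caller is responsible for supplying the correct length for each dataset split.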

Motivation

See https://github.com/pytorch/tutorials/pull/1954#discussion_r993951194 for discussion.

Additional Context

Once this has been done for the Multi30k dataset, we can remove the conversion of the datapipe to a list (i.e. `list(train_dataloader)`) in https://github.com/pytorch/tutorials/pull/1954. That conversion materializes every sample in the dataset in memory, which can lead to OOMs for very large datasets.
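The cost of the `list(...)` workaround can be sketched as follows, assuming the list is only needed to obtain a length (for example, to compute the number of batches). The `samples` generator and the `counter` dictionary are hypothetical names for this illustration and are not taken from the tutorial.

```python
import math

counter = {"pulled": 0}  # tracks how many samples the source has produced


def samples(n):
    """Hypothetical lazy sample source standing in for a large dataset."""
    for i in range(n):
        counter["pulled"] += 1
        yield i


total, batch_size = 1000, 32

# Workaround: materialize everything, then measure the list.
num_batches_via_list = math.ceil(len(list(samples(total))) / batch_size)
pulled_by_list = counter["pulled"]

# With a declared length, the same figure comes from arithmetic alone.
counter["pulled"] = 0
lazy = samples(total)  # creating the generator produces no samples yet
num_batches_via_length = math.ceil(total / batch_size)
pulled_by_length = counter["pulled"]
```

Both approaches give the same batch count, but the first one pulls every sample into memory before training even starts.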

moDallel commented 1 year ago

Hello! We are a group of second-year engineering students. We are interested in resolving this issue as a school project. Please let us know if we have your permission to contribute to this issue.

joecummings commented 1 year ago

@moDallel - We welcome contributions! Thanks for your interest in the project.