pytorch / text

Models, data loaders and abstractions for language processing, powered by PyTorch
https://pytorch.org/text
BSD 3-Clause "New" or "Revised" License

Torchtext Dataset/Dataloader with generator #1781

Open dipta007 opened 2 years ago

dipta007 commented 2 years ago

❓ Questions and Help

Description

For a large corpus, I couldn't find a way to back a torchtext dataset with an iterator, the way a PyTorch dataset can be. Is it possible to build a dataset directly from a generator, or to implement something like a PyTorch dataset object that pulls the data dynamically?

parmeet commented 2 years ago

Hi @dipta007, in torchtext 0.12 we migrated our datasets to be built on top of torchdata. You can look at the dataset implementations, which offer plenty of examples, or refer to the torchdata documentation for additional information on usage and the functionality available in datapipes.

In general, datapipes let you construct iterable-style datasets and can be used with a large corpus. For instance, unlike map-style datasets, you do not have to read all of the data into memory to work with datapipes. They work in a streaming fashion, producing samples one at a time as you iterate.
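To make the streaming idea concrete, here is a minimal sketch of an iterable-style dataset backed by a generator. It is not torchtext's or torchdata's own implementation: it uses the core `torch.utils.data.IterableDataset` base class, and `LineStreamDataset`, `write_demo_corpus`, and the demo file are hypothetical names introduced only for illustration. The import fallback is there solely so the sketch also runs where PyTorch is not installed.

```python
import os
import tempfile

try:
    from torch.utils.data import IterableDataset
except ImportError:
    # Assumption: fall back to a plain base class so the sketch still runs
    # without PyTorch. With PyTorch installed, the real base class is used.
    IterableDataset = object


class LineStreamDataset(IterableDataset):
    """Hypothetical iterable-style dataset: yields one non-empty line at a time."""

    def __init__(self, path, encoding="utf-8"):
        self.path = path
        self.encoding = encoding

    def __iter__(self):
        # The file is read lazily, so only the current line is held in memory.
        with open(self.path, encoding=self.encoding) as handle:
            for line in handle:
                line = line.rstrip("\n")
                if line:
                    yield line


def write_demo_corpus(lines):
    """Write a tiny stand-in corpus to a temporary file and return its path."""
    fd, path = tempfile.mkstemp(suffix=".txt")
    with os.fdopen(fd, "w", encoding="utf-8") as handle:
        handle.write("\n".join(lines) + "\n")
    return path


demo_path = write_demo_corpus(["first sentence", "second sentence", "third sentence"])
dataset = LineStreamDataset(demo_path)

# Each pass over the dataset reopens the file and streams it again.
first_pass = list(dataset)
second_pass = list(dataset)

os.remove(demo_path)
```

With PyTorch installed, an object like this can be passed to `torch.utils.data.DataLoader(dataset, batch_size=2)` to get batches of lines. Note that with `num_workers > 0` each worker would replay the full stream unless the iteration is sharded per worker.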