Convert iterator-style raw datasets to map-style raw datasets

parmeet commented 3 years ago

🚀 Feature

Motivation

torchtext provides several open-source NLP datasets in raw form. These datasets are provided as iterables. However, there are times when a user may prefer map-style datasets.

Pitch

We would like to implement functionality that converts iterable datasets into map-style datasets. It could be implemented either as a functional that takes the raw iterable dataset as input, or directly as a member function of the raw iterable dataset class.
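To make the two options concrete, here is a minimal sketch of the functional variant. The names `MapStyleDataset`, `to_map_style`, and `fake_raw_iter` are hypothetical and only for illustration; this is not an existing torchtext API, and it assumes the whole dataset fits in memory.

```python
class MapStyleDataset:
    """Hypothetical wrapper exposing an iterable via __len__ and __getitem__."""

    def __init__(self, iterable):
        # Materialize once. This is the step whose memory cost is
        # discussed in the rest of this issue.
        self._data = list(iterable)

    def __len__(self):
        return len(self._data)

    def __getitem__(self, idx):
        return self._data[idx]


def to_map_style(iterable):
    """Functional form of the conversion (the name is an assumption)."""
    return MapStyleDataset(iterable)


def fake_raw_iter():
    """Stand-in for a raw iterable dataset yielding (label, text) pairs."""
    yield ("pos", "great movie")
    yield ("neg", "terrible plot")


dataset = to_map_style(fake_raw_iter())
print(len(dataset), dataset[1])
```

The member-function variant would simply expose the same conversion as a method on the raw iterable dataset class.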

Alternatives

A naive approach would be to simply materialize the iterator into a list, as follows:

from torchtext.datasets import IMDB
train_iter = IMDB(split='train')
train_dataset = list(train_iter)

Unfortunately, passing a list to the PyTorch DataLoader would cause memory regressions with multi-process data loading. More details below.

Additional context

Note that the solution stated in Alternatives would have issues with multi-process data loading. This issue is discussed in detail here (https://github.com/pytorch/pytorch/issues/13246). Another option would be to explore NumPy arrays, but it seems they would suffer from the same issue as lists; see this comment (https://github.com/pytorch/pytorch/issues/13246#issuecomment-445770039).

Potential Solution

Thanks to @cpuhrsch for proposing it here (https://github.com/pytorch/text/pull/1281#discussion_r618908965). One idea would be to create a thin C++ wrapper in which the data is stored in a std::array data structure. This wrapper can then be bound to Python using pybind11. One potential caveat, though, is a performance regression due to the overhead of querying through the binding functions. Hopefully this cost is negligible compared to downstream processing.
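The C++ wrapper itself is not sketched here. As a rough pure-Python analogue of the same idea (keep all samples in one contiguous buffer rather than as one Python object per sample), the following assumes the samples are plain strings. `PackedStringDataset` is a hypothetical name, and whether this layout actually avoids the memory growth seen with multi-process loading would need to be measured.

```python
from array import array


class PackedStringDataset:
    """Hypothetical map-style dataset backed by one bytes buffer plus offsets."""

    def __init__(self, texts):
        chunks = []
        offsets = array("Q", [0])  # unsigned 64-bit byte offsets
        total = 0
        for text in texts:
            encoded = text.encode("utf-8")
            chunks.append(encoded)
            total += len(encoded)
            offsets.append(total)
        # After construction only two large objects remain:
        # the packed buffer and the offsets array.
        self._buffer = b"".join(chunks)
        self._offsets = offsets

    def __len__(self):
        return len(self._offsets) - 1

    def __getitem__(self, idx):
        if idx < 0:
            idx += len(self)
        if not 0 <= idx < len(self):
            raise IndexError(idx)
        start, end = self._offsets[idx], self._offsets[idx + 1]
        # Decoding creates a fresh str per lookup, so no per-sample
        # object is kept alive inside the dataset itself.
        return self._buffer[start:end].decode("utf-8")


packed = PackedStringDataset(["great movie", "", "terrible plot"])
print(len(packed), packed[0], packed[-1])
```

The per-lookup decode is the Python-side counterpart of the binding overhead mentioned above: each access pays a small cost in exchange for compact storage.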

cc: @cpuhrsch

cpuhrsch commented 3 years ago

I think we should also consider a functional that will do this operation for arbitrary iterables, so effectively a factory function for this kind of Dataset. In addition to the memory issues with lists of strings, we should also keep in mind the shared memory semantics. Can we make this data structure shareable across workers? Something that might help here (and is new in Python 3.8) is multiprocessing.shared_memory, which could maybe make this easier.
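A minimal sketch of what the shared_memory route could look like, assuming string samples packed into one UTF-8 buffer with explicit byte offsets. The helper names `put_in_shared_memory` and `read_from_shared_memory` are hypothetical; only `multiprocessing.shared_memory.SharedMemory` is the standard-library API. For brevity the reader below attaches from the same process; a worker process would attach by name in the same way.

```python
from multiprocessing import shared_memory


def put_in_shared_memory(payload):
    """Copy payload bytes into a newly created shared memory block."""
    # The requested size must be positive; the OS may round it up,
    # which is why readers rely on explicit offsets, not the block size.
    shm = shared_memory.SharedMemory(create=True, size=max(len(payload), 1))
    shm.buf[: len(payload)] = payload
    return shm


def read_from_shared_memory(name, start, end):
    """Attach to an existing block by name and decode one sample."""
    shm = shared_memory.SharedMemory(name=name)
    try:
        return bytes(shm.buf[start:end]).decode("utf-8")
    finally:
        shm.close()


texts = ["great movie", "terrible plot"]
encoded = [t.encode("utf-8") for t in texts]
offsets = [0]
for chunk in encoded:
    offsets.append(offsets[-1] + len(chunk))

shm = put_in_shared_memory(b"".join(encoded))
try:
    second = read_from_shared_memory(shm.name, offsets[1], offsets[2])
finally:
    shm.close()
    shm.unlink()  # the creator is responsible for freeing the block

print(second)
```

A factory function along these lines would still need to decide who owns the block and when it is unlinked, which this sketch leaves to the caller.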

Kousik-Sasmal commented 1 year ago

@parmeet @cpuhrsch Is this issue still open? If yes, can you please assign this to me?


Convert iterator-style raw datasets to map-style raw datasets #1296
