pytorch / text

Models, data loaders and abstractions for language processing, powered by PyTorch
https://pytorch.org/text
BSD 3-Clause "New" or "Revised" License

Allow python generator when initializing Dataset object #185

Open ShuyangCao opened 6 years ago

ShuyangCao commented 6 years ago

I followed OpenNMT's implementation when using torchtext and found that if the filter_pred argument is not None when initializing a torchtext.data.Dataset object, the examples argument can be a Python generator. But if filter_pred is None, examples can only be a list.

I suggest allowing Python generators in both cases by modifying the code in torchtext/data/dataset.py:

        if filter_pred is not None:
            examples = list(filter(filter_pred, examples))
        self.examples = list(examples)
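
To make the difference concrete, here is a minimal sketch of the behavior described above. `MiniDataset` is a hypothetical stand-in, not torchtext's actual `Dataset` class; it models only how `examples` is handled, and the `always_listify` flag is an assumption added here to switch between the current behavior and the suggested one.

```python
class MiniDataset:
    """Hypothetical stand-in for torchtext.data.Dataset (examples handling only)."""

    def __init__(self, examples, filter_pred=None, always_listify=False):
        if filter_pred is not None:
            # This branch materializes any iterable, generators included.
            examples = list(filter(filter_pred, examples))
        # always_listify=True models the change suggested above;
        # False models storing `examples` as passed in.
        self.examples = list(examples) if always_listify else examples

    def __len__(self):
        return len(self.examples)


def gen_examples():
    # Toy integers; real code would yield torchtext Example objects.
    for i in range(5):
        yield i


# With filter_pred, a generator works because it is listified.
filtered = MiniDataset(gen_examples(), filter_pred=lambda x: x % 2 == 0)
print(len(filtered))  # 3

# Without filter_pred, the generator is stored as-is and len() fails.
unfiltered = MiniDataset(gen_examples())
try:
    len(unfiltered)
except TypeError as err:
    print("len() failed:", err)

# With the suggested change, both cases behave the same.
patched = MiniDataset(gen_examples(), always_listify=True)
print(len(patched))  # 5
```

Note that always listifying trades memory for consistency, since the whole dataset is loaded up front.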

jekbradbury commented 6 years ago

We actually do support generators (without ever listifying them) under certain fairly specific conditions (the iterator can't globally sort or shuffle; see https://github.com/pytorch/text/issues/176#issuecomment-353546176); I just added a PR that allows this generator support even with filter_pred.
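
As a rough illustration of why global sorting or shuffling rules out generators: both need every example in memory at once, whereas filtering and in-order batching can consume a stream one item at a time. The sketch below is not the code from the PR or from torchtext; `lazy_batches` is a hypothetical helper that only shows filtering with filter_pred without ever listifying the input.

```python
from itertools import islice


def lazy_batches(examples, batch_size, filter_pred=None):
    """Yield lists of up to batch_size examples without materializing the input."""
    if filter_pred is not None:
        # In Python 3, filter() returns a lazy iterator, so nothing is listified.
        examples = filter(filter_pred, examples)
    it = iter(examples)
    while True:
        # Pull only the next batch_size items from the stream.
        batch = list(islice(it, batch_size))
        if not batch:
            return
        yield batch


def gen_examples():
    # Toy integers; real code would yield torchtext Example objects.
    for i in range(10):
        yield i


batches = list(
    lazy_batches(gen_examples(), batch_size=2, filter_pred=lambda x: x % 3 != 0)
)
print(batches)  # [[1, 2], [4, 5], [7, 8]]
```

Only one batch is held in memory at a time, which is why this works for in-order iteration but not for a global sort or shuffle.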

JianyuZhan commented 6 years ago

Hi @jekbradbury, thanks for the new PR. When will it be available on PyPI, and in which version?

jekbradbury commented 6 years ago

Aiming to work through the PR backlog this weekend and release as v0.2.1.

JianyuZhan commented 6 years ago

Okay, thanks! Would you please ping me when it is released? We (OpenNMT-py) currently pin the version to 0.1.1 because of a problem in 0.2.0: https://github.com/OpenNMT/OpenNMT-py/issues/368.

I think I can address this along with tackling the lazy dataset problem when 0.2.1 is available.