pytorch / text

Models, data loaders and abstractions for language processing, powered by PyTorch
https://pytorch.org/text
BSD 3-Clause "New" or "Revised" License

Migrating from Field, TabularDataset, BucketIterator #969

Open david-waterworth opened 4 years ago

david-waterworth commented 4 years ago

Is there a guide on how to migrate from the soon-to-be-deprecated Field, TabularDataset and BucketIterator? Specifically for custom datasets - I need to pad and add start / end of sequence tokens for a transformer seq2seq model.

I've been looking through the code, e.g. https://github.com/pytorch/text/blob/master/torchtext/experimental/datasets/translation.py, to try to work out what the intention behind the deprecation is and what resources are available.

I've created my own version which pretty much works. But what I noticed is that when the Vocab is built, i.e. https://github.com/pytorch/text/blob/db31b5dc046345c2ef70a7d578757053e0bd3bd9/torchtext/experimental/datasets/translation.py#L56, there is no way to add specials like '<sos>' and '<eos>', unlike the version called within Field. The build_vocab implementation there, i.e. https://github.com/pytorch/text/blob/db31b5dc046345c2ef70a7d578757053e0bd3bd9/torchtext/experimental/datasets/translation.py#L10, differs from the version used by Field, which allows passing **kwargs: https://github.com/pytorch/text/blob/db31b5dc046345c2ef70a7d578757053e0bd3bd9/torchtext/data/field.py#L274

Also, the tensors aren't padded (I guess that used to happen in either TabularDataset or BucketIterator?).

Is the intention to add both padding and '<sos>' / '<eos>' using a collate_fn? This would make some sense, except that you would still have to add the additional specials to the vocab - which makes debugging a little harder, as it's nice to add them before the rest of the vocab so that their indices are constant and recognisable.
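In case it helps, here's a minimal sketch of the kind of workaround I have in mind (not an official migration path - train_lines and tokenizer are placeholders, and this uses torchtext.vocab.Vocab directly, which at least for now still accepts a specials list): build the Vocab with the extra specials up front so their indices stay fixed, then add '<sos>' / '<eos>' and pad inside a collate_fn passed to DataLoader.

from collections import Counter

import torch
from torch.nn.utils.rnn import pad_sequence
from torchtext.vocab import Vocab

# Specials are inserted first, so <unk>/<pad>/<sos>/<eos> get constant, recognisable indices.
counter = Counter(tok for line in train_lines for tok in tokenizer(line))
vocab = Vocab(counter, specials=['<unk>', '<pad>', '<sos>', '<eos>'])

def collate_fn(batch):
    # batch is a list of token lists: wrap with <sos>/<eos>, numericalise, then pad to the longest sequence.
    tensors = [torch.tensor([vocab.stoi['<sos>']]
                            + [vocab.stoi[tok] for tok in tokens]
                            + [vocab.stoi['<eos>']])
               for tokens in batch]
    return pad_sequence(tensors, padding_value=vocab.stoi['<pad>'])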

kqf commented 4 years ago

I have the same issue. Previously I was using Field to preprocess my data and this code to produce the training dataset:

from torchtext.data import Dataset, Example

examples = [Example.fromlist(f, fields) for f in zip(*preprocessed)]
dataset = Dataset(examples, fields)

Now all these things are deprecated, and there are no examples of how to migrate. I guess the question is: what is the intended way to use the library for custom datasets now?
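For context, the snippet above assumes fields and preprocessed along these lines (the field names, tokenizer and data here are just illustrative):

from torchtext.data import Field

# One (name, Field) pair per column; preprocessed is a tuple of parallel lists of raw strings.
SRC = Field(tokenize=str.split, init_token='<sos>', eos_token='<eos>', lower=True)
TRG = Field(tokenize=str.split, init_token='<sos>', eos_token='<eos>', lower=True)
fields = [('src', SRC), ('trg', TRG)]

preprocessed = (['a first source sentence', 'another source sentence'],
                ['a first target sentence', 'another target sentence'])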

zhangguanheng66 commented 4 years ago

We are going to put together an example to show how to build the pipeline with the new building blocks by the 0.8.0 release. In the meantime, all the experimental datasets (link) are based on the new abstraction.

faleandroid commented 3 years ago

> We are going to put together an example to show how to build the pipeline with the new building blocks by the 0.8.0 release. In the meantime, all the experimental datasets (link) are based on the new abstraction.

Have you written this guide? If so, could you please link it?

zhangguanheng66 commented 3 years ago

> > We are going to put together an example to show how to build the pipeline with the new building blocks by the 0.8.0 release. In the meantime, all the experimental datasets (link) are based on the new abstraction.
>
> Have you written this guide? If so, could you please link it?

For now, please still refer to the examples in the examples folder, including the data pipeline examples. Instead of Field and Iterator, we suggest switching to the building blocks shown in those data pipeline examples together with the PyTorch DataLoader.
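Roughly, the suggested pattern looks like this (a sketch only - src_lines, trg_lines and collate_fn are placeholders you would supply, and the tokenizer choice is arbitrary): tokenise up front, keep the pairs in a plain map-style Dataset, and let DataLoader plus a collate_fn handle numericalisation, special tokens and padding.

from torch.utils.data import DataLoader, Dataset
from torchtext.data.utils import get_tokenizer

tokenizer = get_tokenizer('basic_english')

class TranslationPairs(Dataset):
    # Plain map-style dataset over pre-tokenised (source, target) pairs.
    def __init__(self, src_lines, trg_lines):
        self.pairs = [(tokenizer(s), tokenizer(t)) for s, t in zip(src_lines, trg_lines)]

    def __len__(self):
        return len(self.pairs)

    def __getitem__(self, idx):
        return self.pairs[idx]

# collate_fn is where numericalisation, <sos>/<eos> insertion and padding happen,
# e.g. along the lines of the collate_fn sketched earlier in this thread.
loader = DataLoader(TranslationPairs(src_lines, trg_lines), batch_size=32,
                    shuffle=True, collate_fn=collate_fn)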