david-waterworth opened 4 years ago
I have the same issue. Previously I was using Field to preprocess my data and this code to produce the training dataset:
from torchtext.data import Example, Dataset  # legacy torchtext.data API
examples = [Example.fromlist(f, fields) for f in zip(*preprocessed)]
dataset = Dataset(examples, fields)
Now all these things are deprecated, and there are no examples of how to migrate. I guess the question is: what is the intended way to use the library for custom datasets now?
We are going to put together an example to show how to build the pipeline with the new building blocks by the 0.8.0 release. In the meantime, all the experimental datasets (link) are based on the new abstraction.
Have you written this guide? If so, could you please link it?
For now, please refer to the examples in the examples folder, including the data pipeline examples. Instead of Field and Iterator, we suggest switching to the building blocks in the data pipeline examples and the PyTorch DataLoader.
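For a custom dataset, the migration could look roughly like the sketch below. This is only an illustration, not the official guide: it assumes the 0.8-era pieces (get_tokenizer, the legacy Counter-based Vocab that still accepts a specials argument) plus a plain torch.utils.data.DataLoader. The names raw_data, numericalize, and collate are placeholders, not library API.

```python
# A minimal sketch of a custom-dataset pipeline with the new building blocks.
from collections import Counter

import torch
from torch.nn.utils.rnn import pad_sequence
from torch.utils.data import DataLoader
from torchtext.data.utils import get_tokenizer
from torchtext.vocab import Vocab

raw_data = ["a sample sentence", "another sample"]  # stand-in for a custom corpus
tokenizer = get_tokenizer("basic_english")

# Build the vocab up front; registering specials here keeps their indices fixed.
counter = Counter()
for line in raw_data:
    counter.update(tokenizer(line))
vocab = Vocab(counter, specials=["<unk>", "<pad>", "<sos>", "<eos>"])

def numericalize(line):
    return torch.tensor([vocab.stoi[tok] for tok in tokenizer(line)], dtype=torch.long)

# A plain list of tensors already satisfies the map-style dataset protocol.
data = [numericalize(line) for line in raw_data]

def collate(batch):
    # Per-batch padding replaces what BucketIterator used to do for you.
    return pad_sequence(batch, padding_value=vocab.stoi["<pad>"])

loader = DataLoader(data, batch_size=2, shuffle=True, collate_fn=collate)
```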
Is there a guide on how to migrate from the soon-to-be-deprecated Field, TabularDataset, and BucketIterator? Specifically for custom datasets - I need to pad and add start / end of sequence tokens for a transformer seq2seq model.
I've been looking through the code, e.g. https://github.com/pytorch/text/blob/master/torchtext/experimental/datasets/translation.py, to try and work out what the intention of the deprecation is and what resources are available.
I've created my own version which pretty much works. But what I noticed is that when the Vocab is built, i.e. https://github.com/pytorch/text/blob/db31b5dc046345c2ef70a7d578757053e0bd3bd9/torchtext/experimental/datasets/translation.py#L56, there is no way to add specials like <sos> and <eos>, unlike the version called within Field. The build_vocab implementation, i.e. https://github.com/pytorch/text/blob/db31b5dc046345c2ef70a7d578757053e0bd3bd9/torchtext/experimental/datasets/translation.py#L10, differs from the version used by Field, which allows passing **kwargs: https://github.com/pytorch/text/blob/db31b5dc046345c2ef70a7d578757053e0bd3bd9/torchtext/data/field.py#L274
Also, the tensors aren't padded (I guess that occurred in either TabularDataset or BucketIterator?).
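To make the difference concrete, here is a small sketch against the 0.8-era legacy API, where torchtext.vocab.Vocab still accepts a specials argument (which is what Field.build_vocab forwards via **kwargs):

```python
from collections import Counter
from torchtext.vocab import Vocab

counter = Counter({"hello": 4, "world": 3})

# What Field.build_vocab(..., specials=[...]) effectively does:
v = Vocab(counter, specials=["<unk>", "<pad>", "<sos>", "<eos>"])
print(v.stoi["<sos>"], v.stoi["<eos>"])  # 2 3 - constant, recognisable indices

# The experimental build_vocab is effectively just Vocab(counter), so only
# the default specials ("<unk>", "<pad>") end up in the vocab.
```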
Is the intention to add both padding and <sos> / <eos> using a collate_fn? This would make some sense, except that you would still have to add the extra specials to the vocab anyway - and it's nicer to add them before the vocab is actually built, so that their values are constant and recognisable, which makes debugging easier.
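If collate_fn is indeed the intended place, a minimal sketch might look like the following; the index constants are assumptions that only hold if the specials were registered when building the vocab, as above. Padding then happens per batch rather than over the whole dataset.

```python
import torch
from torch.nn.utils.rnn import pad_sequence

# Assumed indices: valid only with specials=["<unk>", "<pad>", "<sos>", "<eos>"]
PAD_IDX, SOS_IDX, EOS_IDX = 1, 2, 3

def collate_fn(batch):
    # batch: list of 1-D LongTensors of token ids
    batch = [
        torch.cat([torch.tensor([SOS_IDX]), seq, torch.tensor([EOS_IDX])])
        for seq in batch
    ]
    return pad_sequence(batch, padding_value=PAD_IDX)
```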