pytorch / text

Models, data loaders and abstractions for language processing, powered by PyTorch
https://pytorch.org/text
BSD 3-Clause "New" or "Revised" License

How to use TranslationDataset with an encoder-decoder architecture? #428

Closed kklemon closed 2 years ago

kklemon commented 5 years ago

In an encoder-decoder architecture (see for example [1]), one would normally have an encoder module that learns to create a sparse vector representation of some input data, e.g. a text to be translated, and a decoder that is trained to produce the translation from that vector while being fed its own previous outputs.

During training, one would simply feed the decoder the target text shifted forward by one timestep instead of its own outputs.
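To make the shift concrete, here is a toy illustration (the tokens are made up):

```python
# Toy example of the one-timestep shift between decoder input and decoder target
trg = ['<sos>', 'le', 'chat', 'noir', '<eos>']
decoder_input  = trg[:-1]  # ['<sos>', 'le', 'chat', 'noir']
decoder_target = trg[1:]   # ['le', 'chat', 'noir', '<eos>']
```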


Now I'm struggling really hard to implement such a model using Torchtext, in particular with TranslationDataset, to the point where I wonder whether this is actually possible at all without resorting to some ugly hacks.

In particular, I can't figure out how to make the dataset produce two outputs for the same language data, where one is shifted by a timestep as described above. Ideally, I would just use two fields for those outputs, with the corresponding parameters and preprocessing, both "assigned" to the same language data, but TranslationDataset seems to strictly expect exactly two fields and there is no trivial way to work around this.

A solution I could think of would be to change the behavior of TranslationDataset so that it takes a list of field lists, with each field list assigned to one language.

But I also see general potential for rethinking the relation between Dataset and its fields. The problem, in my opinion, is that the current approach lets the dataset define the columns and then assigns exactly one Field as the processor for each data column/field. This introduces a lot of inflexibility. There are many use cases, for example such an encoder-decoder model, or models that should be conditioned on meta-text features such as the length of a text sample, where it is necessary to produce data samples in which multiple data points originate from the same data field but with different Field instances/processing applied. I could imagine either letting a Field "pull" data from a Dataset, which on the other hand would make Field less flexible and reusable, or replacing the static one-to-one assignment in the Dataset with a kind of mapping mechanism that allows mapping a single data column to multiple Field instances.

[1] Sequence to Sequence Learning with Neural Networks

mttk commented 5 years ago

Firstly, regarding the translation architecture: AFAIK, the way to handle the <sos> tokens can be seen in these two lines of Keon's minimal seq2seq:

  1. Fields - here you essentially create the input to the decoder with the start token.
  2. Loss - here the token is "removed" for the output loss by slicing it out of the tensor. I'd say this might be a better approach than producing two outputs simultaneously, since you reduce memory usage while slicing is relatively cheap (see the sketch below).
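Roughly, the pattern looks like this (a minimal sketch using the legacy Field API; names and shapes are illustrative, not the exact code from the linked repo):

```python
import torch.nn as nn
from torchtext.data import Field  # torchtext.legacy.data in newer releases

SRC = Field(tokenize=str.split, lower=True)
TRG = Field(tokenize=str.split, lower=True, init_token='<sos>', eos_token='<eos>')

def seq2seq_loss(output, trg, pad_idx):
    # output: [trg_len, batch, vocab] decoder predictions; trg: [trg_len, batch].
    # The decoder consumed the <sos>-prefixed target as input; for the loss we
    # slice off the first timestep, so the gold labels are the target shifted by one.
    criterion = nn.CrossEntropyLoss(ignore_index=pad_idx)
    return criterion(output[1:].reshape(-1, output.size(-1)), trg[1:].reshape(-1))
```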

Secondly, you've highlighted something that I'm actively trying to find a good fix for. Ideally, a Field shouldn't always be a predefined function mapping from input to output in a 1:1 manner (as it is now), but a template to which you can attach as many mappings as you want (e.g. 'tokenized' -> chars, 'untokenized' -> statistics, ...). The use case I've had an issue with is getting word and character embeddings out of the same field, which is not currently supported. Similar cases do exist (yours is one of them, and fortunately easier to work around), but extracting meta-text features is another one that has bothered me.

Unfortunately, I cannot give you a timeline as to when this is going to happen, but it is one of the things on my urgent TODO list for this repo.

kklemon commented 5 years ago

Thank you for the detailed reply!

> Firstly, regarding the translation architecture: AFAIK, the way to handle the <sos> tokens can be seen in these two lines of Keon's minimal seq2seq:

This is what I meant before when I said that with the current design, ugly hacks are required to achieve such things, so I'm glad you see it the same way.

I mean, as far as I know there is no official philosophy behind Torchtext except for what can be derived from the code, but I see Torchtext as a great library that puts a layer of abstraction on the data component, which is usually very heavy in NLP projects. When working on NLP problems in the ML/DL field, my colleagues and I usually spend roughly 90% or more of our time on data pre-processing, and even then the results are often error-prone, need a lot of debugging, and are hard to package and share.

Torchtext already increases our productivity a lot, because individual people can concentrate entirely on the data preparation part, defining Fields for the pre-processing and packing everything into a Dataset, while those responsible for building and training the model do not have to care at all about how this was done or what format they will receive.

From my point of view, Dataset represents an abstract dataset which may be read from different kinds of sources and which exposes its parsed data fields, while Field represents the processing applied to those data fields. A dataset can be seen as something fixed - it is just the parsed raw content from a data source - but towards the upper end, where processing and batching take place, there should be more flexibility, whereas currently everything is tightly bound to the strict and static nature of the Dataset class.

As said before, I have different possible ideas in my mind to tackle this:

  1. Let Field pull data from Dataset. A Field could have a fixed dataset assigned as its data source and a fixed field name from which it "pulls" its data during processing. But I don't find this a good idea at all, as it goes against the nature of the Field class, which seems to aim for reusability and hence should stay decoupled from Dataset.
  2. Use an (ordered) dict / tuple in which source field names from the dataset map to lists of fields, which then process the data from those columns (see the sketch below). This approach doesn't appear very clean to me, as you end up passing more or less complex dictionary/tuple structures, but it would most probably take little effort to implement and could even be done in a backwards-compatible way.
  3. Introduce a completely new object, let me call it Processor, which takes over the binding between Dataset and Field. Dataset would essentially be degraded to a loader class, and the assignment between data and Fields would move to this new class. In my opinion this would be slightly more logical and coherent than (2), but it would also introduce more complexity and require large changes to the API.
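To illustrate idea (2), a purely hypothetical sketch of what the call site might look like - this fields= form does not exist in torchtext today, and all paths and variable names are made up:

```python
from torchtext.data import Field
from torchtext.datasets import TranslationDataset  # real class, imagined signature below

src = Field(tokenize=str.split)
trg_in = Field(tokenize=str.split, init_token='<sos>')   # decoder input
trg_out = Field(tokenize=str.split, eos_token='<eos>')   # decoder target (shifted)

train = TranslationDataset(
    path='data/multi30k/train', exts=('.de', '.en'),
    fields={'src': [src],                # 1:1, as today
            'trg': [trg_in, trg_out]})   # 1:n, the proposed extension
```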

I would be glad if we could talk and work on a solution together.

mttk commented 5 years ago

> This is what I meant before when I said that with the current design, ugly hacks are required to achieve such things, so I'm glad you see it the same way.

In this specific case I'd argue the hacky way is better than having two outputs (with and without <sos>), since the memory overhead of duplicating the target far outweighs the small time cost of slicing. But yeah, the current fields are very rigid.

So far, the functionality of torchtext is split across classes as follows:

  1. Dataset is responsible for downloading, loading and storing data from various sources and folder structures in a row (instance)-based format.
  2. Example takes a row and converts it into a column (modal)-based format according to the fields provided.
  3. Fields handle (1) tokenization, (2) numericalization and (3) padding, but are essentially functions mapping from raw text space into input space. Currently, only a 1:1 mapping is allowed.
  4. Iterator's job is to group instances into batches of similar size (a short end-to-end sketch follows).
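For reference, putting those four pieces together with the current API looks roughly like this (a minimal sketch; file paths and column names are made up):

```python
from torchtext.data import Field, TabularDataset, BucketIterator

TEXT = Field(tokenize=str.split, lower=True)      # 3. tokenize / numericalize / pad
LABEL = Field(sequential=False, unk_token=None)

# 1./2. load rows and turn each one into an Example with 'text' and 'label' attributes
train = TabularDataset(path='data/train.tsv', format='tsv',
                       fields=[('text', TEXT), ('label', LABEL)])
TEXT.build_vocab(train)
LABEL.build_vocab(train)

# 4. group instances of similar length into batches
train_iter = BucketIterator(train, batch_size=32,
                            sort_key=lambda ex: len(ex.text))
```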

Now, the issue we have here is mostly the 1:1 nature that a Field currently has. My idea is not far from your 1. I'd leave each field bound to a column (as we do need the preprocessing logic), but I'd keep both the raw and the preprocessed formats (when necessary). However, I wouldn't predefine the outputs of that field.

Along the lines of your 3., the Field would take responsibility for preprocessing, while you can attach hooks (possibly predefined) to your fields. Each hook maps from the raw or preprocessed space into input space. You can attach as many as you want (since the sources are unchanged), and the outputs can be received either as a tuple of tensors or as a dict.

Essentially, a use case could be: you want character & word embeddings & lexical features from a certain Field. You create that field with some preprocessing and add a WordFeature, a CharFeature and a custom hook. Each of these is a function constructing an output tensor based on the raw and/or preprocessed input data. The output tensors can be accessed as a dict / namedtuple based on the names you gave the hooks. Then you can use them as inputs in any way you wish, and define custom hooks (operating on a string, on a list of words, or on both).
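A purely hypothetical sketch of how that might read - neither add_hook nor the feature classes exist in torchtext, the names are just placeholders:

```python
import torch
from torchtext.data import Field

TEXT = Field(tokenize=str.split, lower=True)

# Imagined hook API: each named hook maps raw and/or preprocessed data to one tensor.
TEXT.add_hook('words', WordFeature(min_freq=2))                          # token ids
TEXT.add_hook('chars', CharFeature(max_word_len=20))                     # char ids per token
TEXT.add_hook('stats', lambda raw, tokens: torch.tensor([len(tokens)]))  # custom lexical feature

# A batch would then expose one tensor per hook, e.g.
# batch.text.words, batch.text.chars, batch.text.stats
```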

For this to happen, as you noted, there are quite a few changes that would need to be made to the API, and I'm thinking about the most painless way to do this. Do let me know if you have any suggestions.

keitakurita commented 5 years ago

I'm not sure if this is the appropriate place to discuss this, but I do have a few ideas for how to handle the 1:1 relationship between dataset field and Field object in the short term:

  1. Create a Union Field. One way to accomplish the above change without significantly changing the API is to create a UnionField that manually maps a single field in the dataset to multiple Field objects in Torchtext. How and when we process this field is up for debate, but I would imagine handling this relationship when creating the list of examples, adding multiple fields to the example for each field in the UnionField.

Example:

>>> fld = data.UnionField([data.Field(sequential=True), data.Field(sequential=False)])
>>> pos = data.TabularDataset(
...    path='data/pos/pos_wsj_train.tsv', format='tsv',
...    fields=[('text', fld),
...            ('labels', data.Field())])

  2. Allow 1-to-many relationships for datasets constructed using a dictionary mapping from field names to Field objects. I'm not sure about other datasets, but TabularDataset currently has the option of passing a dictionary mapping keys of the dataset to fields. I'm not entirely sure what would happen right now if we had multiple fields mapping to the same key in the dataset, but I think this is a feature Torchtext could support.

Example:

>>> # a plain dict can't repeat a key, so the 1-to-many form would presumably
>>> # take a list of (name, field) tuples per source key:
>>> sentiment = data.TabularDataset(
...    path='data/sentiment/train.json', format='json',
...    fields={'sentence_tokenized': [('text', data.Field(sequential=True)),
...                                   ('text2', data.Field(sequential=True))],
...            'sentiment_gold': ('labels', data.Field(sequential=False))})

In the long term though, I think it might be a good idea to rethink how we reason about these Fields. Ultimately, Field objects should really be pipelines that can share arbitrary amounts of preprocessing. One way to do this without changing the API too much - which is sort of a combination of your ideas above - might be to have Fields be pipelines composed of Processor objects. The API of the Field itself would not change; it would just delegate the processing to Processor objects. Shared preprocessing would be expressed by sharing Processor objects. Datasets could expose a Processor object that just emits the raw data, which Fields can subscribe to. By allowing Fields to use any Processor object as their point of input, we could reuse Fields flexibly, accommodating the use cases above (I think). I'm not sure how this would be converted into an API in practice though, and I may be missing something important...
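As a rough, entirely hypothetical sketch of that idea (none of these classes or methods exist in torchtext; the names are placeholders):

```python
# Fields delegate to a chain of Processor objects; sharing a Processor instance
# means sharing that stage of preprocessing across Fields.
tokenize = TokenizeProcessor(str.split)  # shared preprocessing stage

word_field = Field(pipeline=[tokenize, NumericalizeProcessor(min_freq=2)])
char_field = Field(pipeline=[tokenize, CharNumericalizeProcessor(max_word_len=20)])

# A Dataset exposes a raw-data Processor that Fields subscribe to,
# so several Fields can consume the same source column independently.
raw_sentences = dataset.processor('sentence')
word_field.connect(raw_sentences)
char_field.connect(raw_sentences)
```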

abhinavarora commented 2 years ago

Closing this issue, as we have gotten rid of TranslationDataset and the issue is no longer relevant.
