pytorch / text

Models, data loaders and abstractions for language processing, powered by PyTorch
https://pytorch.org/text
BSD 3-Clause "New" or "Revised" License

Support for Multiple Fields on a Data Column #212

Open kolloldas opened 6 years ago

kolloldas commented 6 years ago

First of all, thanks for this great time saver of a library!

I tried using SequenceTaggingDataset for the CoNLL-2003 NER task, which needs both word and character embeddings for high accuracy. The newly added NestedField works very well, but there currently doesn't seem to be a way to apply multiple fields to a single data column. I needed something like:

fields = [(('inputs_word', 'inputs_char'), (word_field, char_field)), ... ('ner_tag', tag_field)]

So I made a few modifications. Specifically in example.py:

    @classmethod
    def fromlist(cls, data, fields):
        ex = cls()
        for (name, field), val in zip(fields, data):
            if field is not None:
                if isinstance(val, six.string_types):
                    val = val.rstrip('\n')
                # Handle field tuples
                if isinstance(name, tuple):
                    for n, f in zip(name, field):
                        setattr(ex, n, f.preprocess(val))
                else:
                    setattr(ex, name, field.preprocess(val))
        return ex
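
To try the tuple handling above without installing torchtext, here is a minimal, self-contained sketch. ToyField and ToyExample are hypothetical stand-ins that model only the preprocess() call; they are not the library's classes, and the sample sentence is made up for illustration.

```python
class ToyField:
    """Hypothetical stand-in for a torchtext Field; only preprocess() is modeled."""

    def __init__(self, preprocess_fn):
        self.preprocess_fn = preprocess_fn

    def preprocess(self, value):
        return self.preprocess_fn(value)


class ToyExample:
    """Hypothetical stand-in for Example, with the tuple-aware fromlist."""

    @classmethod
    def fromlist(cls, data, fields):
        ex = cls()
        for (name, field), val in zip(fields, data):
            if field is None:
                continue  # column is skipped entirely
            if isinstance(val, str):
                val = val.rstrip('\n')
            if isinstance(name, tuple):
                # One column feeds several (name, field) pairs.
                for n, f in zip(name, field):
                    setattr(ex, n, f.preprocess(val))
            else:
                setattr(ex, name, field.preprocess(val))
        return ex


word_field = ToyField(lambda s: s.split())
char_field = ToyField(lambda s: [list(w) for w in s.split()])
tag_field = ToyField(lambda s: s.split())

fields = [(('inputs_word', 'inputs_char'), (word_field, char_field)),
          ('ner_tag', tag_field)]

ex = ToyExample.fromlist(["EU rejects\n", "B-ORG O\n"], fields)
print(ex.inputs_word, ex.inputs_char, ex.ner_tag)
```

The same raw column value is passed to each field's preprocess(), so the word-level and character-level views stay aligned token by token.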

And in dataset.py:

class Dataset(torch.utils.data.Dataset):
    ...
    def __init__(self, examples, fields, filter_pred=None):
        ...
        self.fields = dict(fields)
        # Unpack field tuples
        for n, f in list(self.fields.items()):
            if isinstance(n, tuple):
                self.fields.update(zip(n, f))
                del self.fields[n]
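
The unpacking step can also be checked in isolation. The sketch below pulls it into a hypothetical helper, unpack_fields (not part of torchtext), and uses plain strings in place of real Field objects so that it runs on its own.

```python
def unpack_fields(fields):
    """Flatten tuple-keyed entries so every name maps to exactly one field."""
    result = dict(fields)
    # Iterate over a snapshot, since the dict is modified inside the loop.
    for name, field in list(result.items()):
        if isinstance(name, tuple):
            result.update(zip(name, field))
            del result[name]
    return result


# Strings stand in for Field objects here (assumption for illustration).
fields = [(('inputs_word', 'inputs_char'), ('WORD_FIELD', 'CHAR_FIELD')),
          ('ner_tag', 'TAG_FIELD')]

flat = unpack_fields(fields)
print(flat)
```

After unpacking, code that looks up fields by name (for example when building vocabularies or batches) sees 'inputs_word' and 'inputs_char' as ordinary entries.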

It seems to work fine but is there a better way to do this? Did I miss something?

Thanks!

jekbradbury commented 6 years ago

Yes, I think this works. The other way to do it would be to pass fromlist a list where the relevant column appears twice, something that I believe is supported by the TSV/CSV and/or JSON loaders (I think we did that for the SNLI dataset class?). But adding something like this would make things more convenient. If you make a PR before I get to it, can you also update TabularDataset to use this code path?
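
As a rough sketch of the "column appears twice" alternative: the row is expanded before it reaches fromlist, so each copy can be paired with its own (name, field) entry in a flat fields list. duplicate_column is a hypothetical helper written for this illustration, not something torchtext provides, and the row contents are made up.

```python
def duplicate_column(row, index, copies=2):
    """Return a new row in which row[index] is repeated `copies` times."""
    return row[:index] + [row[index]] * copies + row[index + 1:]


row = ["EU rejects", "B-ORG O"]
expanded = duplicate_column(row, 0)

# With the column duplicated, a flat list of names lines up one-to-one.
flat_names = ['inputs_word', 'inputs_char', 'ner_tag']
paired = dict(zip(flat_names, expanded))
print(paired)
```

This avoids changing Example or Dataset, at the cost of every caller having to reshape rows to match the fields list.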

kolloldas commented 6 years ago

Sure, I will update TabularDataset as required when I make the PR. I think any call that ends in Example.fromdict will already support multiple fields, so I'll check the CSV/TSV path without headers, which uses Example.fromlist.

elkotito commented 4 years ago

Isn't this already merged? See https://github.com/pytorch/text/pull/222