pytorch / text

Models, data loaders and abstractions for language processing, powered by PyTorch
https://pytorch.org/text
BSD 3-Clause "New" or "Revised" License

How to use pretrained embeddings (`Vectors`) in the new API? #1323

Open satyajitghana opened 3 years ago

satyajitghana commented 3 years ago

From what I see in the experimental module, we pass a vocab object, which transforms each token into a unique integer.

https://github.com/pytorch/text/blob/e189c260e959ab966b1eaa986177549a6445858c/torchtext/experimental/datasets/text_classification.py#L50-L55

So something like ['hello', 'world'] might turn into [42, 43], which can then be fed into an nn.Embedding layer to get the corresponding embedding vectors, and so on.
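
The flow I have in mind is roughly this (just a sketch; the vocab here is a plain dict standing in for whatever build_vocab_from_iterator returns):

import torch
import torch.nn as nn

# stand-in for the real vocab: token -> integer id
vocab = {'<pad>': 0, '<unk>': 1, 'hello': 42, 'world': 43}

indices = torch.tensor([vocab[t] for t in ['hello', 'world']])    # tensor([42, 43])
embedding = nn.Embedding(num_embeddings=1000, embedding_dim=100)  # weights learned from scratch
out = embedding(indices)                                          # shape (2, 100)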

What I don't understand is how to use

https://github.com/pytorch/text/blob/e189c260e959ab966b1eaa986177549a6445858c/torchtext/vocab.py#L475-L487

GloVe is a Vectors instance, but it transforms ['hello', 'world'] directly into the corresponding embedding tensor, which doesn't allow me to pad the sentences beforehand.
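
For example (a sketch, assuming the 6B/100d GloVe vectors have been downloaded):

from torchtext.vocab import GloVe

glove = GloVe(name='6B', dim=100)  # downloads and caches the vectors on first use
vecs = glove.get_vecs_by_tokens(['hello', 'world'], lower_case_backup=True)
print(vecs.shape)  # torch.Size([2, 100]) -- already dense floats, no integer ids left to pad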

Also, it's weird that I now don't need a Vocab object, yet in most of the modules I see a Vocab being built if it's set to None.

https://github.com/pytorch/text/blob/e189c260e959ab966b1eaa986177549a6445858c/torchtext/experimental/datasets/text_classification.py#L85-L89

I don't really understand how I'm supposed to interpret Vocab and Vectors, or where I should use them: in the nn.Module (my model) or in the data.Dataset (my dataset)? What if I want to fine-tune the pretrained embeddings as well?

Should both of them be used, or just one of them?

I couldn't even find good examples in https://github.com/pytorch/text/tree/master/examples/text_classification

I'm coming from the torchvision side, so kudos for dumping the old legacy-style torchtext; I really hated it. The new APIs seem promising, just a little confusing as of now.

satyajitghana commented 3 years ago

I partially solved it with this kind of approach:

class TweetsDataset(Dataset):
    """Text-classification dataset built from a cleaned tweets CSV."""

    URL = 'https://drive.google.com/uc?id=1gCEb9iRVYet15O4Tqvrj9Fjq9lVbSEQg'
    OUTPUT = 'tweets_cleaned.csv'

    def __init__(self, root, vocab=None, vectors=None, text_transforms=None, label_transforms=None, ngrams=1):
        """Initiate text-classification dataset.
        Args:
            vocab: Vocabulary object used for dataset.
        """

        super().__init__()

        if vocab and vectors:
            raise ValueError('vocab and vectors cannot both be provided')

        self.vocab = vocab
        self.vectors = vectors

        gdown.cached_download(self.URL, Path(root) / self.OUTPUT)

        self.generate_tweet_dataset(Path(root) / self.OUTPUT)

        tokenizer = get_tokenizer("spacy", language="en_core_web_sm")

        # the text transform can only work at the sentence level
        # the rest of tokenization and vocab is done by this class
        self.text_transform = text_f.sequential_transforms(tokenizer, text_f.ngrams_func(ngrams))

        self.vocab_transforms = text_f.sequential_transforms()
        self.vector_transforms = text_f.sequential_transforms()

        def build_vocab(data, transforms):
            def apply_transforms(data):
                for line in data:
                    yield transforms(line)
            return build_vocab_from_iterator(apply_transforms(data), len(data))

        if self.vectors:
            self.vector_transforms = text_f.sequential_transforms(
                partial(vectors.get_vecs_by_tokens, lower_case_backup=True)
            )
        elif self.vocab is None:
            # vocab is always built on the train dataset
            self.vocab = build_vocab(self.dataset["tweets_cleaned"], self.text_transform)

        if self.vocab:
            self.vocab_transforms = text_f.sequential_transforms(
                text_f.vocab_func(self.vocab), text_f.totensor(dtype=torch.long)
            )

...

    def collator_fn(self, raw_texts=False):
        def collate_fn(batch):

            # use a different name than the raw_texts flag so it isn't shadowed
            labels, sequences, texts = zip(*batch)

            labels = torch.stack(labels)

            lengths = torch.LongTensor([len(sequence) for sequence in sequences])

            if not self.vectors:
                pad_idx = self.get_vocab()['<pad>']
                sequences = torch.nn.utils.rnn.pad_sequence(sequences, 
                                                            padding_value = pad_idx,
                                                            batch_first=True
                                                            )
            if raw_texts:
                return labels, sequences, lengths, texts
            else:
                return labels, sequences, lengths

        return collate_fn
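
For context, this is roughly how I wire it into a DataLoader (a sketch; collator_fn above returns the actual collate function, and each dataset item is assumed to be a (label, sequence, raw_text) triple):

from torch.utils.data import DataLoader

dataset = TweetsDataset(root='.data')
loader = DataLoader(dataset, batch_size=32, shuffle=True,
                    collate_fn=dataset.collator_fn())

for labels, sequences, lengths in loader:
    pass  # sequences are padded index tensors when a vocab is used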

But I'm still not sure how I'd use the pretrained embeddings with a model. I can do it in a non-elegant way (roughly the sketch below), but I'd like to know how it's supposed to be done.
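
The non-elegant way I had in mind looks roughly like this (just a sketch; it assumes the dataset was built with a vocab, and that the vocab exposes get_itos() and a '<pad>' token):

import torch.nn as nn
from torchtext.vocab import GloVe

glove = GloVe(name='6B', dim=100)

# one row of pretrained weights per vocab entry, aligned with the vocab's integer ids
weights = glove.get_vecs_by_tokens(vocab.get_itos(), lower_case_backup=True)

embedding = nn.Embedding.from_pretrained(weights,
                                         freeze=False,               # allow fine-tuning
                                         padding_idx=vocab['<pad>'])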

hengee commented 3 years ago

Having the same problem; I find it messy in the new API in both 0.9 and 0.10.

parmeet commented 3 years ago

Related issue #1350