pytorch / text

Models, data loaders and abstractions for language processing, powered by PyTorch
https://pytorch.org/text
BSD 3-Clause "New" or "Revised" License

How to build vocab from Glove embedding? #1350

Open hengee opened 3 years ago

hengee commented 3 years ago

❓ How to build vocab from Glove embedding?

Description

How to build vocab from Glove embedding?

I have gone through the documentation and the release update, and I learned that the Vectors object is no longer an attribute of the new Vocab object.

But I would still like to build my vocab using the GloVe embeddings, or perhaps use the GloVe embeddings in my model. Is there a way to do this with the new API?

parmeet commented 3 years ago

Hi @hengee, you may access the Vectors member stoi and use it to build your Vocab. For example, one way to do it is:

from torchtext.vocab import GloVe, vocab
myvec = GloVe()
myvocab = vocab(myvec.stoi)

Note that Vectors is just a convenient wrapper around the original source word vectors (GloVe, FastText, etc.). In order to use them with your model, you can use nn.Embedding and initialize it with the GloVe vectors. For example:

from torchtext.vocab import GloVe
import torch.nn
glove_vectors = GloVe()
# set freeze to False if you want the embeddings to be trainable
my_embeddings = torch.nn.Embedding.from_pretrained(glove_vectors.vectors, freeze=True)
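
For example (a minimal usage sketch, not part of the original reply), the embedding layer can then be queried with the indices that glove_vectors.stoi provides:

import torch
from torchtext.vocab import GloVe

glove_vectors = GloVe()
my_embeddings = torch.nn.Embedding.from_pretrained(glove_vectors.vectors, freeze=True)

# glove_vectors.stoi maps a token to its row in glove_vectors.vectors,
# so the same index retrieves that token's vector from the embedding layer
hello_index = glove_vectors.stoi["hello"]
hello_vector = my_embeddings(torch.tensor([hello_index]))  # shape: [1, 300]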

You can easily modify the code here that initializes the text classification model so that it accepts pre-trained word embeddings.

satyajitghana commented 3 years ago

@parmeet this clears up a lot of things! But these things are surely missing from the official documentation 😞

Would love to have notebook-style examples of the new changes (I will also try to contribute if possible), as they bring all the code into one place and make it easy to experiment with beforehand.

pedropgusmao commented 3 years ago

Thanks everyone for your replies!

(quoting @parmeet's reply above)

@parmeet Is there a way to combine vocab creation with special tokens and initialization of vectors? I.e., can I use myvocab = vocab(myvec.stoi), then expand myvocab to include a default token (in case a token is not found), and then use this myvocab (which contains vectors from GloVe) with nn.Embedding? Or should I first expand GloVe to include an '<unknown_key>' entry with a null vector, then build myvocab = vocab(myvec.stoi), and then set the default index of myvocab to '<unknown_key>'?

parmeet commented 3 years ago

(quoting @pedropgusmao's question above)

Yes, you can expand the existing vocab module with new tokens using the insert_token and append_token APIs.

As an example, refer to the following workflow, where we want to use GloVe vectors and the corresponding vocab for a text classification model:

from torchtext.vocab import GloVe, vocab
from torchtext.datasets import AG_NEWS
from torchtext.data.utils import get_tokenizer
import torch
import torch.nn as nn

# define your model that accepts pre-trained embeddings
class TextClassificationModel(nn.Module):

    def __init__(self, pretrained_embeddings, num_class, freeze_embeddings = False):
        super(TextClassificationModel, self).__init__()
        self.embedding = nn.EmbeddingBag.from_pretrained(pretrained_embeddings, freeze = freeze_embeddings, sparse=True)
        self.fc = nn.Linear(pretrained_embeddings.shape[1], num_class)
        self.init_weights()

    def init_weights(self):
        initrange = 0.5
        self.fc.weight.data.uniform_(-initrange, initrange)
        self.fc.bias.data.zero_()

    def forward(self, text, offsets):
        embedded = self.embedding(text, offsets)
        return self.fc(embedded)

train_iter = AG_NEWS(split = 'train')
num_class = len(set([label for (label, _) in train_iter]))
unk_token = "<unk>"
unk_index = 0
glove_vectors = GloVe()
glove_vocab = vocab(glove_vectors.stoi)
glove_vocab.insert_token(unk_token, unk_index)
# this is necessary, otherwise a RuntimeError is thrown if an OOV token is queried
glove_vocab.set_default_index(unk_index)
pretrained_embeddings = glove_vectors.vectors
pretrained_embeddings = torch.cat((torch.zeros(1,pretrained_embeddings.shape[1]),pretrained_embeddings))

# instantiate the model with pre-trained GloVe vectors
glove_model = TextClassificationModel(pretrained_embeddings, num_class)

tokenizer = get_tokenizer("basic_english")
train_iter = AG_NEWS(split = 'train')
example_text = next(train_iter)[1]
tokens = tokenizer(example_text)
indices = glove_vocab(tokens)
text_input = torch.tensor(indices)
offset_input = torch.tensor([0])

model_output = glove_model(text_input, offset_input)
parmeet commented 3 years ago

(quoting @satyajitghana's comment above)

@satyajitghana Glad it helps! Also, yes, I do agree the documentation can be better. Appreciate your feedback; I have taken a note of this, and we will try to complement the code with notebook examples and design specs (wherever possible). Meanwhile, I do encourage and welcome you to contribute. It will be greatly appreciated :).

ZihaoZheng98 commented 3 years ago

The above methods have solved some of my doubts, thanks! But I still have a question about building a vocab and embeddings from a custom dataset; hopefully you can answer it. Thanks!

How do I build a vocab from a custom dataset and load the vocab's corresponding embeddings from GloVe? Is the method below the best way?

  1. Use the above method to get a GloVe vocab and GloVe vectors.
  2. Build a vocab from the custom dataset.
  3. Traverse the custom vocab and, for each word, find the corresponding vector in the GloVe vectors.
parmeet commented 3 years ago

(quoting @ZihaoZheng98's question above)

Yes, I think that's a viable approach. The only thing I would be careful about is which tokenizer to use to build the vocab for the custom dataset. Pre-trained word vector embeddings like GloVe are from pre-sentencepiece/sub-word times, so I am not sure whether you would find many missing embeddings for tokens that result from sub-word segmentation.
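
A minimal sketch of the three steps above (not part of the original reply; it assumes a hypothetical iterable of token lists called my_token_lists and uses GloVe's get_vecs_by_tokens for the lookup in step 3):

from torchtext.vocab import GloVe, build_vocab_from_iterator
import torch

# my_token_lists is assumed to be an iterable of token lists from the custom dataset,
# e.g. [['hello', 'world'], ['goodbye', 'moon']]
my_vocab = build_vocab_from_iterator(my_token_lists, min_freq=2, specials=['<unk>'])
my_vocab.set_default_index(my_vocab['<unk>'])

glove_vectors = GloVe()
# get_vecs_by_tokens returns a zero vector for any token GloVe does not know
pretrained_embeddings = glove_vectors.get_vecs_by_tokens(my_vocab.get_itos())
embedding = torch.nn.Embedding.from_pretrained(pretrained_embeddings, freeze=False)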

bentrevett commented 3 years ago

In case this is useful for anyone, this is how I've been building a vocabulary and then initializing an nn.Embedding layer using FastText embeddings that are "aligned" with that vocabulary (e.g. if vocab['hello'] = 5, then row 5 of the embedding layer gives the FastText embedding for the string 'hello'):

import torch
import torch.nn as nn
import torchtext

min_freq = 5
special_tokens = ['<unk>', '<pad>']

vocab = torchtext.vocab.build_vocab_from_iterator(train_data['tokens'],
                                                  min_freq=min_freq,
                                                  specials=special_tokens)

# train_data['tokens'] is a list of lists of strings, i.e. [['hello', 'world'], ['goodbye', 'moon']], where ['hello', 'world'] are the tokens corresponding to the first example in the training set.

pretrained_vectors = torchtext.vocab.FastText()

pretrained_embedding = pretrained_vectors.get_vecs_by_tokens(vocab.get_itos())

# vocab.get_itos() returns a list of strings (tokens), where the token at the i'th position is what you get from doing vocab[token]
# get_vecs_by_tokens gets the pre-trained vector for each string when given a list of strings
# therefore pretrained_embedding is a fully "aligned" embedding matrix

class NBoW(nn.Module):
    def __init__(self, vocab_size, embedding_dim, output_dim):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embedding_dim)
        self.fc = nn.Linear(embedding_dim, output_dim)

    def forward(self, text):
        # text = [batch size, seq len]
        embedded = self.embedding(text)
        # embedded = [batch size, seq len, embedding dim]
        pooled = embedded.mean(dim=1)
        # pooled = [batch size, embedding dim]
        prediction = self.fc(pooled)
        # prediction = [batch size, output dim]
        return prediction

vocab_size = len(vocab)
embedding_dim = 300
output_dim = n_classes  # n_classes = the number of target classes in your dataset

model = NBoW(vocab_size, embedding_dim, output_dim)

# super basic model here, the important thing is that the nn.Embedding layer is initialized as nn.Embedding(vocab_size, embedding_dim) with embedding_dim = 300, as that's the dimensionality of the FastText embeddings

model.embedding.weight.data = pretrained_embedding

# overwrite the model's initial embedding matrix weights with that of the pre-trained embeddings from FastText

One thing to note is that the embeddings for tokens in your vocabulary but not in your vectors, i.e. tokens that don't have a FastText embedding, are initialized to a zero vector by default. This can be changed by using the unk_init argument of torchtext.vocab.FastText(); i.e. if you want them initialized from a Normal distribution, you can do something like:

def unk_init(x):
    return torch.randn_like(x)

vectors = torchtext.vocab.FastText(unk_init=unk_init)
pedropgusmao commented 3 years ago

@bentrevett, thanks for your example. In your experience, is there a reason why random vectors would be better than a null vector for <unk>? Is <pad> already the null vector? Should padding be done at the beginning or at the end of a short sentence?

pedropgusmao commented 3 years ago

@bentrevett, would that be pretrained_embedding = pretrained_vectors.get_vecs_by_tokens(vocab.get_itos())?

bentrevett commented 3 years ago

@pedropgusmao

Is there a reason why random vectors would be better than a null vector for <unk>?

Not necessarily. However, in theory, if a large percentage, say 50%, of your vocab weren't in your pre-trained vectors, then you might want each of these vectors to be initialized to a different value so your model immediately knows they are different tokens. I've found that initializing them all to zeros has worked fine, though, and this paper shows that initializing to zeros is pretty much equal to other initialization techniques.

Is <pad> already the null vector?

Neither <unk> nor <pad> is in the FastText vectors, so both will get initialized to zero vectors. Both will change over time as the model trains. You can keep the <pad> vector fixed at zero by setting the padding_idx argument of the nn.Embedding layer to your pad id, i.e. vocab['<pad>'], but I've found that performance is usually identical without doing this.
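
For example (a minimal sketch, reusing the vocab and 300-dimensional FastText setup from the example above):

import torch.nn as nn

pad_index = vocab['<pad>']
embedding = nn.Embedding(len(vocab), 300, padding_idx=pad_index)
# the row at padding_idx is initialized to zeros and receives no gradient updates,
# so the <pad> vector stays fixed at zero during training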

Should padding be done at the beginning or at the end of a short sentence?

I've always done it at the end of a sequence; I think any difference should be negligible. There's probably an argument for padding at the start if you're using an RNN-based model, e.g. an LSTM, so your final hidden state comes from the last actual token in your sequence. However, your LSTM should learn to "close" the input gate when it sees a <pad> token, and you can always use pack_padded_sequence to avoid this.
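
For reference, here is a rough self-contained sketch (not part of the original reply) of how pack_padded_sequence keeps <pad> tokens out of the LSTM's final hidden state:

import torch
import torch.nn as nn
from torch.nn.utils.rnn import pack_padded_sequence, pad_packed_sequence

embedding = nn.Embedding(100, 300, padding_idx=1)   # toy vocab of 100 tokens, <pad> id = 1
lstm = nn.LSTM(300, 128, batch_first=True)

text_ids = torch.tensor([[4, 7, 9, 1, 1],           # batch of two padded sequences
                         [5, 6, 8, 3, 2]])
lengths = torch.tensor([3, 5])                      # true lengths, before padding

embedded = embedding(text_ids)                      # [batch size, seq len, embedding dim]
packed = pack_padded_sequence(embedded, lengths, batch_first=True, enforce_sorted=False)
packed_output, (hidden, cell) = lstm(packed)
output, _ = pad_packed_sequence(packed_output, batch_first=True)
# hidden[-1] is the final hidden state computed only over the real (non-<pad>) tokens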

would that be pretrained_embedding = pretrained_vectors.get_vecs_by_tokens(vocab.get_itos()) ?

Oops, fixed now.

hongjiedai commented 2 years ago

(quoting @parmeet's reply above, ending at glove_vocab = vocab(glove_vectors.stoi))

Thanks for everyone's replies. I noticed that the vocab size generated by the above code is wrong; I had to set min_freq to 0 to get the correct size: glove_vocab = vocab(glove_vectors.stoi, min_freq=0)
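
A quick way to see this (a sketch, assuming the default GloVe download): the stoi values are indices rather than frequencies, so with the default min_freq=1 the entry mapped to index 0 falls below the threshold and is silently dropped, leaving the vocab one token short.

from torchtext.vocab import GloVe, vocab

glove_vectors = GloVe()
default_vocab = vocab(glove_vectors.stoi)             # min_freq defaults to 1
full_vocab = vocab(glove_vectors.stoi, min_freq=0)

# the default vocab is expected to be one entry shorter than the source stoi
print(len(glove_vectors.stoi), len(default_vocab), len(full_vocab))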

the-utkarshjain commented 2 years ago

@parmeet Is it possible to freeze the pre-trained weights but keep the newly concatenated embeddings trainable?
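
One possible approach (a sketch, not from this thread; it reuses glove_vectors.vectors and the prepended <unk> row at index 0 from the earlier workflow): keep freeze=False so the whole matrix is nominally trainable, then zero out the gradients of the pre-trained rows with a hook so only the new row actually gets updated.

import torch
import torch.nn as nn
from torchtext.vocab import GloVe

glove_vectors = GloVe()
# as in the earlier workflow, prepend a zero <unk> row at index 0
weights = torch.cat((torch.zeros(1, glove_vectors.vectors.shape[1]), glove_vectors.vectors))
embedding = nn.Embedding.from_pretrained(weights, freeze=False)

def zero_pretrained_grad(grad):
    # keep gradients only for row 0 (the new <unk> row); the GloVe rows stay fixed
    grad = grad.clone()
    grad[1:] = 0.0
    return grad

embedding.weight.register_hook(zero_pretrained_grad)

Note that optimizer-level weight decay can still nudge the "frozen" rows, so another option is to keep the pre-trained vectors in a separate frozen nn.Embedding and train only a small embedding for the extra tokens.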