Open tcqiuyu opened 6 years ago
Yeah, I believe there isn't a way to do this currently, other than 1. loading the `Vectors` object and then 2. changing the `stoi` and `itos` of the already-built vocab object (`Field.vocab`) to include all the words in the vectors file. This is something that will be changed in the near future, but there is no easy way to do it now.
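The two-step workaround above can be sketched with plain Python structures standing in for the torchtext objects (the names here are illustrative, not the actual torchtext API):

```python
# Sketch of the workaround: extend an already-built vocab's stoi/itos
# with every word that appears in the loaded vectors file.
# vector_words stands in for the vocabulary of a loaded Vectors object.
vector_words = ["the", "cat", "sat", "mat", "unseen"]

# itos/stoi stand in for the mappings on Field.vocab built from the dataset.
itos = ["<unk>", "<pad>", "the", "cat"]      # index -> string
stoi = {w: i for i, w in enumerate(itos)}    # string -> index

# Add each vectors-file word the dataset-built vocab is missing.
for w in vector_words:
    if w not in stoi:
        stoi[w] = len(itos)
        itos.append(w)

print(len(itos))  # vocab now covers all vector words
```

After this, the vocab indexes every word in the vectors file, so no pretrained embedding is dropped at lookup time.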
Thanks for answering.
For now, can I add all the words from the pretrained embedding dict to a Counter and then pass it to construct a Vocab? Would that give me a complete word-vector vocabulary?
Yeah, that actually also sounds like a good option. You can create your own `Vocab` instance from a `Counter` and assign it to a `Field` manually.
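A minimal sketch of the Counter-based approach, assuming a hypothetical `pretrained_embeddings` dict of `{word: vector}` (only the `Counter` part is runnable here; the torchtext calls are shown as comments):

```python
from collections import Counter

# Hypothetical pretrained embedding dict; in practice this would be the
# full word2vec/GloVe vocabulary loaded from disk.
pretrained_embeddings = {"the": [0.1], "cat": [0.2], "sat": [0.3]}

# Count every word once so the resulting Vocab covers the full vector vocab.
counter = Counter(pretrained_embeddings.keys())

# With torchtext this counter would then be used roughly as:
#   vocab = torchtext.vocab.Vocab(counter, vectors=my_vectors)
#   text_field.vocab = vocab
# Here we just verify the counter covers every embedding word.
print(sorted(counter))
```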
Cool, thanks a lot!
I'm writing a strategy for this. I've been thinking about adding these two params to `Vocab`'s constructor: `keep_rare_with_vectors` and `add_vectors_vocab` (both `False` by default).

- `keep_rare_with_vectors`: if `True` and vectors are passed, words that appear fewer than `min_freq` times but are in the vectors vocab are kept.
- `add_vectors_vocab`: if `True` and vectors are passed, words that are not in the datasets but are in the vectors vocab are added. This is going to create a very large vocab, though. So if `max_size` is also passed, should we add all the vectors' words and display a warning saying that the vocabulary will be larger than `max_size`?

@mttk, what do you think?
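The proposed filtering logic could look roughly like this (a sketch of the idea only; `keep_rare_with_vectors` and `add_vectors_vocab` are the proposed parameters, not an existing torchtext API):

```python
from collections import Counter

def build_itos(counter, min_freq=1, vectors_vocab=None,
               keep_rare_with_vectors=False, add_vectors_vocab=False):
    """Sketch of the proposed Vocab filtering (hypothetical parameters)."""
    vectors_vocab = vectors_vocab or set()
    itos = []
    for word, freq in sorted(counter.items()):
        keep = freq >= min_freq
        # keep_rare_with_vectors: retain rare words that have a vector.
        if keep_rare_with_vectors and word in vectors_vocab:
            keep = True
        if keep:
            itos.append(word)
    if add_vectors_vocab:
        # add_vectors_vocab: include vector words absent from the datasets.
        for word in sorted(vectors_vocab):
            if word not in itos:
                itos.append(word)
    return itos

counter = Counter({"the": 10, "cat": 1})
vecs = {"the", "cat", "dog"}
print(build_itos(counter, min_freq=2, vectors_vocab=vecs,
                 keep_rare_with_vectors=True, add_vectors_vocab=True))
```

With both flags off and `min_freq=2`, only `"the"` survives; with both on, `"cat"` is kept because it has a vector and `"dog"` is added from the vectors vocab.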
@mtreviso could you submit a PR and fix the issue? Thanks.
I am new to PyTorch and NLP, and I have a question I ran into while building a model.
Since my training dataset is not very big, its vocab is relatively small (around 5,000 words). However, I want to handle user input that may fall outside this vocabulary.
The problem is that in the model I trained, the embedding layer's weight is based on the field's vectors, not on the whole pretrained word2vec embeddings, so I cannot modify it after training is done.
Is there a better approach for this? Thanks in advance!