Open tcqiuyu opened 6 years ago
Yeah, I believe there isn't a way to do this currently, other than 1. loading the `Vectors` object and then 2. changing the `stoi` and `itos` of the already-built vocab object (`Field.vocab`) to include all the words in the vectors file. This is something that will be changed in the near future, but there is no easy way to do it now.
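The two-step workaround above can be sketched with plain Python structures standing in for the torchtext objects (the names here are illustrative, not the actual torchtext API):

```python
# Sketch of the workaround: extend an already-built vocab's stoi/itos
# with every word that appears in the loaded vectors file.
# vector_words stands in for the vocabulary of a loaded Vectors object.
vector_words = ["the", "cat", "sat", "mat", "unseen"]

# itos/stoi stand in for the mappings on Field.vocab built from the dataset.
itos = ["<unk>", "<pad>", "the", "cat"]      # index -> string
stoi = {w: i for i, w in enumerate(itos)}    # string -> index

# Add each vectors-file word the dataset-built vocab is missing.
for w in vector_words:
    if w not in stoi:
        stoi[w] = len(itos)
        itos.append(w)

print(len(itos))  # vocab now covers all vector words
```

After this, the vocab indexes every word in the vectors file, so no pretrained embedding is dropped at lookup time.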
Thanks for answering.
For now, can I add all the words from the pretrained embedding dict to a Counter and then pass it to construct a Vocab? Would that give me a complete word-vector vocabulary?
Yeah, that actually also sounds like a good option. You can create your own `Vocab` instance from a `Counter` and assign it to a `Field` manually.
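A minimal sketch of the Counter-based approach, assuming a hypothetical `pretrained_embeddings` dict of `{word: vector}` (only the `Counter` part is runnable here; the torchtext calls are shown as comments):

```python
from collections import Counter

# Hypothetical pretrained embedding dict; in practice this would be the
# full word2vec/GloVe vocabulary loaded from disk.
pretrained_embeddings = {"the": [0.1], "cat": [0.2], "sat": [0.3]}

# Count every word once so the resulting Vocab covers the full vector vocab.
counter = Counter(pretrained_embeddings.keys())

# With torchtext this counter would then be used roughly as:
#   vocab = torchtext.vocab.Vocab(counter, vectors=my_vectors)
#   text_field.vocab = vocab
# Here we just verify the counter covers every embedding word.
print(sorted(counter))
```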
Cool, thanks a lot!
I'm writing a strategy for this. I've been thinking about adding these two params to `Vocab`'s constructor: `keep_rare_with_vectors` and `add_vectors_vocab` (both `False` by default).

- `keep_rare_with_vectors`: if `True` and vectors are passed, words that appear fewer than `min_freq` times but are in the vectors vocab are kept.
- `add_vectors_vocab`: if `True` and vectors are passed, words that are not in the datasets but are in the vectors vocab are added. This is going to create a very large vocab, though. So if `max_size` is also passed, should we add all the vectors' words and display a warning saying that the vocabulary will be larger than `max_size`?

@mttk, what do you think?
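The proposed filtering logic could look roughly like this (a sketch of the idea only; `keep_rare_with_vectors` and `add_vectors_vocab` are the proposed parameters, not an existing torchtext API):

```python
from collections import Counter

def build_itos(counter, min_freq=1, vectors_vocab=None,
               keep_rare_with_vectors=False, add_vectors_vocab=False):
    """Sketch of the proposed Vocab filtering (hypothetical parameters)."""
    vectors_vocab = vectors_vocab or set()
    itos = []
    for word, freq in sorted(counter.items()):
        keep = freq >= min_freq
        # keep_rare_with_vectors: retain rare words that have a vector.
        if keep_rare_with_vectors and word in vectors_vocab:
            keep = True
        if keep:
            itos.append(word)
    if add_vectors_vocab:
        # add_vectors_vocab: include vector words absent from the datasets.
        for word in sorted(vectors_vocab):
            if word not in itos:
                itos.append(word)
    return itos

counter = Counter({"the": 10, "cat": 1})
vecs = {"the", "cat", "dog"}
print(build_itos(counter, min_freq=2, vectors_vocab=vecs,
                 keep_rare_with_vectors=True, add_vectors_vocab=True))
```

With both flags off and `min_freq=2`, only `"the"` survives; with both on, `"cat"` is kept because it has a vector and `"dog"` is added from the vectors vocab.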
@mtreviso could you submit a PR and fix the issue? Thanks.
I am new to PyTorch and NLP, and I have a question I ran into while building a model.
Since my training dataset is not very big, its vocab is relatively small (around 5,000 words). However, I want to handle user input that may fall outside this vocabulary.
The problem is that in the model I trained, the embedding layer's weight is based on the field's vectors, not on the whole pretrained word2vec embeddings, so I cannot modify it after training is done.
Is there a better approach for this? Thanks in advance!