jhlau / doc2vec

Python scripts for training/testing paragraph vectors
Apache License 2.0

How does pretrained_emb parameter work? #30

Closed: JumpyPizza closed this issue 5 years ago

JumpyPizza commented 5 years ago

In the newest version of gensim (3.8.0), I was surprised to find that the "pretrained_emb=" parameter worked without error, but when I read the source code I couldn't find anything related to this parameter.

My question is: do the pretrained embeddings work like a lookup table? When a document is trained, are the words that appear both in the document and in pretrained_emb initialised with the pretrained vectors, while words not in pretrained_emb are initialised randomly (correct me if I'm wrong)?

If so, the pretrained embeddings and the randomly initialised ones just train together to convergence. That would definitely converge much faster, but I wonder whether, in theory, the result is better than when all word embeddings are initialised randomly from scratch (I've read your paper and it seems the answer is yes, the results do get better in practice).

Another question: if my pretrained_emb is large enough that it covers most of the vocabulary of the documents I want to train on, can I just use the pretrained embeddings for inference? E.g. extract the words that are in the pretrained embeddings to represent the document, lock those vectors, and only train the doc id to get the doc id embedding?

Thanks for your work! I would really appreciate it if you could answer.

jhlau commented 5 years ago

In the newest version of gensim (3.8.0), I was surprised to find that the "pretrained_emb=" parameter worked without error, but when I read the source code I couldn't find anything related to this parameter.

I can't comment much about the latest gensim, but I added pretrained_emb to my forked version of (fairly old) gensim, and you can see how it works here: https://github.com/jhlau/gensim/blob/develop/gensim/models/word2vec.py#L1021
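For context, a minimal training sketch with that forked gensim might look like this (the keyword arguments follow the old gensim API the fork is based on, e.g. size and iter, and all file names here are placeholders):

```python
# Sketch: training doc2vec with pretrained word embeddings using the forked gensim.
# Assumes the fork (github.com/jhlau/gensim) is installed and that pretrained.txt is a
# word2vec-format text file of word vectors; file names and hyperparameters are illustrative.
from gensim.models.doc2vec import Doc2Vec, TaggedLineDocument

docs = TaggedLineDocument("train_docs.txt")   # one tokenised document per line
model = Doc2Vec(
    docs,
    size=300,                          # embedding dimension (old-API name; vector_size in newer gensim)
    window=15,
    min_count=5,
    dm=0,                              # dbow mode
    dbow_words=1,                      # also train word vectors alongside doc vectors
    pretrained_emb="pretrained.txt",   # the fork's extra argument: path to pretrained word vectors
    iter=20,                           # old-API name for number of training epochs
)
model.save("model.bin")
```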

The following answers are based on my implementation of pretrained_emb.

My question is: do the pretrained embeddings work like a lookup table? When a document is trained, are the words that appear both in the document and in pretrained_emb initialised with the pretrained vectors, while words not in pretrained_emb are initialised randomly (correct me if I'm wrong)?

The words are initialised with the pretrained embeddings, and if they are not found in pretrained_emb, then yes, they are initialised randomly. Document embeddings are initialised randomly.
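As a rough illustration of that lookup-table behaviour (a conceptual sketch, not the fork's actual code; the pretrained file path and dimension are assumptions):

```python
# Conceptual sketch: words found in the pretrained embeddings copy those vectors,
# everything else keeps its random initialisation.
import numpy as np
from gensim.models import KeyedVectors

def init_word_vectors(vocab, dim, pretrained_path):
    # dim should match the dimension of the vectors in the pretrained file
    pretrained = KeyedVectors.load_word2vec_format(pretrained_path)  # acts as the lookup table
    rng = np.random.default_rng(0)
    # word2vec-style small random initialisation for every word
    vectors = rng.uniform(-0.5 / dim, 0.5 / dim, size=(len(vocab), dim))
    for i, word in enumerate(vocab):
        if word in pretrained:
            vectors[i] = pretrained[word]  # overwrite with the pretrained vector
    return vectors
```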

If so, the pretrained embeddings and the randomly initialised ones just train together to convergence. That would definitely converge much faster, but I wonder whether, in theory, the result is better than when all word embeddings are initialised randomly from scratch (I've read your paper and it seems the answer is yes, the results do get better in practice).

Yeah, it'll converge faster, and it should give better performance (since it starts from a reasonable state instead of a random one).

Another question: if my pretrained_emb is large enough that it covers most of the vocabulary of the documents I want to train on, can I just use the pretrained embeddings for inference? E.g. extract the words that are in the pretrained embeddings to represent the document, lock those vectors, and only train the doc id to get the doc id embedding?

Yeap, you can certainly do that.
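A minimal sketch of that inference setup (assuming a trained Doc2Vec model saved as model.bin; in stock gensim, infer_vector already holds the learned word weights fixed and only trains the new document vector):

```python
# Sketch: inferring a vector for an unseen document while the learned word weights stay fixed.
from gensim.models.doc2vec import Doc2Vec
from gensim.utils import simple_preprocess

model = Doc2Vec.load("model.bin")                               # previously trained model (path is a placeholder)
doc_words = simple_preprocess("some unseen document to embed")  # tokenise the new document
doc_vec = model.infer_vector(doc_words)                         # only the new doc embedding is optimised here
```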

JumpyPizza commented 5 years ago

Cheers! Thank you for your quick reply!

maohbao commented 4 years ago

Hi JumpyPizza, are you sure gensim 3.8 supports the "pretrained_emb=" parameter? When I change "pretrained_emb" to any other name the code still runs, so it seems that unsupported parameters are silently ignored by Doc2Vec.

I also want to use pretrained word2vec vectors when training a doc2vec model in gensim 3.8, but I don't know how to do it either. Do you have any suggestions now?

JumpyPizza commented 4 years ago

Hi JumpyPizza, are you sure gensim 3.8 supports the "pretrained_emb=" parameter? When I change "pretrained_emb" to any other name the code still runs, so it seems that unsupported parameters are silently ignored by Doc2Vec.

Yeah, it does seem to be ignored: I didn't see any source code related to it, and according to the gensim authors it isn't supported in Doc2Vec.
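One quick way to check is to look for the parameter in the installed gensim source, e.g. with inspect (a sketch; it only confirms whether pretrained_emb appears anywhere in the relevant modules):

```python
# Sketch: check whether the installed gensim actually references pretrained_emb,
# or whether the keyword is just silently swallowed via **kwargs.
import inspect
import gensim.models.doc2vec as d2v
import gensim.models.word2vec as w2v

for module in (d2v, w2v):
    source = inspect.getsource(module)
    print(module.__name__, "mentions pretrained_emb:", "pretrained_emb" in source)
# If the observation above holds, stock gensim 3.8 prints False for both,
# while jhlau's fork references it in word2vec.py.
```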

I also want to use pretrained word2vec vectors when training a doc2vec model in gensim 3.8, but I don't know how to do it either. Do you have any suggestions now?

You could switch to jhlau's forked version of gensim/doc2vec, but it's based on an older gensim, so you may need to rewrite some of your code, or you'll get errors about unsupported parameters and the like.