Closed: CodeMonkey-GH closed this issue 7 years ago
If you're loading doc2vec, you should load using: gensim.models.Doc2Vec.load(model)
Example code: https://github.com/jhlau/doc2vec/blob/master/infer_test.py
It works. Thank you for your answer and your pretrained_model.
Hi, I tried to test your Doc2Vec using pretrained word embeddings as below:
import logging
import gensim.models as g

pretrained_emb = "/doc2vec-master/toy_data/pretrained_word_embeddings.txt"
saved_path = "/doc2vec-master/toy_data/model.bin"
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)
docs = g.doc2vec.TaggedLineDocument(pretrained_emb)
model = g.Doc2Vec(docs, size=vector_size, window=window_size, min_count=min_count, sample=sampling_threshold, workers=worker_count, hs=0, dm=dm, negative=negative_size, dbow_words=1, dm_concat=1, pretrained_emb=pretrained_emb, iter=train_epoch)
model.save(saved_path)
I got error messages as below:
TypeError Traceback (most recent call last)
Duplicate issue: https://github.com/jhlau/doc2vec/issues/2
Is the enwiki_dbow doc2vec pretrained model trained on pre-processed data? I mean stemming, stop-word removal, or any other kind of pre-processing. Thanks for your help.
The paper describes all the preprocessing we did:
We experiment with two external corpora: (1) WIKI, the full collection of English Wikipedia; and (2) AP-NEWS, a collection of Associated Press English news articles from 2009 to 2015. We tokenise and lowercase the documents using Stanford CoreNLP (Manning et al., 2014), and treat each natural paragraph of an article as a document for doc2vec. After pre-processing, we have approximately 35M documents and 2B tokens for WIKI, and 25M documents and 0.9B tokens for AP-NEWS. Seeing that dbow trains faster and is a better model than dmpv from Section 3, we experiment with only dbow here.
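So the pipeline was tokenisation and lowercasing only, with no stemming or stop-word removal. The paper used Stanford CoreNLP for tokenisation; the regex tokeniser below is just a rough stand-in to illustrate the kind of lowercase-token output that step produces:

```python
import re

def preprocess(paragraph):
    """Rough stand-in for the tokenise-and-lowercase step described above.

    The actual work used Stanford CoreNLP; this simple regex tokeniser
    only illustrates the output shape: lowercased word and punctuation
    tokens, with no stemming and no stop-word removal.
    """
    return re.findall(r"\w+|[^\w\s]", paragraph.lower())

print(preprocess("We tokenise and lowercase the documents."))
```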
Hi, when I use the enwiki_dbow doc2vec pretrained_model, I got a problem with loading the model. (I decompress the enwiki_dbow first)
model = gensim.models.Doc2Vec.load_word2vec_format('./enwiki_dbow/doc2vec.bin', binary=True)
UnicodeDecodeError: 'utf8' codec can't decode byte 0x80 in position 0: invalid start byte