jhlau / doc2vec

Python scripts for training/testing paragraph vectors
Apache License 2.0

About the doc2vec pretrained_model #1

Closed: CodeMonkey-GH closed this issue 7 years ago

CodeMonkey-GH commented 7 years ago

Hi, when I use the enwiki_dbow doc2vec pretrained model, I get an error when loading it (I decompressed enwiki_dbow first):

model = gensim.models.Doc2Vec.load_word2vec_format('./enwiki_dbow/doc2vec.bin', binary=True)

UnicodeDecodeError: 'utf8' codec can't decode byte 0x80 in position 0: invalid start byte

jhlau commented 7 years ago

If you're loading a doc2vec model, you should load it using: gensim.models.Doc2Vec.load(model)

Example code: https://github.com/jhlau/doc2vec/blob/master/infer_test.py
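For reference, a minimal loading sketch modelled on infer_test.py, assuming a pre-4.0 gensim (as used throughout this thread); the model path, the inference hyperparameters (alpha, steps), and the toy document are illustrative:

import gensim.models as g

model_path = "./enwiki_dbow/doc2vec.bin"     # decompressed pretrained model (assumed location)
test_doc = ["the", "quick", "brown", "fox"]  # pre-tokenised, lowercased words

# Doc2Vec.load() restores the full pickled model; load_word2vec_format()
# expects a raw word-vector file, hence the UnicodeDecodeError above.
m = g.Doc2Vec.load(model_path)
vec = m.infer_vector(test_doc, alpha=0.01, steps=1000)
print(vec[:5])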

CodeMonkey-GH commented 7 years ago

It works. Thank you for your answer and for the pretrained model.

learnercat commented 7 years ago

Hi, I tried to test your Doc2Vec using pretrained word embeddings as below:

pretrained_emb = "/doc2vec-master/toy_data/pretrained_word_embeddings.txt"
saved_path = "/doc2vec-master/toy_data/model.bin"
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)
docs = g.doc2vec.TaggedLineDocument(pretrained_emb)
model = g.Doc2Vec(docs, size=vector_size, window=window_size, min_count=min_count, sample=sampling_threshold, workers=worker_count, hs=0, dm=dm, negative=negative_size, dbow_words=1, dm_concat=1, pretrained_emb=pretrained_emb, iter=train_epoch)
model.save(saved_path)

I got error messages as below:

TypeError                                 Traceback (most recent call last)
<ipython-input> in <module>()
      2 #docs = g.doc2vec.TaggedLineDocument(train_corpus)
      3 docs = g.doc2vec.TaggedLineDocument(pretrained_emb)
----> 4 model = g.Doc2Vec(docs, size=vector_size, window=window_size, min_count=min_count, sample=sampling_threshold, workers=worker_count, hs=0, dm=dm, negative=negative_size, dbow_words=1, dm_concat=1, pretrained_emb=pretrained_emb, iter=train_epoch)
      5 #model = g.Doc2Vec(docs, size=vector_size, window=window_size, min_count=min_count, sample=sampling_threshold, workers=worker_count, hs=0, dm=dm, negative=negative_size, dbow_words=1, dm_concat=1, iter=train_epoch)
      6 #save model

/home/anaconda2/lib/python2.7/site-packages/gensim/models/doc2vec.pyc in __init__(self, documents, dm_mean, dm, dbow_words, dm_concat, dm_tag_count, docvecs, docvecs_mapfile, comment, trim_rule, **kwargs)
    605         super(Doc2Vec, self).__init__(
    606             sg=(1 + dm) % 2,
--> 607             null_word=dm_concat, **kwargs)
    608
    609         self.load = call_on_class_only

TypeError: __init__() got an unexpected keyword argument 'pretrained_emb'

jhlau commented 7 years ago

Duplicate issue: https://github.com/jhlau/doc2vec/issues/2
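For anyone hitting the same TypeError: the traceback shows that the installed (stock) gensim's Doc2Vec.__init__ has no pretrained_emb parameter; that keyword comes from the forked gensim this repo's README points to. Below is a minimal sketch of the equivalent call against stock pre-4.0 gensim, with pretrained_emb dropped and TaggedLineDocument pointed at an actual training corpus rather than the embeddings file; the corpus path and hyperparameter values are illustrative:

import logging
import gensim.models as g

logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

train_corpus = "/doc2vec-master/toy_data/train_docs.txt"  # hypothetical file: one tokenised document per line
docs = g.doc2vec.TaggedLineDocument(train_corpus)

# dm=0 selects dbow; dbow_words=1 additionally trains word vectors.
model = g.Doc2Vec(docs, size=300, window=15, min_count=5, sample=1e-5,
                  workers=4, hs=0, dm=0, negative=5, dbow_words=1, iter=20)
model.save("/doc2vec-master/toy_data/model.bin")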

forouq commented 7 years ago

Is the enwiki_dbow doc2vec pretrained model trained on pre-processed data? I mean stemming, stop-word removal, or any other kind of pre-processing? Thanks for your help.

jhlau commented 7 years ago

The paper describes all the preprocessing we did:

We experiment with two external corpora: (1) WIKI, the full collection of English Wikipedia; and (2) AP-NEWS, a collection of Associated Press English news articles from 2009 to 2015. We tokenise and lowercase the documents using Stanford CoreNLP (Manning et al., 2014), and treat each natural paragraph of an article as a document for doc2vec. After pre-processing, we have approximately 35M documents and 2B tokens for WIKI, and 25M documents and 0.9B tokens for AP-NEWS. Seeing that dbow trains faster and is a better model than dmpv from Section 3, we experiment with only dbow here.
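If you want inferred vectors for new documents to match the training data, a rough preprocessing sketch in the same spirit: lowercase and tokenise only, with no stemming or stop-word removal. The paper used Stanford CoreNLP; the regex split below is a crude stand-in for illustration:

import re

def preprocess(text):
    # lowercase, then split on non-word characters (a crude approximation
    # of CoreNLP tokenisation; no stemming, no stop-word removal)
    return [tok for tok in re.split(r"\W+", text.lower()) if tok]

print(preprocess("The quick brown fox jumped over the lazy dog."))
# -> ['the', 'quick', 'brown', 'fox', 'jumped', 'over', 'the', 'lazy', 'dog']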