I just saw that the gensim version provided by Arch is pretty old. I am going to test with the newest version from pip and report back if the problem persists. Please ignore this for now.
Sorry about that. I should have checked first.
Description
I trained a Doc2Vec model from scratch and directly after training I am getting reasonable results. After saving the model and loading it again, I get completely different and more or less random results. Am I doing something wrong? Is this a bug? ...
Steps/Code/Corpus to Reproduce
from gensim.corpora.wikicorpus import WikiCorpus
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from pprint import pprint
import multiprocessing

wiki = WikiCorpus("/data/enwiki-latest-pages-articles.xml.bz2")


class TaggedWikiDocument(object):
    """Stream Wikipedia articles as TaggedDocuments tagged with their title."""

    def __init__(self, wiki):
        self.wiki = wiki
        self.wiki.metadata = True  # get_texts() then also yields (page_id, title)

    def __iter__(self):
        for content, (page_id, title) in self.wiki.get_texts():
            yield TaggedDocument(content, [title])


documents = TaggedWikiDocument(wiki)
cores = multiprocessing.cpu_count()

model = Doc2Vec(dm=1, dm_mean=1, size=200, window=8, min_count=10, iter=10, workers=cores)
model.build_vocab(documents)
model.train(documents, total_examples=model.corpus_count, epochs=model.iter)

pprint(model.docvecs.most_similar(positive=["Machine learning"], topn=20))
# Here I get good results

model.save("./doc2vecmodel.mod")
model = Doc2Vec.load("./doc2vecmodel.mod")

pprint(model.docvecs.most_similar(positive=["Machine learning"], topn=20))
# Here I get more or less random results
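To make the before/after comparison less dependent on eyeballing two printed lists, a check along these lines might help. It is only a sketch: the helper names `top_tags` and `neighbours_survive_reload` are made up for illustration, and the code assumes nothing beyond the calls already used in the script above (`save`, `load`, and `docvecs.most_similar`).

```python
def top_tags(model, tag, topn=20):
    """Return only the tags of the nearest documents, in rank order."""
    results = model.docvecs.most_similar(positive=[tag], topn=topn)
    return [neighbour for neighbour, _score in results]


def neighbours_survive_reload(model, path, tag, topn=20):
    """Save the model, load it back, and compare the neighbour lists.

    Hypothetical helper: assumes the model exposes save(path), a
    load(path) classmethod and docvecs.most_similar(...), exactly as
    used in the reproduction script.
    """
    before = top_tags(model, tag, topn)
    model.save(path)
    reloaded = type(model).load(path)
    after = top_tags(reloaded, tag, topn)
    return before == after, before, after
```

With the script above this would be called as `neighbours_survive_reload(model, "./doc2vecmodel.mod", "Machine learning")`. Only the tags are compared, not the similarity scores, so small floating-point differences would not count as a mismatch.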
Expected Results (before save and load)
Actual Results (after save and load)
Versions
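Since the packaged gensim version turned out to matter here, a snippet like the following could be used to collect the version details for this section. It is a minimal sketch using only the standard library (Python 3.8+ for `importlib.metadata`); the helper names `installed_version` and `version_report` and the default package list are assumptions chosen for illustration, not part of gensim.

```python
import platform
import sys
from importlib import metadata


def installed_version(package):
    """Return the installed version string of a package, or None if absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def version_report(packages=("gensim", "numpy", "scipy")):
    """Collect platform, Python and package versions as printable lines."""
    lines = [platform.platform(), "Python " + sys.version.split()[0]]
    for name in packages:
        version = installed_version(name) or "not installed"
        lines.append("%s %s" % (name, version))
    return lines


if __name__ == "__main__":
    print("\n".join(version_report()))
```

Running it once with the Arch package and once inside an environment with the pip-installed gensim should show whether the two setups really differ in version.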