piskvorky / gensim

Topic Modelling for Humans
https://radimrehurek.com/gensim
GNU Lesser General Public License v2.1

Saving and then loading corrupts Doc2Vec model #2117

Closed: Paethon closed this issue 6 years ago

Paethon commented 6 years ago

I just saw that the gensim version provided by Arch is pretty old. I am going to test with the newest version from pip and report back if the problem persists. Ignore this for now ... Sorry about that, I should have checked first ...

Description

I trained a Doc2Vec model from scratch, and directly after training I get reasonable results. After saving the model and loading it again, I get completely different, more or less random results. Am I doing something wrong? Is this a bug? ...

Steps/Code/Corpus to Reproduce

from gensim.corpora.wikicorpus import WikiCorpus
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from pprint import pprint
import multiprocessing

wiki = WikiCorpus("/data/enwiki-latest-pages-articles.xml.bz2")

class TaggedWikiDocument(object):
    def __init__(self, wiki):
        self.wiki = wiki
        self.wiki.metadata = True
    def __iter__(self):
        for content, (page_id, title) in self.wiki.get_texts():
            yield TaggedDocument(content, [title])

documents = TaggedWikiDocument(wiki)

cores = multiprocessing.cpu_count()
model = Doc2Vec(dm=1, dm_mean=1, size=200, window=8, min_count=10, iter=10, workers=cores)

model.build_vocab(documents)
model.train(documents, total_examples=model.corpus_count, epochs=model.iter)

pprint(model.docvecs.most_similar(positive=["Machine learning"], topn=20))
# Here I get good results

model.save("./doc2vecmodel.mod")
model = Doc2Vec.load("./doc2vecmodel.mod")
pprint(model.docvecs.most_similar(positive=["Machine learning"], topn=20))
# Here I get more or less random results

Expected Results (before save and load)

[('Multi-task learning', 0.7568818926811218),
 ('Pattern recognition', 0.749046802520752),
 ('Statistical classification', 0.7360684871673584),
 ('Linear classifier', 0.7219105362892151),
 ('Prior knowledge for pattern recognition', 0.7102837562561035),
 ('Supervised learning', 0.7076001167297363),
 ('Naive Bayes classifier', 0.7071056365966797),
 ('Support vector machine', 0.7007730603218079),
 ('Statistical learning theory', 0.6968303322792053),
 ('Feature selection', 0.6903674602508545),
 ('Regularization (mathematics)', 0.6844260692596436),
 ('Meta learning (computer science)', 0.6837587952613831),
 ('Early stopping', 0.6815727353096008),
 ('Similarity learning', 0.6798273324966431),
 ('Predictive analytics', 0.6749750375747681),
 ('Artificial neural network', 0.6720495223999023),
 ('Empirical risk minimization', 0.6702870726585388),
 ('Structured prediction', 0.6696684956550598),
 ('Perceptron', 0.6685298085212708),
 ('Boosting (machine learning)', 0.6679200530052185)]

Actual Results (after save and load)

[('Mossant', 0.3701874911785126),
 ('Filatima fuliginea', 0.3571828305721283),
 ('Heinrich Barth', 0.33388739824295044),
 ('Caleb Suri', 0.33307966589927673),
 ('Wriddhiman Saha', 0.33189573884010315),
 ('Priyaa Lal', 0.3283548057079315),
 ("2nd Queen Victoria's Own Rajput Light Infantry", 0.3266987204551697),
 ('United States presidential election in Missouri, 1988', 0.32637813687324524),
 ('Jim Payne (golfer)', 0.32357922196388245),
 ('Reflexe', 0.3232496678829193),
 ('Fobbing Marsh', 0.3231257498264313),
 ('Street Fighter 2010: The Final Fight', 0.32258525490760803),
 ('Saint-Thibault-des-Vignes', 0.3203728199005127),
 ('Frederick, Count of Verdun', 0.3182566165924072),
 ('Final Justice (1997 film)', 0.31800541281700134),
 ('List of national universities in South Korea', 0.3160543441772461),
 ('Lake Lafayette', 0.31462588906288147),
 ('Secrets of the Muse', 0.31268560886383057),
 ('Baihe Subdistrict', 0.31175172328948975),
 ('Robert III de Sablé', 0.3113846480846405)]
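
The difference between the two lists above can also be checked programmatically instead of by eye. The following is a minimal sketch, not part of gensim: `neighbor_overlap` is a hypothetical helper name, and it assumes only that both inputs are lists of (tag, similarity) pairs in the shape printed above. The sample data is an abbreviated copy of the first three entries of each list.

```python
def neighbor_overlap(before, after):
    """Return the fraction of tags in `before` that also appear in `after`.

    Both arguments are lists of (tag, similarity) pairs, the shape of the
    result lists shown in this report. Hypothetical helper, not a gensim API.
    """
    before_tags = {tag for tag, _ in before}
    after_tags = {tag for tag, _ in after}
    if not before_tags:
        return 0.0
    return len(before_tags & after_tags) / len(before_tags)


# Abbreviated samples copied from the two result lists in this report.
before = [
    ("Multi-task learning", 0.7568818926811218),
    ("Pattern recognition", 0.749046802520752),
    ("Statistical classification", 0.7360684871673584),
]
after = [
    ("Mossant", 0.3701874911785126),
    ("Filatima fuliginea", 0.3571828305721283),
    ("Heinrich Barth", 0.33388739824295044),
]

print(neighbor_overlap(before, after))   # the two samples share no tags
print(neighbor_overlap(before, before))  # a list compared with itself
```

An overlap near 1.0 between the results computed before saving and after loading would suggest the round trip preserved the document vectors; an overlap near 0.0, as with the samples here, points to the kind of corruption described in this report.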

Versions

Linux-4.17.3-1-ARCH-x86_64-with-arch-Arch-Linux
Python 3.6.5 (default, May 11 2018, 04:00:52) 
[GCC 8.1.0]
NumPy 1.14.5
SciPy 1.1.0
gensim 2.3.0
FAST_VERSION 1

Paethon commented 6 years ago

I just tried the newest version of gensim, and saving and loading works as expected now!
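
Since the cause here was an outdated distribution package (gensim 2.3.0, per the Versions section), it can help to check the installed version before filing a report. Below is a rough sketch using only the standard library. The helper names are made up for illustration, the `3.0.0` threshold is an arbitrary example and not the release that fixed this issue, and `importlib.metadata` needs Python 3.8 or newer (the report used Python 3.6.5, where this module is not available).

```python
import re
from importlib import metadata


def parse_version(text):
    """Turn a dotted version such as '2.3.0' into a tuple of ints.

    Parsing stops at the first component with a non-numeric suffix,
    so a pre-release tag like 'rc1' is ignored.
    """
    parts = []
    for piece in text.split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break
        parts.append(int(match.group()))
        if match.end() != len(piece):
            break  # stop at a suffix such as "rc1"
    return tuple(parts)


def installed_version(package):
    """Return the installed version string of `package`, or None if absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def is_older_than(version, minimum):
    """True if `version` sorts before `minimum` (simple numeric comparison)."""
    return parse_version(version) < parse_version(minimum)


found = installed_version("gensim")
if found is None:
    print("gensim is not installed in this environment")
else:
    print("installed gensim:", found)

# '3.0.0' is an illustrative threshold only, not a known fix version.
print(is_older_than("2.3.0", "3.0.0"))
```

This comparison is deliberately simple; for real pre-release or post-release ordering, a dedicated version parser would be the safer choice.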