piskvorky / gensim

Topic Modelling for Humans
https://radimrehurek.com/gensim
GNU Lesser General Public License v2.1

Doc2Vec Segmentation Fault Windows and Linux #1578

Closed mullenba closed 7 years ago

mullenba commented 7 years ago

I've tried this basic code on both Linux and Windows. I'm trying to do some online training, and after a couple of passes it throws a segmentation fault.

Code to reproduce the problem:

from gensim.models.doc2vec import Doc2Vec, LabeledSentence, TaggedDocument

sentences = [('food', 'I like to eat broccoli and bananas.'),
             ('food', 'I ate a banana and spinach smoothie for breakfast.'),
             ('animals', 'Chinchillas and kittens are cute.'),
             ('animals', 'My sister adopted a kitten yesterday.'),
             ('animals', 'Look at this cute hamster munching on a piece of broccoli.')]

# Wrap each (tag, text) pair as a tagged document, tokenizing on whitespace.
convSentences = []
for s in sentences:
    convSentences.append(LabeledSentence(tags=[s[0]], words=s[1].split()))

model = Doc2Vec(size=300, window=8, min_count=1, workers=1)

print("Pass 1:")
model.build_vocab([convSentences[0]])
model.train([convSentences[0]], total_examples=model.corpus_count)

print("Pass 2:")
model.build_vocab([convSentences[1]], update=True)
model.train([convSentences[1]], total_examples=model.corpus_count)

print("Pass 3:")
model.build_vocab([convSentences[2]], update=True)
model.train([convSentences[2]], total_examples=model.corpus_count)

print("Pass 4:")
model.build_vocab([convSentences[3]], update=True)
model.train([convSentences[3]], total_examples=model.corpus_count)

print("Pass 5:")
model.build_vocab([convSentences[4]], update=True)
model.train([convSentences[4]], total_examples=model.corpus_count)

Here's the output when running in IDLE on Windows with Python 3.5.2:

Warning (from warnings module):
  File "C:\Python35\lib\site-packages\gensim\utils.py", line 855
    warnings.warn("detected Windows; aliasing chunkize to chunkize_serial")
UserWarning: detected Windows; aliasing chunkize to chunkize_serial
Pass 1:
Pass 2:
Pass 3:

Passes 1-3 complete quickly, then there is a long pause, after which Linux throws a segmentation fault and Windows throws an unspecified error.

gojomo commented 7 years ago

Duplicate of #1019 – but this is a very useful minimal triggering case, thank you! I'll be closing this as a duplicate, for further discussion to occur there.

FYI, the build_vocab(..., update=True) vocabulary-expansion feature was only developed & tested with respect to Word2Vec – hence this sort of bug when it is used via inheritance in Doc2Vec.
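
Since update=True was only developed and tested with Word2Vec, one possible way to sidestep it (an assumption on my part, not something confirmed in this thread) is to gather every document up front, so that all words and tags are known before any training starts. Below is a minimal sketch of that preparation step only. It uses a stand-in namedtuple with the same two fields as gensim's TaggedDocument so it runs without gensim; TaggedDoc, to_tagged_docs and survey_corpus are hypothetical names, not gensim API.

```python
from collections import Counter, namedtuple

# Stand-in with the same two fields as gensim's TaggedDocument (words, tags),
# so this sketch runs without gensim installed. Hypothetical name.
TaggedDoc = namedtuple("TaggedDoc", "words tags")


def to_tagged_docs(pairs):
    """Convert (tag, text) pairs into tagged documents, tokenizing on whitespace."""
    return [TaggedDoc(words=text.split(), tags=[tag]) for tag, text in pairs]


def survey_corpus(docs):
    """Count every word and every tag in a single pass over the full corpus."""
    word_counts = Counter(w for d in docs for w in d.words)
    tag_counts = Counter(t for d in docs for t in d.tags)
    return word_counts, tag_counts


sentences = [('food', 'I like to eat broccoli and bananas.'),
             ('food', 'I ate a banana and spinach smoothie for breakfast.'),
             ('animals', 'Chinchillas and kittens are cute.'),
             ('animals', 'My sister adopted a kitten yesterday.'),
             ('animals', 'Look at this cute hamster munching on a piece of broccoli.')]

# Collect all documents first instead of feeding them in one at a time.
docs = to_tagged_docs(sentences)
word_counts, tag_counts = survey_corpus(docs)

print(len(docs), "documents,", len(word_counts), "distinct tokens")
print(sorted(tag_counts.items()))
```

With gensim installed, the equivalent list of real TaggedDocument objects could presumably be passed to a single build_vocab call without update=True, followed by training over the whole list. Whether that fits a genuinely streaming use case is a separate question.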