piskvorky / gensim

Topic Modelling for Humans
https://radimrehurek.com/gensim
GNU Lesser General Public License v2.1

Segmentation fault using build_vocab(..., update=True) for Doc2Vec #1019

Open · danoneata opened this issue 7 years ago

danoneata commented 7 years ago

Hello!

I'm performing online learning for Doc2Vec, that is, I learn an initial model on a set of tagged documents and then try to update the model on a new set of tagged documents. If the second set contains new tags (tags that were not present in the initial set of documents), I usually get a segmentation fault (the behavior is not deterministic, but it happens most of the time).

Below is a toy example that reproduces the issue; here is the output of that code. I'm using Python 3.4.3 and Gensim 0.13.3.

I've debugged with gdb and I've got the following output:

Program received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7fff9a4f8700 (LWP 29422)]
__pyx_f_6gensim_6models_13doc2vec_inner_fast_document_dm_hs (__pyx_v_learn_hidden=1, __pyx_v_size=300, __pyx_v_work=0x7fff80001480, __pyx_v_alpha=0.0250000004, __pyx_v_syn1=0x1693ce0, __pyx_v_neu1=0x7fff80001a00, __pyx_v_word_code_len=6,
    __pyx_v_word_code=<optimized out>, __pyx_v_word_point=0x13fe410) at ./gensim/models/doc2vec_inner.c:2078

I'm willing to help fix this issue if someone can provide some guidance. Thanks!

Sample code that reproduces the issue:

import logging

from gensim.models.doc2vec import (
    Doc2Vec,
    TaggedDocument,
)

logging.basicConfig(
    format='%(asctime)s : %(threadName)s : %(levelname)s : %(message)s',
    level=logging.DEBUG,
)

def to_str(d):
    return ", ".join(d.keys())

SENTS = [
    "anecdotal using a personal experience or an isolated example instead of a sound argument or compelling evidence",
    "plausible thinking that just because something is plausible means that it is true",
    "occam razor is used as a heuristic technique discovery tool to guide scientists in the development of theoretical models rather than as an arbiter between published models",
    "karl popper argues that a preference for simple theories need not appeal to practical or aesthetic considerations",
    "the successful prediction of a stock future price could yield significant profit",
]

SENTS = [s.split() for s in SENTS]

def main():
    sentences_1 = [
        TaggedDocument(SENTS[0], tags=['SENT_0']),
        TaggedDocument(SENTS[1], tags=['SENT_0']),
        TaggedDocument(SENTS[2], tags=['SENT_1']),
    ]

    sentences_2 = [
        TaggedDocument(SENTS[3], tags=['SENT_1']),
        TaggedDocument(SENTS[4], tags=['SENT_2']),
    ]

    model = Doc2Vec(min_count=1, workers=1)

    model.build_vocab(sentences_1)
    model.train(sentences_1)

    print("-- Base model")
    print("Vocabulary:", to_str(model.vocab))
    print("Tags:", to_str(model.docvecs.doctags))

    model.build_vocab(sentences_2, update=True)
    model.train(sentences_2)

    print("-- Updated model")
    print("Vocabulary:", to_str(model.vocab))
    print("Tags:", to_str(model.docvecs.doctags))

if __name__ == '__main__':
    main()

tmylk commented 7 years ago

Vocab expansion for doc2vec is not supported yet so labelled this as a new feature.

korostelevm commented 7 years ago

I ran into this also. I took a look at how online vocabulary updating works for word2vec and tried to replicate the update for doc2vec's doctags.

It seems to work: I can train the model with a few examples, then load it, train it some more, and it will return the new doctags and vocabulary in the similarity functions. When storing the updated model I do have to give it a different filename, otherwise the segmentation fault still happens. But the weights look like they get updated too. Here are my edits to the original doc2vec.py.

In the DocvecsArray class:

I added a function that records doctags seen during new training in a new property, self.new_doctags = {}:

def note_newdoctag(self, key, document_no, document_length, model):
    if isinstance(key, int):
        self.max_rawint = max(self.max_rawint, key)
    else:
        if key in self.doctags:
            self.doctags[key] = self.doctags[key].repeat(document_length)
        else:
            self.doctags[key] = Doctag(len(self.offset2doctag), document_length, 1)
            self.new_doctags[key] = Doctag(len(self.offset2doctag), document_length, 1)
            self.offset2doctag.append(key)

    self.new_count = self.max_rawint + 1 + len(self.offset2doctag)

Also an update weights function:

def update_weights(self, model):
    gained_tags = len(self.doctags) - len(self.doctag_syn0)  # number of new tags
    newsyn0 = empty((gained_tags, model.vector_size), dtype=REAL)

    # randomize vectors for the newly added tags
    for i in xrange(len(self.doctag_syn0), len(self.doctags)):
        # construct deterministic seed from offset AND seed argument
        newsyn0[i - len(self.doctag_syn0)] = model.seeded_vector(i + model.seed)
    self.doctag_syn0 = vstack([self.doctag_syn0, newsyn0])
    self.doctag_syn0_lockf = ones(len(self.doctags), dtype=REAL)  # zeros suppress learning

In the Doc2Vec class: in scan_vocab, call note_newdoctag when build_vocab is called with update=True:

for document_no, document in enumerate(documents):
    ...
    if not update:
        for tag in document.tags:
            self.docvecs.note_doctag(tag, document_no, document_length, self)
    else:
        for tag in document.tags:
            self.docvecs.note_newdoctag(tag, document_no, document_length, self)
    ...

When finalize_vocab is called in the superclass it doesn't run my new update_weights in DocvecsArray, so I copied finalize_vocab into Doc2Vec and added

self.docvecs.update_weights(self)

at the end of it.
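
In sketch form, that override amounts to this (assuming a gensim version of that era where Word2Vec.finalize_vocab accepts an update argument, as the online-word2vec code does; copying the method body has the same effect as the super call):

from gensim.models.word2vec import Word2Vec

class Doc2Vec(Word2Vec):
    def finalize_vocab(self, update=False):
        super(Doc2Vec, self).finalize_vocab(update=update)  # usual weight setup
        self.docvecs.update_weights(self)  # also grow the doctag arrays for new tags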

Here is a link to the full file: https://gist.github.com/korostelevm/d48c80f296516deef045e5aa5dca1518. I just import doc2vec_online as doc2vec instead of from gensim.models import doc2vec.

Disclaimer: I may not know what I'm doing at all, which is why I'm posting here for someone to hopefully verify.

gojomo commented 7 years ago

As @tmylk notes, the existing vocab-expansion feature (build_vocab(..., update=True)) wasn't yet designed/tested for Doc2Vec use – so it might work (because of the significant code overlap), or fail in either subtle or extreme ways (like a SegFault)... it's an unknown.

Even when it's not SegFaulting, there may still be silent corruption – just no memory accesses bad enough to trigger the fault.

Perhaps something in the Doc2Vec paths is still using lengths/references to data that wasn't refreshed by the build_vocab(..., update=True) call?

korostelevm commented 7 years ago

That's what it seemed like to me. I forced the slow mode to debug it – at the top of doc2vec.py:

try:
    from gensim.models.doc2vec_inner import train_document_dbow, train_document_dm, train_document_dm_concat
    from gensim.models.word2vec_inner import FAST_VERSION  # blas-adaptation shared from word2vec
    logger.debug('Fast version of {0} is being used'.format(__name__))
    print asdf  # deliberate NameError: forces a fall-through into the except branch
# except ImportError:
except Exception:  # widened from ImportError so the NameError above is caught

Then I copied the train function over from word2vec and changed if FAST_VERSION < 0: so that it always runs the pure-Python threading path.

After this, instead of a segmentation fault I get this traceback:

  File "/Users/mike/Dropbox/lsp/recommender/doc2vec_original.py", line 771, in worker_loop
    tally, raw_tally = self._do_train_job(sentences, alpha, (work, neu1))
  File "/Users/mike/Dropbox/lsp/recommender/doc2vec_original.py", line 912, in _do_train_job
    doctag_vectors=doctag_vectors, doctag_locks=doctag_locks)
  File "/Users/mike/Dropbox/lsp/recommender/doc2vec_original.py", line 115, in train_document_dbow
    context_locks=doctag_locks)
  File "/usr/local/lib/python2.7/site-packages/gensim/models/word2vec.py", line 269, in train_sg_pair
    l1 = context_vectors[context_index]  # input word (NN input/projection layer)
IndexError: index 10 is out of bounds for axis 0 with size 3

I think this was telling me that index 10 of my doctags exceeds the 3 tags present in the first round of training. So I made the changes described above and they seemed to fix the issue. I put the fast-mode flags back and it still works.

ArkadiyD commented 7 years ago

I used ddd to debug the Cython code, and the segmentation fault appears at line 123 of doc2vec_inner.pyx: g = (1 - word_code[b] - f) * alpha. It turned out that the error comes from these lines:

if hs:
    codelens[i] = <int>len(predict_word.code)
    codes[i] = <np.uint8_t *>np.PyArray_DATA(predict_word.code)
    points[i] = <np.uint32_t *>np.PyArray_DATA(predict_word.point)

With the model's hs parameter set to 0 there are no errors (verified with ddd on both Python 2 and 3). So the proposed hotfix is to turn off hs mode when the model is updated.
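
A minimal sketch of that hotfix at model-construction time (note that a later comment in this thread reports crashes with negative != 0 as well, so treat this as a partial workaround at best):

from gensim.models.doc2vec import Doc2Vec

# avoid the hierarchical-softmax code path entirely
model = Doc2Vec(min_count=1, workers=1, hs=0, negative=5)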

tmylk commented 7 years ago

An appropriate hotfix would be to disable vocabulary expansion for doc2vec models, but a proper fix would be better

gojomo commented 7 years ago

Yes, and the proper fix will require figuring out why the model, post-vocab-update, is using some older or incorrect arrays or sizes, and thus making an improper/illegal memory access.

tmylk commented 7 years ago

Current status: only works for hs=0. Hotfix needed: disable for hs > 0.

wjgan7 commented 7 years ago

Looks like I'm still getting a segfault when hs=0. (Based on doc2vec.py:590, it looks like 0 is the default, though the docs say it's 1.)

def get_doc2vec():
    return Doc2Vec(size=200,
                   iter=1,
                   min_count=30,
                   workers=multiprocessing.cpu_count(),
                   dm=0)

def build_doc2vec(sentences, model=None, total_examples=None, i=0):
    tagged_documents = [TaggedDocument(d, [j]) for d, j in zip(sentences, range(i, i + len(sentences)))]
    if not model:
        model = get_doc2vec()
        model.build_vocab(tagged_documents)
    else:
        model.build_vocab(tagged_documents, update=True)
    model.train(tagged_documents, total_examples=model.corpus_count, epochs=model.iter)
    return (model, i + len(sentences))

Apologies if my code is unclear, but essentially I'm doing the same thing as others above. Any help would be much appreciated.

On a side note, I'm sure I'm using total_examples wrong: when I pass the cumulative total_examples count across all training calls, it complains that the expected count doesn't match the number of sentences in the current call.
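
If I read the docs right, total_examples on an incremental train() call should count only the examples passed in that call. A hedged sketch:

# assumes tagged_documents holds only the new batch being trained on now
model.build_vocab(tagged_documents, update=True)
model.train(tagged_documents,
            total_examples=len(tagged_documents),  # just this call's batch
            epochs=model.iter)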

rajivgrover009 commented 7 years ago

Is it useful to call the train() function repeatedly on a Doc2Vec model without adding new vocabulary? Will the model get better for new data?

gojomo commented 7 years ago

@rajivgrover009 Maybe. Whether it helps or hurts is probably dependent on your dataset, choice of parameters, and the relative contrast between your new texts and the earlier texts. The best-grounded course would be to mix new texts with old to make a new all-inclusive corpus, and continue training with that.
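
A minimal sketch of that approach (old_documents and new_documents are hypothetical names for the two batches of TaggedDocuments):

from gensim.models.doc2vec import Doc2Vec

combined = old_documents + new_documents  # all-inclusive corpus

model = Doc2Vec(min_count=1, workers=1)   # fresh model, so no vocab-update is needed
model.build_vocab(combined)
model.train(combined, total_examples=model.corpus_count, epochs=model.iter)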

gojomo commented 7 years ago

There's another report from @mullenba in #1578, which includes a minimal triggering case.

mino98 commented 6 years ago

I'm trying to look into this. Here is a status update...

Previously, @tmylk reported that doc2vec's vocabulary expansion works as long as hs=0. This isn't correct: it crashes if either negative != 0 (default: 5) or hs != 0 (default: 0). In other words, it is useless for all practical purposes.

To debug and iterate quickly, I used this workflow:

  1. change doc2vec_inner.c to doc2vec_inner.pyx at this line of the setup script, so that cythonize is invoked automatically every time the pyx file changes.
  2. build with CFLAGS='-Wall -O0 -g' python setup.py build, then install.
  3. run under gdb and trigger the crash using the minimal triggering case in #1578.

The coredump points at this line: apparently the index is out of bounds for EXP_TABLE, which causes the segfault.

The equivalent piece of code for word2vec is here. I've read that vocab expansion is supposed to work for word2vec, so I was planning to use that as a guide to check the differences.

Anyone wants to join me in this debugging adventure? 😄


ps: by the way, I deliberately ran the "slow" pure-python implementation of doc2vec to see if vocab expansion works there. Same problem: it crashes here because doctag_vectors is apparently not expanded correctly and doctag_indexes goes out of bounds.

gojomo commented 6 years ago

The pure-python path isn't actually core-dump 'crashing', is it? (I'd think it'd have to be a printed exception, instead.)

Note that segfault crashes are often caused by earlier memory-corruption, rather than the exact line where they're triggered.

mino98 commented 6 years ago

Note that segfault crashes are often caused by earlier memory-corruption, rather than the exact line where they're triggered.

Thanks, but in this case it seems that indeed the index is pointing outside of EXP_TABLE. I still have to trace it back, though.


The pure-python path isn't actually core-dump 'crashing', is it?

Right, it's not coredumping. As I said, it goes out of bounds when it reaches the first new doctag (i.e., "animals" at line 29 of this minimal code), as follows:

Traceback (most recent call last):
  File "/x/y/threading.py", line 916, in _bootstrap_inner
    self.run()
  File "/x/y/threading.py", line 864, in run
    self._target(*self._args, **self._kwargs)
  File "/x/y/site-packages/gensim-3.2.0-py3.6-linux-x86_64.egg/gensim/models/word2vec.py", line 992, in worker_loop
    tally, raw_tally = self._do_train_job(sentences, alpha, (work, neu1))
  File "/x/y/site-packages/gensim-3.2.0-py3.6-linux-x86_64.egg/gensim/models/doc2vec.py", line 752, in _do_train_job
    doctag_vectors=doctag_vectors, doctag_locks=doctag_locks
  File "/x/y/site-packages/gensim-3.2.0-py3.6-linux-x86_64.egg/gensim/models/doc2vec.py", line 162, in train_document_dm
    l1 = np_sum(word_vectors[word2_indexes], axis=0) + np_sum(doctag_vectors[doctag_indexes], axis=0)
IndexError: index 1 is out of bounds for axis 0 with size 1

Please note that I had to add the line model.neg_labels = zeros(6) in order for the "slow" version to work at all.
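
For context, gensim's word2vec.py prepares that array before pure-python training roughly like this (note the stock code also sets the first label to 1, which a plain zeros(6) omits):

from numpy import zeros

model.neg_labels = zeros(model.negative + 1)  # zeros(6) when negative=5
model.neg_labels[0] = 1.0                     # first slot labels the positive example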

mino98 commented 6 years ago

Pushed this fix for the "slow" version.

Regarding the cythonized version... I'd need more time (and help).

gojomo commented 6 years ago

Sure, but why would the index be out of the expected, functioning range? Often because of some (arbitrarily-)earlier memory-corruption.

menshikh-iv commented 6 years ago

@gojomo I received one more report of this problem; maybe we should raise an exception for this case (when update=True), because it happens again and again (at least until we fix the bug itself).
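
Something like this sketch, perhaps (written here as a user-side monkey-patch; the real guard would live inside gensim itself):

from gensim.models.doc2vec import Doc2Vec

_original_build_vocab = Doc2Vec.build_vocab

def _guarded_build_vocab(self, documents, update=False, **kwargs):
    if update:
        # fail fast instead of corrupting memory later (see this issue)
        raise NotImplementedError(
            "build_vocab(..., update=True) is not yet supported for Doc2Vec models")
    return _original_build_vocab(self, documents, update=update, **kwargs)

Doc2Vec.build_vocab = _guarded_build_vocab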

khulasaandh commented 6 years ago

Hi, any update on this issue?

I am able to train a doc2vec model with new documents on 32-bit Python (on 64-bit Python it still crashes), but I cannot query model.docvecs.most_similar(["XXX"]) for newly added documents: it shows an index-out-of-range error.

An online approach for doc2vec would be very helpful.

menshikh-iv commented 6 years ago

@khulasaandh as far as I know, you can call infer_vector for a new document and calculate the needed similarity values.

khulasaandh commented 6 years ago

Hi @menshikh-iv , thanks for the reply.

I am using the same example posted by @danoneata, but have added a few more documents/lines in sentences_1 and sentences_2. As you suggested, I am computing the inferred vector for the new document as shown below.

infer_vector = model.infer_vector(token_list)
print(model.docvecs.most_similar(positive=[infer_vector]))

It returns the most similar documents but gives nan values in place of the similarity coefficients: [('SENT_0', nan), ('SENT_1', nan), ('SENT_2', nan)]

Am i doing this wrong?

menshikh-iv commented 6 years ago

@khulasaandh looks really suspicious (your code is correct). Can you share the data (trained model & token_list) for reproducing this error?

gojomo commented 6 years ago

@khulasaandh @menshikh-iv A separate non-segfault anomaly with infer_vector() would be best diagnosed on the discussion list, or a new issue dedicated to that specific problem.

khulasaandh commented 6 years ago

Hi @menshikh-iv and @gojomo, even on the 32-bit Python that I am using the segmentation fault still sometimes occurs, but most of the time the code runs.

My python version - Python 3.5.1 (v3.5.1:37a07cee5969, Dec 6 2015, 01:38:48) [MSC v.1900 32 bit (Intel)] on win32

Please find the code below to replicate the issue.

import logging
from gensim.models.doc2vec import (
    Doc2Vec,
    TaggedDocument,
)

logging.basicConfig(
    format='%(asctime)s : %(threadName)s : %(levelname)s : %(message)s',
    level=logging.DEBUG,
)

def to_str(d):
    return ", ".join(d.keys())

SENTS = [
    "anecdotal using a personal experience or an isolated example instead of a sound argument or compelling evidence",
    "plausible thinking that just because something is plausible means that it is true",
    "occam razor is used as a heuristic technique discovery tool to guide scientists in the development of theoretical models rather than as an arbiter between published models",
    "karl popper argues that a preference for simple theories need not appeal to practical or aesthetic considerations",
    "the successful prediction of a stock future price could yield significant profit",
]

SENTS = [s.split() for s in SENTS]

def main():
    sentences_1 = [
        TaggedDocument(SENTS[0], tags=['SENT_0']),
        TaggedDocument(SENTS[1], tags=['SENT_1']),
        TaggedDocument(SENTS[2], tags=['SENT_2']),
    ]
    sentences_2 = [
        TaggedDocument(SENTS[3], tags=['SENT_3']),
        TaggedDocument(SENTS[4], tags=['SENT_4']),
    ]

    model = Doc2Vec(min_count=1, workers=4)

    model.build_vocab(sentences_1)
    model.train(sentences_1, total_examples=model.corpus_count, epochs=model.iter)

    print("-- Base model")
    print("Vocabulary:", to_str(model.wv.vocab))
    print("Tags:", to_str(model.docvecs.doctags))

    model.build_vocab(sentences_2, update=True)
    model.train(sentences_2, total_examples=model.corpus_count, epochs=model.iter)

    print("-- Updated model")
    print("Vocabulary:", to_str(model.wv.vocab))
    print("Tags:", to_str(model.docvecs.doctags))

    token_list = "the successful prediction of a stock future price could yield significant profit".split()
    infer_vector = model.infer_vector(token_list)
    print(model.docvecs.most_similar(positive=[infer_vector]))

if __name__ == '__main__':
    main()

menshikh-iv commented 6 years ago

Big thanks @khulasaandh, reproduced with Python 2.7.14 (default, Sep 23 2017, 22:06:14) [GCC 7.2.0] on linux2

The segfault moment:

In [6]: model.train(sentences_2, total_examples=model.corpus_count, epochs=model.iter)
/home/ivan/.virtualenvs/math/bin/ipython:1: DeprecationWarning: Call to deprecated `iter` (Attribute will be removed in 4.0.0, use self.epochs instead).
  #!/home/ivan/.virtualenvs/math/bin/python
2018-03-28 02:18:17,204 : MainThread : INFO : training model with 4 workers on 68 vocabulary and 100 features, using sg=0 hs=0 sample=0.001 negative=5 window=5
2018-03-28 02:18:17,207 : Thread-79 : DEBUG : job loop exiting, total 1 jobs
Segmentation fault (core dumped)

muleyprasad commented 6 years ago

Does anyone have a workaround until this gets fixed?

ConfusedMerlin commented 5 years ago

Hello,

I'm currently trying to get gensim to train on a collection of TaggedDocument objects that originate from a non-static source of input data. To put it differently: I need to add unpredictable TaggedDocument objects to my doc2vec model on a regular basis. And – you might have guessed it – I ran into the same problem as you did.

So it's gensim 3.8.0 on Debian Buster Linux, 64-bit.

The workaround offered by nsfinkelstein didn't work at all (besides, I do not know the size of my dictionary), which is sad... and probably caused by my limited Python experience (about... two weeks?). But (!) I noticed something:

If you are about to add new content to your dictionary, it will go straight into a segmentation fault if done the way one would expect: put new TaggedDocuments into the model using model.build_vocab(documents=newTD, update=True) and then call model.train(newTD). But by implementing the workaround the wrong way, I noticed that adding TaggedDocuments whose words are already present in the vocabulary won't trigger the segmentation fault.

here... look at these:

td1 = TaggedDocument(words=['1','2','3','4','5','6','7','8','9','10'], tags=[])
td2 = TaggedDocument(words=['11','12','13','14','15'], tags=[])

As you can see, the second one is a kind of logical extension of the first. And as you might have observed, the dictionary adds one entry for every word, roughly in the order the words are put in. So after td1 has been added to the vocab, asking for the vocab will yield '1','2','3','4','5','6','7','8','9','10'. Now one would tend to add td2, but this will cause the segmentation fault as soon as we call model.train(td2).

But if you do it this way:

td1 = TaggedDocument(words=['1','2','3','4','5','6','7','8','9','10'], tags=[])
td2 = TaggedDocument(words=['1','2','3','4','5','6','7','8','9','10','11','12','13','14','15'], tags=[])

you can actually train after adding td2 to the vocab.

It gets a bit harder when you need to insert words into the vocab:

td1 = TaggedDocument(words=['Im','very','confused','and','astonished','about','almost','all','and','everything'], tags=[])
td2 = TaggedDocument(words=['I','like','cats','and','dogs'], tags=[])

td1's vocabulary representation would omit the second 'and', so it would look like this: 'Im','very','confused','and','astonished','about','almost','all','everything'. If you want to repeat the effect from the numbers example I described, more work is needed: one needs to extract the existing vocabulary, append all words that are NOT already in the vocab in the order they appear, and offer all of this as a new TaggedDocument:

td3 = TaggedDocument(words=['Im','very','confused','and','astonished','about','almost','all','everything','I','like','cats','dogs'], tags=[])

Offering this via build_vocab(td3, update=True) will allow you to train the existing model with td2; a sketch of the whole trick follows below.
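
A hedged sketch, assuming gensim 3.8 (where the vocabulary lives in model.wv.vocab) and the td2 from above:

existing = list(model.wv.vocab.keys())  # current vocab, in insertion order
new_words = [w for w in td2.words if w not in model.wv.vocab]
td3 = TaggedDocument(words=existing + new_words, tags=[])

model.build_vocab([td3], update=True)   # the update sees a familiar prefix first
model.train([td2], total_examples=1, epochs=model.epochs)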

But... yes, there is always a but... while this does work with text (documents/words), as soon as you try to add tags to the whole thing, it goes back to segfaulting itself to death. Not even the "offer a special TaggedDocument" trick can solve this :(

And this brought me to a dead end, because I really need those tags... Any chance someone might find a solution for this?

raccoon-science commented 2 years ago

Hello, @korostelevm I tried to run your code with gensim 4.1.2 and it failed. Perhaps you could share the environment you used to run this code?