piskvorky / gensim

Topic Modelling for Humans
https://radimrehurek.com/gensim
GNU Lesser General Public License v2.1

Doc2VecKeyedVectors doesn't effectively support __setitem__()/add() #2683

Open gojomo opened 4 years ago

gojomo commented 4 years ago

Per a user report on SO, neither assignment via bracketed access (as would be implemented by __setitem__()) nor use of the add() method will successfully mutate a Doc2VecKeyedVectors object.

Looking closer, it seems the superclass __setitem__() passes through to the superclass add(), which was only ever implemented for word-centric sets of vectors – consulting/updating properties like .vocab that only exist as empty values in Doc2VecKeyedVectors because of the currently confused inheritance created by #1777.

ThijsKranenburg commented 4 years ago

As an addition to the SO post, I want to add new documents to the model.

It seems this should be done with the add() method, but since that is not working I figured out the following workaround:

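import numpy as np
from gensim.models.doc2vec import Doc2Vec

# PATH_to_model, new_vec and new_identifier are placeholders defined elsewhere
model = Doc2Vec.load(PATH_to_model)

# Append the new vector and its identifier to the existing values
model.docvecs.vectors_docs = np.vstack([model.docvecs.vectors_docs, new_vec])
model.docvecs.index2entity.append(new_identifier)

# Check that the new document shows up in similarity results
model.docvecs.most_similar(positive=[new_vec])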

Calling the most_similar() method returns results that include the new document, even after saving and reloading the model. So it seems to work.

My question is whether this is a 'correct' way of working around this bug, or if I am missing something.

gojomo commented 4 years ago

@ThijsKranenburg - If it works for your purposes, it's good enough! Note, though, that you've not yet done enough to look up the new vectors by identifier; that would also require adding entries to the model.docvecs.doctags dict. And the possible effects of such a workaround on any further training are unclear.
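
To illustrate the gap described above, here is a minimal toy sketch in plain Python. It is not gensim code: ToyDocVectors, index_to_key and key_to_index are hypothetical names, with key_to_index standing in for the role that model.docvecs.doctags plays. Assuming a store keeps vectors, a position-to-identifier list and an identifier-to-position dict, updating only the first two is enough for similarity search but not for lookup by identifier.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


class ToyDocVectors:
    """Hypothetical stand-in for a keyed-vector store (not gensim's class)."""

    def __init__(self):
        self.vectors = []        # one vector per document, by position
        self.index_to_key = []   # position -> identifier
        self.key_to_index = {}   # identifier -> position

    def add(self, key, vector):
        """Append a vector and keep all three structures in sync."""
        self.key_to_index[key] = len(self.vectors)
        self.index_to_key.append(key)
        self.vectors.append(list(vector))

    def __getitem__(self, key):
        return self.vectors[self.key_to_index[key]]

    def most_similar(self, vector, topn=3):
        """Rank stored documents by cosine similarity to `vector`."""
        scored = [(k, cosine(vector, v))
                  for k, v in zip(self.index_to_key, self.vectors)]
        return sorted(scored, key=lambda kv: kv[1], reverse=True)[:topn]


store = ToyDocVectors()
store.add("doc_a", [1.0, 0.0])
store.add("doc_b", [0.0, 1.0])

# Partial update, analogous to the workaround: vectors and position list only.
store.vectors.append([0.6, 0.8])
store.index_to_key.append("doc_new")

# Similarity search sees the new document...
top_key = store.most_similar([0.6, 0.8], topn=1)[0][0]
# ...but lookup by identifier does not, because the dict was never updated.
found_before = "doc_new" in store.key_to_index

# Completing the update makes lookup by identifier work as well.
store.key_to_index["doc_new"] = len(store.vectors) - 1
vector_after = store["doc_new"]
```

In the toy, store["doc_new"] would raise a KeyError before the final dict update. Whether the real Doc2VecKeyedVectors needs further bookkeeping beyond the doctags entry is not settled in this thread.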