Open gojomo opened 4 years ago
As an addition to the SO post, I want to add new documents to the model.
It seems this should be done with the add() method, but since that is not working I came up with the following workaround:
import numpy as np
from gensim.models.doc2vec import Doc2Vec

# PATH_to_model, new_vec and new_identifier are defined elsewhere
model = Doc2Vec.load(PATH_to_model)
# Add vector and identifier to original values
model.docvecs.vectors_docs = np.vstack([model.docvecs.vectors_docs, new_vec])
model.docvecs.index2entity.append(new_identifier)
# Test if new document is included
model.docvecs.most_similar(positive=[new_vec])
Calling the most_similar() method returns results that include this new document, even after saving and reloading the model. So it seems to work.
My question is whether this is a 'correct' way of working around this bug, or if I am missing something.
@ThijsKranenburg - If it works for your purposes, it's good enough! Note, though, that you've not yet done enough to look up the new vectors by identifier: that would also require adding entries to the model.docvecs.doctags dict. And the possible effects of such a workaround on any further training are unclear.
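To make the point about identifier lookup concrete, here is a minimal sketch of keeping all three structures in step when appending a document. It deliberately does not touch gensim: ToyDocvecs, append_doc and the locally defined Doctag record (including its field names) are hypothetical stand-ins chosen for illustration, and the real attributes on model.docvecs may hold different types.

```python
from collections import namedtuple

# Hypothetical per-tag record, defined locally for this sketch.
# The field names are an assumption, not taken from the thread.
Doctag = namedtuple("Doctag", "offset word_count doc_count")


class ToyDocvecs:
    """Toy container mimicking the three structures named in this thread."""

    def __init__(self):
        self.vectors_docs = []   # row i is the vector for index2entity[i]
        self.index2entity = []   # row index -> identifier
        self.doctags = {}        # identifier -> Doctag (holds the row offset)

    def append_doc(self, identifier, vector, word_count=0):
        """Add one document, updating all three structures together."""
        if identifier in self.doctags:
            raise KeyError(f"{identifier!r} is already present")
        offset = len(self.vectors_docs)
        self.vectors_docs.append(list(vector))
        self.index2entity.append(identifier)
        self.doctags[identifier] = Doctag(offset, word_count, 1)

    def __getitem__(self, identifier):
        """Look a vector up by identifier via the doctags mapping."""
        return self.vectors_docs[self.doctags[identifier].offset]


docvecs = ToyDocvecs()
docvecs.append_doc("doc_a", [1.0, 0.0])
docvecs.append_doc("doc_b", [0.0, 1.0])
```

In this toy, similarity search would only need vectors_docs and index2entity, which is why the workaround above appears to work, while lookup by identifier depends on the doctags entry that the workaround never adds.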
Per user report on SO, neither assignment to a bracketed-access (as would be implemented by __setitem__()) nor use of the add() method will successfully mutate a Doc2VecKeyedVectors object.
Looking closer, it seems the superclass __setitem__() passes through to superclass add(), which was only ever implemented for word-centric sets of vectors, consulting/updating properties like .vocab that only exist as empty values in Doc2VecKeyedVectors because of the currently confused inheritance created by #1777.
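The general mechanism can be sketched with two toy classes. This is not gensim's code: ToyKeyedVectors and ToyDocKeyedVectors are hypothetical, and the real failure mode in gensim may differ (for example, it may raise instead of silently writing to the wrong place). The sketch only shows how an inherited add() that knows about .vocab can leave a doc-centric subclass unchanged from the point of view of its own lookups.

```python
class ToyKeyedVectors:
    """Hypothetical word-centric base class: add() only knows about .vocab."""

    def __init__(self):
        self.vocab = {}      # key -> row index
        self.vectors = []    # rows addressed through .vocab

    def add(self, key, vector):
        if key in self.vocab:
            self.vectors[self.vocab[key]] = list(vector)
        else:
            self.vocab[key] = len(self.vectors)
            self.vectors.append(list(vector))

    def __setitem__(self, key, vector):
        # Bracketed assignment simply forwards to add().
        self.add(key, vector)


class ToyDocKeyedVectors(ToyKeyedVectors):
    """Hypothetical doc-centric subclass: lookups use .doctags/.vectors_docs."""

    def __init__(self):
        super().__init__()
        self.doctags = {}        # key -> row index
        self.vectors_docs = []   # rows addressed through .doctags

    def __getitem__(self, key):
        return self.vectors_docs[self.doctags[key]]


dv = ToyDocKeyedVectors()
dv["new_doc"] = [0.5, 0.5]   # goes through the inherited add()

landed_in_vocab = "new_doc" in dv.vocab     # the word-centric dict was updated
visible_as_doc = "new_doc" in dv.doctags    # the doc-centric dict was not
```

A fix along these lines would have the subclass override add() (and therefore __setitem__()) to write to its own structures, or untangle the inheritance so the word-centric method is never inherited in the first place.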