Closed dmitra79 closed 3 years ago
Does your model.dv.index_to_key list just have 10 individual one-character digits in it?
If so, it's likely that for each document's tags you've supplied a single numeric string, instead of the expected list-of-distinct-tags. (That list, in the classic case, will only have a single tag per document.)
Supplying a string-of-a-number will make each document look like it has a list of multiple tags, but all those tags are drawn from the digits '0' to '9'.
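To make the failure mode concrete, here is a minimal pure-Python sketch. It does not call gensim: the namedtuple is only a stand-in for a (words, tags) document, distinct_tags is a hypothetical helper that mimics collecting tags by iterating over each document's tags, and every ID except 54205396 is made up.

```python
from collections import namedtuple

# Stand-in for a (words, tags) tagged document; an assumption for illustration,
# not gensim's own class.
TaggedDocument = namedtuple("TaggedDocument", ["words", "tags"])


def distinct_tags(corpus):
    """Collect distinct tags in first-seen order by iterating each doc's tags."""
    seen = {}
    for doc in corpus:
        for tag in doc.tags:  # iterating a str yields its single characters
            seen.setdefault(tag, None)
    return list(seen)


doc_ids = ["54205396", "50414267", "53189527"]  # illustrative IDs only
words = ["no", "acute", "findings"]

# Wrong: tags is a bare string, so each digit is treated as a separate tag.
wrong = [TaggedDocument(words, doc_id) for doc_id in doc_ids]

# Right: tags is a list holding one string tag per document.
right = [TaggedDocument(words, [doc_id]) for doc_id in doc_ids]

print(distinct_tags(wrong))  # single-character digit tags
print(distinct_tags(right))  # one tag per document
```

With the bare-string form, the number of distinct tags can never exceed 10, no matter how many documents there are.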
(You can use a raw int, rather than a string, as an entry in a doc's tags, which offers a slight memory benefit in some very-large-corpus situations. But if you do, those ints should be contiguous & ascending from 0. If you use a raw int like 54205396, the model will allocate enough vectors for the highest int ID encountered – in that case 54-million-plus! – rather than the true number of unique docs, 128032. With a manageable number of documents like your count, string tags make the most sense, but still need to be in a list-of-tags, even if a list of one.)
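The arithmetic behind the raw-int caveat can be sketched as follows. This is an illustration of the behaviour described above, not gensim's code: slots_for_int_tags and remap_to_contiguous are hypothetical helpers, and the small ID lists are made up apart from 54205396.

```python
def slots_for_int_tags(int_tags):
    """Vector slots needed if each raw int tag is used directly as a row index."""
    return max(int_tags) + 1


def remap_to_contiguous(original_ids):
    """Map arbitrary IDs to contiguous ints ascending from 0."""
    return {orig: i for i, orig in enumerate(sorted(set(original_ids)))}


contiguous = list(range(5))   # 0, 1, 2, 3, 4
sparse = [3, 54205396]        # only two docs, but a huge maximum ID

print(slots_for_int_tags(contiguous))  # 5
print(slots_for_int_tags(sparse))      # 54205397

# If int tags are really wanted, remap the original IDs first.
mapping = remap_to_contiguous(sparse)
print(slots_for_int_tags(list(mapping.values())))  # 2
```

If you remap like this, keep the mapping around so you can translate back to the original report IDs later.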
That was the problem - thank you!
Problem description
I am trying to train doc2vec and running into a couple of strange things. I have a collection of 128032 documents (derived from MIMIC CXR reports), but when I train doc2vec, I end up with just a fraction of the vectors:
produces:
According to the documentation at https://radimrehurek.com/gensim/models/doc2vec.html, 'dv' is supposed to hold the document vectors. Why are there only 10?
Document examples:
Steps/code/corpus to reproduce
Model lifecycle:
Versions