piskvorky / gensim

Topic Modelling for Humans
https://radimrehurek.com/gensim
GNU Lesser General Public License v2.1

doc2vec not producing all document vectors #3193

Closed dmitra79 closed 3 years ago

dmitra79 commented 3 years ago

Problem description

I am trying to train doc2vec and am running into a couple of strange things. I have a collection of 128032 documents (derived from MIMIC CXR reports), but when I train doc2vec, I end up with just a fraction of the vectors:

model = Doc2Vec(documents, vector_size=5, window=2, min_count=2, workers=4)
print(model.wv.vectors.shape)
print(model.dv.vectors.shape)

produces:

(12801, 5)
(10, 5)

According to documentation at: https://radimrehurek.com/gensim/models/doc2vec.html 'dv' is supposed to have document vectors. Why are there only 10?

Document examples:

TaggedDocument(['no', 'acute', 'cardiopulmonary', 'process.', 'there', 'is', 'no', 'focal', 'consolidation,', 'pleural', 'effusion', 'or', 'pneumothorax.', 'bilateral', 'nodular', 'opacities', 'that', 'most', 'likely', 'represent', 'nipple', 'shadows.', 'cardiomediastinal', 'silhouette', 'is', 'normal.', 'clips', 'project', 'over', 'left', 'lung,', 'potentially', 'within', 'breast.', 'imaged', 'upper', 'abdomen', 'is', 'unremarkable.', 'chronic', 'deformity', 'posterior', 'left', 'sixth', 'seventh', 'ribs', 'are', 'noted.'], 50414267)
TaggedDocument(['no', 'acute', 'cardiopulmonary', 'process.', 'lungs', 'are', 'clear', 'focal', 'consolidation,', 'pleural', 'effusion', 'or', 'pneumothorax.', 'heart', 'size', 'is', 'normal.', 'mediastinal', 'contours', 'are', 'normal.', 'multiple', 'surgical', 'clips', 'project', 'over', 'left', 'breast,', 'old', 'left', 'rib', 'fractures', 'are', 'noted.'], 56699142)
TaggedDocument(['no', 'evidence', 'acute', 'cardiopulmonary', 'process.', 'as', 'compared', 'prior', 'examination', 'dated', '___,', 'there', 'has', 'been', 'no', 'significant', 'interval', 'change.', 'there', 'is', 'no', 'evidence', 'focal', 'consolidation,', 'pleural', 'effusion,', 'pneumothorax,', 'or', 'frank', 'pulmonary', 'edema.', 'cardiomediastinal', 'silhouette', 'is', 'within', 'normal', 'limits.', 'there', 'is', 'persistent', 'thoracic', 'kyphosis', 'with', 'mild', 'wedging', 'mid', 'thoracic', 'vertebral', 'body.'], 54205396)

Steps/code/corpus to reproduce

Model lifecycle:

[{'msg': 'effective_min_count=2 retains 12801 unique words (100.0%% of original 12801, drops 0)', 'datetime': '2021-07-13T17:40:07.046896', 'gensim': '4.0.1', 'python': '3.9.5 (default, May 18 2021, 14:42:02) [MSC v.1916 64 bit (AMD64)]', 'platform': 'Windows-10-10.0.18363-SP0', 'event': 'prepare_vocab'}, {'msg': 'effective_min_count=2 leaves 6811903 word corpus (100.0%% of original 6811903, drops 0)', 'datetime': '2021-07-13T17:40:07.047886', 'gensim': '4.0.1', 'python': '3.9.5 (default, May 18 2021, 14:42:02) [MSC v.1916 64 bit (AMD64)]', 'platform': 'Windows-10-10.0.18363-SP0', 'event': 'prepare_vocab'}, {'msg': 'downsampling leaves estimated 4932315.19961421 word corpus (72.4%% of prior 6811903)', 'datetime': '2021-07-13T17:40:07.268939', 'gensim': '4.0.1', 'python': '3.9.5 (default, May 18 2021, 14:42:02) [MSC v.1916 64 bit (AMD64)]', 'platform': 'Windows-10-10.0.18363-SP0', 'event': 'prepare_vocab'}, {'msg': 'training model with 4 workers on 12801 vocabulary and 5 features, using sg=0 hs=0 sample=0.001 negative=5 window=2', 'datetime': '2021-07-13T17:40:07.857205', 'gensim': '4.0.1', 'python': '3.9.5 (default, May 18 2021, 14:42:02) [MSC v.1916 64 bit (AMD64)]', 'platform': 'Windows-10-10.0.18363-SP0', 'event': 'train'}, {'msg': 'training on 68119030 raw words (59567377 effective words) took 440.6s, 135184 effective words/s', 'datetime': '2021-07-13T17:47:28.499426', 'gensim': '4.0.1', 'python': '3.9.5 (default, May 18 2021, 14:42:02) [MSC v.1916 64 bit (AMD64)]', 'platform': 'Windows-10-10.0.18363-SP0', 'event': 'train'}, {'params': 'Doc2Vec(dm/m,d5,n5,w2,mc2,s0.001,t4)', 'datetime': '2021-07-13T17:47:28.501476', 'gensim': '4.0.1', 'python': '3.9.5 (default, May 18 2021, 14:42:02) [MSC v.1916 64 bit (AMD64)]', 'platform': 'Windows-10-10.0.18363-SP0', 'event': 'created'}]

Versions

Windows-10-10.0.18363-SP0
Python 3.9.5 (default, May 18 2021, 14:42:02) [MSC v.1916 64 bit (AMD64)]
Bits 64
NumPy 1.20.2
SciPy 1.6.2
gensim 4.0.1
FAST_VERSION 1

gojomo commented 3 years ago

Does your model.dv.index_to_key list just have 10 individual one-character digits in it?

If so, it's likely that for each document's tags you've supplied a single numeric string, instead of the expected list-of-distinct-tags. (That list, in the classic case, will only have a single tag per document.)

Supplying a string-of-a-number will make each document look like it has a list of multiple tags, but all those tags are drawn from the digits '0' to '9'.
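
To illustrate the point above, here is a minimal pure-Python sketch of why a bare string tag collapses into single-digit tags. It uses a namedtuple stand-in with the same `words`/`tags` fields as gensim's `TaggedDocument` so that it runs without gensim installed. The helper name `collect_tags` is hypothetical, and it only mimics iterating over each document's tags; it is not gensim's actual vocabulary scan.

```python
from collections import namedtuple

# Stand-in with the same field layout as gensim's TaggedDocument (assumption:
# used here only so the sketch runs without gensim installed).
TaggedDocument = namedtuple("TaggedDocument", "words tags")

report_ids = ["50414267", "56699142", "54205396"]
words = ["no", "acute", "cardiopulmonary", "process."]


def collect_tags(documents):
    """Hypothetical helper: gather every distinct tag by iterating each doc's tags."""
    seen = set()
    for doc in documents:
        for tag in doc.tags:  # iterating a str yields its characters
            seen.add(tag)
    return seen


# Wrong: tags is a bare string, so each character is treated as a tag.
wrong_docs = [TaggedDocument(words, rid) for rid in report_ids]
wrong_tags = collect_tags(wrong_docs)

# Right: tags is a list holding one string tag per document.
right_docs = [TaggedDocument(words, [rid]) for rid in report_ids]
right_tags = collect_tags(right_docs)

print(sorted(wrong_tags))  # single digits only, never more than ten
print(sorted(right_tags))  # one tag per report ID
```

With the bare-string form, no corpus size can ever produce more than ten distinct tags, which matches the `(10, 5)` shape reported for `model.dv.vectors`.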

(You can use a raw int, rather than a string, as an entry in a doc's tags, which offers a slight memory benefit in some very-large-corpus situations. But if you do, those ints should be contiguous & ascending from 0. If you use a raw int like 54205396, the model will allocate enough vectors for the highest int ID encountered, in that case 54-million-plus, rather than the true number of unique docs, 128032. With a manageable number of documents like your count, string tags make the most sense, but they still need to be in a list-of-tags, even if it is a list of one.)
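
The trade-off with raw int tags can be sketched in plain Python. This assumes, per the comment above, that the number of allocated doc-vectors tracks the highest int tag seen (taken here as max + 1, since tags start at 0). `remap_to_contiguous` is a hypothetical helper, not a gensim function.

```python
raw_ids = [50414267, 56699142, 54205396]


def remap_to_contiguous(ids):
    """Hypothetical helper: map arbitrary IDs to 0..N-1 in first-seen order."""
    mapping = {}
    for raw in ids:
        if raw not in mapping:
            mapping[raw] = len(mapping)
    return mapping


# Using the raw IDs directly as int tags: slots are needed up to the largest ID.
slots_if_raw = max(raw_ids) + 1

# Option 1: contiguous ints starting at 0, each wrapped in a list of tags.
id_to_index = remap_to_contiguous(raw_ids)
int_tags = [[id_to_index[raw]] for raw in raw_ids]
slots_if_remapped = max(id_to_index.values()) + 1

# Option 2: string tags, each wrapped in a list of one.
str_tags = [[str(raw)] for raw in raw_ids]

print(slots_if_raw, slots_if_remapped)
print(int_tags)
print(str_tags)
```

Either `int_tags[i]` or `str_tags[i]` could then be passed as the second argument when building the i-th `TaggedDocument`. If the remapped ints are used, keep `id_to_index` around so that vectors can be traced back to the original report IDs.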

dmitra79 commented 3 years ago

That was the problem - thank you!