piskvorky / gensim

Topic Modelling for Humans
https://radimrehurek.com/gensim
GNU Lesser General Public License v2.1

doc2vec unique tags incorrect #1057

Open mzduchon opened 7 years ago

mzduchon commented 7 years ago

When creating a TaggedDocument, I figured I would use the ID that I already have for the document, e.g. TaggedDocument(text.split(), [ID]). However, since the ID is an integer, the model seems to assume that there are that many "unique" documents in the collection (the max ID seen is 14102100085):

collected 9324 word types and 14102100085 unique tags from a corpus of 13044 examples and 150972 words
estimated required memory for 9324 words and 10 dimensions: 564091276120 bytes

Which gives an out-of-memory error -- that's a lot of bytes!

Most tutorials seem to use an enumeration so you get the correct number of "unique" tags; after I switched to that (sketched below), the estimate is much more reasonable:

collected 9324 word types and 13044 unique tags from a corpus of 13044 examples and 150972 words
estimated required memory for 9324 words and 10 dimensions: 7794480 bytes
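
For reference, a minimal sketch of that enumeration approach (texts here is just an illustrative list of raw document strings):

import gensim

texts = ["first document text here", "second document text here"]
train_corpus = [
    gensim.models.doc2vec.TaggedDocument(text.split(), [i])
    for i, text in enumerate(texts)
]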
gojomo commented 7 years ago

Yes, this is a known and intentional behavior. If you want to use raw integers as the document-tags, they should start from 0 and be contiguous, so they can be indexes into a compact numpy array.

This is an optimization that lets users capable of using such IDs save a lot of memory, versus the alternative where non-contiguous IDs (such as arbitrary strings or sparse integer indexes) require an extra mapping of ID to compact-index.
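
For illustration, a minimal sketch of the non-contiguous case handled with string tags (the IDs and texts here are placeholders); gensim then maintains the tag-to-index mapping internally:

from gensim.models.doc2vec import TaggedDocument

docs = {14102100085: "some document text", 8134146: "another document text"}

# string tags: arbitrary IDs work, at the cost of an internal tag -> compact-index map
corpus = [
    TaggedDocument(text.split(), [str(doc_id)])
    for doc_id, text in docs.items()
]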

Do you have a suggestion for better documenting this behavior, or providing a better error when it causes memory problems?

mzduchon commented 7 years ago

I think it was the word "unique" that threw me off, so you might just explicitly call it the max tag ID. A warning could be shown whenever the number of tags greatly exceeds the number of documents. I understand multiple tags can be used, so it doesn't need to be 1-to-1, but more than 2-to-1 might warrant a warning, and more than 100-to-1 could actually throw an error. In any case, I greatly appreciate all the work you guys are doing -- while this issue took a couple of hours to track down, you've saved me weeks of work! Thanks!

mkunib commented 3 years ago

I used already existing document IDs as tags and model.build_vocab(train_corpus) took almost 25 minutes to finish. A note about this in the documentation would be helpful 👍

I followed the Doc2Vec Model Tutorial.

Code example:

import gensim
import logging
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

patent1 = {
    'abstract': """Lorem ipsum dolor sit amet, consetetur sadipscing elitr, sed diam""",
    'id': 8759248}
patent2 = {
    'abstract': """Lorem ipsum dolor sit amet, consetetur sadipscing elitr, sed diam nonumy eirmod tempor invidunt ut""",
    'id': 8134146}
patent3 = {
    'abstract': """Lorem ipsum dolor sit amet, consetetur sadipscing elitr, sed diam nonumy eirmod tempor invidunt ut labore et dolore magna aliquyam""",
    'id': 6497987}
patent4 = {
    'abstract': """Lorem ipsum dolor sit amet,""",
    'id': 9322041}
patent5 = {
    'abstract': """Lorem ipsum dolor sit amet, consetetur sadipscing elitr, sed diam nonumy eirmod tempor invidunt ut labore et dolore magna aliquyam erat, sed diam voluptua. At""",
    'id': 7361757}
patents = [patent1, patent2, patent3, patent4, patent5]

train_corpus = []

for patent in patents:
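    # NOTE: each patent's existing ID (a large, non-contiguous int) is used directly as the tag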
    td = gensim.models.doc2vec.TaggedDocument(patent['abstract'].lower().split(), [patent['id']])
    train_corpus.append(td)

model = gensim.models.doc2vec.Doc2Vec(vector_size=50, min_count=2, epochs=5)

def build_vocabulary():
    logging.info('start build_vocab')
    model.build_vocab(train_corpus)
    logging.info('build_vocab finished')

build_vocabulary()
gojomo commented 3 years ago

@mkunib I'm not sure what you're asking.

But, per the above discussion, even if you only have your 5 documents, because you've used a plain-int ID as high as 9322041, the model is going to allocate enough space for 9,322,042 doc-vectors of 50 dimensions each. That'd require about 1.9GB, instead of the roughly 1KB that'd be required if you either (1) used int IDs [0, 1, 2, 3, 4]; or (2) used 5 string IDs, even just the strings of your 5 ints.
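
A minimal sketch of either fix applied to your loop (only the tag passed to TaggedDocument changes):

train_corpus = []
for i, patent in enumerate(patents):
    # option (1): contiguous int tags 0..4, so the allocation matches the corpus size
    tag = i
    # option (2): or keep the original ID, but as a string tag
    # tag = str(patent['id'])
    td = gensim.models.doc2vec.TaggedDocument(patent['abstract'].lower().split(), [tag])
    train_corpus.append(td)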

There should be an error in the logs when you over-allocate like this.

(Even with that over-allocation, I'm a bit surprised that step took 25 minutes, unless something else is amiss, such as the allocation triggering swapping. But also note that the forthcoming 4.0.0 release has a change to speed up random initialization in such genuinely-large models.)

mkunib commented 3 years ago

Something like the suggestion above ('a warning whenever the number of tags greatly exceeds the number of documents') implemented in the code would have helped me a lot, because I wasn't aware of this. Or a note in the documentation like 'Be careful if you are using integers as tag ids: they should start from 0 and be contiguous' would be great for beginners like me.
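
In the meantime, something like the following check before build_vocab would have caught this on my side (check_tags is just an illustrative helper, not part of gensim):

def check_tags(tagged_docs):
    # flag suspiciously large plain-int tags relative to the corpus size
    int_tags = [t for doc in tagged_docs for t in doc.tags if isinstance(t, int)]
    if int_tags and max(int_tags) + 1 > 2 * len(tagged_docs):
        raise ValueError(
            "max int tag %d far exceeds the %d documents; "
            "use 0-based contiguous ints or string tags"
            % (max(int_tags), len(tagged_docs)))

check_tags(train_corpus)
model.build_vocab(train_corpus)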

Maybe I missed the error in my logs (level=logging.DEBUG)? Should the error be in here?

2020-12-11 13:54:36,042 : INFO : start build_vocab
2020-12-11 13:54:36,043 : INFO : collecting all words and their counts
2020-12-11 13:54:36,043 : INFO : PROGRESS: at example #0, processed 0 words (0/s), 0 word types, 0 tags
2020-12-11 13:54:36,044 : INFO : collected 23 word types and 9322042 unique tags from a corpus of 5 examples and 75 words
2020-12-11 13:54:36,044 : INFO : Loading a fresh vocabulary
2020-12-11 13:54:36,045 : INFO : effective_min_count=2 retains 20 unique words (86% of original 23, drops 3)
2020-12-11 13:54:36,045 : INFO : effective_min_count=2 leaves 72 word corpus (96% of original 75, drops 3)
2020-12-11 13:54:36,047 : INFO : deleting the raw counts dictionary of 23 items
2020-12-11 13:54:36,047 : INFO : sample=0.001 downsamples 20 most-common words
2020-12-11 13:54:36,048 : INFO : downsampling leaves estimated 11 word corpus (15.9% of prior 72)
2020-12-11 13:54:36,049 : INFO : estimated required memory for 20 words and 50 dimensions: 1864426400 bytes
2020-12-11 13:54:36,049 : INFO : resetting layer weights
gojomo commented 3 years ago

Yes, I thought there was an existing warning that would appear around the "collected" line when 'unique tags' is larger than 'examples', but it appears there isn't. (There is one in the forthcoming 4.0.0 release.)

If the 25 minute delay occurred right after the resetting layer weights line, that's also a step that will go much faster in 4.0.0 (even if a model is legitimately many-gigabytes of doc-vecs, as opposed to just that large by mistake).