mzduchon opened 7 years ago
Yes, this is a known and intentional behavior. If you want to use raw integers as the document-tags, they should start from 0 and be contiguous, so they can be indexes into a compact numpy array.
This is an optimization that lets users capable of using such IDs save a lot of memory, versus the alternative where non-contiguous IDs (such as arbitrary strings or sparse integer indexes) require an extra mapping of ID to compact-index.
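For illustration, the two memory-safe tagging patterns look something like this (a minimal sketch; the token lists are placeholders):

```python
from gensim.models.doc2vec import TaggedDocument

texts = [["first", "doc", "tokens"], ["second", "doc", "tokens"]]

# Option 1: contiguous ints starting at 0 -- usable directly as indexes
# into the model's doc-vector array, so no extra mapping is needed.
docs_int = [TaggedDocument(words, [i]) for i, words in enumerate(texts)]

# Option 2: arbitrary string tags -- gensim keeps a string-to-index
# mapping, so only len(texts) vectors get allocated either way.
docs_str = [TaggedDocument(words, [f"doc-{i}"]) for i, words in enumerate(texts)]
```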
Do you have a suggestion for better documenting this behavior, or providing a better error when it causes memory problems?
I think it was the word "unique" that threw me off, so you might just explicitly call it the max tag ID. A warning might be shown whenever the number of tags greatly exceeds the number of documents. I understand multiple tags can be used, so it doesn't need to be 1-to-1, but more than 2-to-1 might warrant a warning, and more than 100-to-1 could actually throw an error. In any case, I greatly appreciate all the work you guys are doing -- while this issue took a couple of hours to track down, you've saved me weeks of work! Thanks!
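(A user-side version of that heuristic is easy to sketch; the function name and thresholds below are illustrative, not gensim API:)

```python
def check_tag_density(corpus, warn_ratio=2, error_ratio=100):
    """Compare the largest plain-int tag against the corpus size,
    since gensim allocates doc-vector rows for slots 0..max_int_tag."""
    max_tag = max(
        (tag for doc in corpus for tag in doc.tags if isinstance(tag, int)),
        default=-1,
    )
    slots = max_tag + 1
    if slots > error_ratio * len(corpus):
        raise ValueError(f'{slots} tag slots for only {len(corpus)} documents')
    if slots > warn_ratio * len(corpus):
        print(f'warning: {slots} tag slots for only {len(corpus)} documents')
```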
I used already-existing document IDs as tags, and model.build_vocab(train_corpus) took almost 25 minutes to finish. A note in the documentation would be helpful 👍
I followed the Doc2Vec Model Tutorial.
Code example:
```python
import gensim
import logging

logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

patent1 = {
    'abstract': """Lorem ipsum dolor sit amet, consetetur sadipscing elitr, sed diam""",
    'id': 8759248}
patent2 = {
    'abstract': """Lorem ipsum dolor sit amet, consetetur sadipscing elitr, sed diam nonumy eirmod tempor invidunt ut""",
    'id': 8134146}
patent3 = {
    'abstract': """Lorem ipsum dolor sit amet, consetetur sadipscing elitr, sed diam nonumy eirmod tempor invidunt ut labore et dolore magna aliquyam""",
    'id': 6497987}
patent4 = {
    'abstract': """Lorem ipsum dolor sit amet,""",
    'id': 9322041}
patent5 = {
    'abstract': """Lorem ipsum dolor sit amet, consetetur sadipscing elitr, sed diam nonumy eirmod tempor invidunt ut labore et dolore magna aliquyam erat, sed diam voluptua. At""",
    'id': 7361757}
patents = [patent1, patent2, patent3, patent4, patent5]

train_corpus = []
for patent in patents:
    # the raw patent IDs (in the millions) are used directly as int tags here
    td = gensim.models.doc2vec.TaggedDocument(patent['abstract'].lower().split(), [patent['id']])
    train_corpus.append(td)

model = gensim.models.doc2vec.Doc2Vec(vector_size=50, min_count=2, epochs=5)

def build_vocabulary():
    logging.info('start build_vocab')
    model.build_vocab(train_corpus)
    logging.info('build_vocab finished')

build_vocabulary()
```
@mkunib I'm not sure what you're asking.
But, per the above discussion, even if you only have your 5 documents, because you've used a plain-int ID as high as 9322041, the model is going to allocate enough space for 9,322,042 doc-vectors of 50 dimensions each. That'd require about 1.9GB, instead of the roughly 1KB that'd be required if you either (1) used int IDs [0, 1, 2, 3, 4]; or (2) used 5 string IDs, even just the string forms of your 5 ints.
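(The dominant term of that estimate, sketched in plain Python; gensim's own logged figure is slightly higher, presumably because it also counts word-related arrays:)

```python
# One float32 (4 bytes) per dimension, per allocated tag slot,
# where slots run from 0 through the max plain-int tag.
max_int_tag = 9322041
vector_size = 50
doc_vector_bytes = (max_int_tag + 1) * vector_size * 4
print(f'{doc_vector_bytes:,} bytes')  # 1,864,408,400 -> about 1.9 GB
```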
There should be an error in the logs when you over-allocate like this.
(Even with that over-allocation, I'm a bit surprised if that step took 25 minutes, unless something else is amiss, like that's triggering swapping. But also note that the forthcoming 4.0.0 release has a change to speed up random-initialization in such genuinely-large models.)
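A sketch of both fixes against the snippet above (it reuses that snippet's patents list; either variant keeps the allocation at 5 doc-vectors):

```python
# (1) contiguous int tags 0..4, doubling as array indexes:
train_corpus = [
    gensim.models.doc2vec.TaggedDocument(p['abstract'].lower().split(), [i])
    for i, p in enumerate(patents)
]

# (2) string tags, keeping the original patent IDs recoverable:
train_corpus = [
    gensim.models.doc2vec.TaggedDocument(p['abstract'].lower().split(), [str(p['id'])])
    for p in patents
]
```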
Something like 'A warning might be shown whenever the number of tags greatly exceeds the number of documents' in the code would have helped me a lot, because I wasn't aware of this. Or advice in the documentation like 'Be careful if you are using integers as tag IDs: they should start from 0 and be contiguous' would be great for beginners like me.
Maybe I missed the error in my logs (level=logging.DEBUG)? Should the error be in here?
```
2020-12-11 13:54:36,042 : INFO : start build_vocab
2020-12-11 13:54:36,043 : INFO : collecting all words and their counts
2020-12-11 13:54:36,043 : INFO : PROGRESS: at example #0, processed 0 words (0/s), 0 word types, 0 tags
2020-12-11 13:54:36,044 : INFO : collected 23 word types and 9322042 unique tags from a corpus of 5 examples and 75 words
2020-12-11 13:54:36,044 : INFO : Loading a fresh vocabulary
2020-12-11 13:54:36,045 : INFO : effective_min_count=2 retains 20 unique words (86% of original 23, drops 3)
2020-12-11 13:54:36,045 : INFO : effective_min_count=2 leaves 72 word corpus (96% of original 75, drops 3)
2020-12-11 13:54:36,047 : INFO : deleting the raw counts dictionary of 23 items
2020-12-11 13:54:36,047 : INFO : sample=0.001 downsamples 20 most-common words
2020-12-11 13:54:36,048 : INFO : downsampling leaves estimated 11 word corpus (15.9% of prior 72)
2020-12-11 13:54:36,049 : INFO : estimated required memory for 20 words and 50 dimensions: 1864426400 bytes
2020-12-11 13:54:36,049 : INFO : resetting layer weights
```
Yes, I thought there was an existing warning that would appear around the "collected" line, when 'unique tags' is larger than 'examples' - but it appears not. (There is in the forthcoming 4.0.0 release.)
If the 25-minute delay occurred right after the `resetting layer weights` line, that's also a step that will go much faster in 4.0.0 (even if a model is legitimately many gigabytes of doc-vecs, as opposed to just that large by mistake).
When creating a TaggedDocument, I figured I would use the ID that I already have for the document, e.g. TaggedDocument(text.split(), [ID]). However, since the ID is an integer, it seems to assume that there are that many "unique" documents in the collection (where the max ID seen = 14102100085).
Which gives an out-of-memory error -- that's a lot of bytes!
Most tutorials seem to use an enumeration so you get the correct number of "unique" tags; after I did that, it's much more reasonable.