piskvorky / gensim

Topic Modelling for Humans
https://radimrehurek.com/gensim

Document how doc2vec is including tags into PV-DM and how PV-DBOW compares to the publication #1925

Open Make42 opened 6 years ago

Make42 commented 6 years ago

In the original Paragraph Vector publication, only unique identifiers for the different paragraphs ("documents" in gensim) are used. Two models are presented: PV-DM, which is an extension of CBOW from word2vec, and PV-DBOW, which is analogous (though not so much an extension) to Skip-Gram from word2vec.

In the following let's assume C context words.

PV-DM

The article https://medium.com/scaleabout/a-gentle-introduction-to-doc2vec-db3e8c0cce5e explains the extension nicely: in addition to the C input layers for the C context words (which share the same neural weights!), we add a single extra input layer which is special because it has its own neural weights. Each paragraph/document has its own ID, which acts like a special word. As far as I understand, the paragraph ID is also one-hot encoded. The same article further expands on this by adding another layer for tags, which are not paragraph IDs!

What is not clear to me, though, is whether each tag gets its own layer or whether all tags together get one layer. If it is the former, we would always need the same number of tags for each paragraph, and the input of each layer would be a one-hot-encoded vector. If it is the latter (one layer for all tags), then with more than one tag the tag input vector would obviously be hot-encoded, but not one-hot-encoded: it would either use multiple 1s, or use 1 divided by the number of tags of the paragraph as the "hot" entries. That is also speculation on my part.

The article's implementation uses gensim. However, in that implementation it looks like the document ID is handled like any other tag. This suggests that gensim implements it neither as the article describes (with separate layers for the ID and the tags), nor as the original publication does (which only has paragraph-unique IDs), but basically considers everything to be a tag. Furthermore, it seems to me from another article (which I can't find right now) that it is possible to give each document a different number of tags. This would suggest that gensim uses an implementation with one layer for all those tags (I described above why I think that).

More specifically, my questions are:

  1. Can gensim documents have different numbers of tags per document?
  2. Are there multiple layers for each tag per document?
  3. Are document IDs and tags treated differently or within the same layer?
  4. If there is one layer for tags, how are tags encoded? Hot-encoded with multiple 1s, with 1/(number of tags for this document) as the entries, or in some other way?

To find out, I checked the source code of gensim, which wasn't that easy. I think the methods train_document_dm and train_document_dm_concat are most relevant here. Let's look at train_document_dm first. Here window_pos seems to be the sliding window, making word2_indexes the context words and word the central word. (Interestingly, in the original publication there is no central word; the predicted word is the last word of the window!) There is also some random window reduction going on - I guess that is similar to negative sampling or something like that, but I am not sure. count seems to be the number of context words plus tags - I think it is used for locking weights at the end of the method. Before that, however, we go into the method train_cbow_pair. Here I am basically lost... I guess neu1e (what kind of name is that?) is somehow the weight update - but this would mean that the same learning step is applied to the word weights as well as to the tag weights (lines 172 and 175), which makes little sense to me: even if train_document_dm uses mean-averaging as the aggregation after the hidden layer, I would expect different updates for the tag weights and the word weights.
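For reference, here is how I currently picture one mean-aggregated PV-DM step in NumPy. This is my own toy sketch with made-up names (pv_dm_mean_step, etc.), no negative sampling or hierarchical softmax, and arbitrary dimensions - it is not the actual gensim code:

    import numpy as np

    rng = np.random.default_rng(0)

    # Toy dimensions, purely for illustration (not gensim's internals).
    vocab_size, num_docs, dim = 10, 3, 4
    alpha = 0.025                                                   # learning rate

    word_vectors = rng.normal(scale=0.1, size=(vocab_size, dim))    # input word vectors
    doctag_vectors = rng.normal(scale=0.1, size=(num_docs, dim))    # doc-tag vectors
    output_weights = rng.normal(scale=0.1, size=(vocab_size, dim))  # hidden->output weights

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    def pv_dm_mean_step(doc_index, context_word_indexes, target_word_index):
        """One simplified PV-DM step with mean aggregation; the target word
        is treated as a single positive example."""
        inputs = [doctag_vectors[doc_index]] + [word_vectors[i] for i in context_word_indexes]
        count = len(inputs)
        h = np.mean(inputs, axis=0)                    # hidden layer = mean of all inputs

        score = sigmoid(output_weights[target_word_index] @ h)
        g = (1.0 - score) * alpha                      # prediction error * learning rate

        neu1e = g * output_weights[target_word_index]  # error w.r.t. the *mean* vector h
        output_weights[target_word_index] += g * h     # hidden-layer ('learn_hidden') update

        # Since h is a mean of the inputs, the gradient w.r.t. each individual
        # input vector is neu1e / count - i.e. the same update for the doc-tag
        # vector and for every context word vector.
        doctag_vectors[doc_index] += neu1e / count
        for i in context_word_indexes:
            word_vectors[i] += neu1e / count

    pv_dm_mean_step(doc_index=0, context_word_indexes=[1, 2, 3], target_word_index=4)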

train_document_dm_concat uses concatenation as the aggregation instead of mean-averaging, and neu1e_r seems to provide at least different updates for the different weight matrices, as seen in lines 241 and 243 with neu1e_r[:doctag_len] and neu1e_r[doctag_len:] respectively.
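And a tiny sketch of how I picture the slicing for the concatenated variant (made-up dimensions, not gensim's actual internals):

    import numpy as np

    dim = 4            # size of each individual vector
    doctag_len = 1     # number of doc-tag vectors in the concatenation
    window_words = 4   # number of context word slots

    # A hypothetical error for the whole concatenated input layer, reshaped so
    # that each row corresponds to one of the concatenated vectors.
    neu1e = np.random.default_rng(1).normal(size=dim * (doctag_len + window_words))
    neu1e_r = neu1e.reshape(doctag_len + window_words, dim)

    doctag_updates = neu1e_r[:doctag_len]   # rows applied to the doc-tag vector(s)
    word_updates = neu1e_r[doctag_len:]     # rows applied to the context word vectors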

PV-DBOW

Regarding PV-DBOW, https://medium.com/scaleabout/a-gentle-introduction-to-doc2vec-db3e8c0cce5e sticks with the original publication. The relevant method in gensim seems to be train_document_dbow. According to the publication, the input of PV-DBOW is the document ID and the targets are the one-hot-encoded words of the sliding window. This is very similar to Skip-Gram, so gensim reuses that implementation. However, judging from this part of train_document_dbow:

        if train_words and learn_words:
            train_batch_sg(model, [doc_words], alpha, work)
        for doctag_index in doctag_indexes:
            for word in doc_words:
                train_sg_pair(
                    model, word, doctag_index, alpha, learn_vectors=learn_doctags, learn_hidden=learn_hidden,
                    context_vectors=doctag_vectors, context_locks=doctag_locks
                )

it seems to first train a Skip-Gram model on the words without using the doctags, and then to train the Skip-Gram model in such a way that each word is the input and the tags are the output - very different from the publication. Here I can only ask: what's up with that?

Conclusion

I raised quite a few questions while comparing the original publication, an article on the topic I found online, and the gensim implementation. The article leaves a couple of questions open; checking out the gensim source code did not answer any of them, but raised more questions regarding PV-DM.

Comparing the gensim implementation of PV-DBOW with the original publication made matters worse, as they don't seem to be related at all.

I am sure there is a lot of misunderstanding on my part. However, I would appreciate it a lot if this could be clarified - first here in the thread, and later, if possible, in the documentation as well. Adding a lot more commentary to the source code would also be a good improvement.

menshikh-iv commented 6 years ago

related issue #1920 CC: @gojomo

menshikh-iv commented 6 years ago

@steremma this is work for you too :dancer:

gojomo commented 6 years ago

The project discussion list, https://groups.google.com/forum/#!forum/gensim, is a better forum for these open-ended requests for explanation of the source code, as there's no bug here, nor even (in my opinion) a gap in the documentation.

Regarding the PV-DM questions:

(1) Yes, that's why the TaggedDocument template class for text examples accepts a tags argument that is a list of one-or-more tags.

(2) No. Each 'tag' is a lookup-key into an array of candidate vectors-in-training. All tags pull from this same array, just like all known vocabulary words pull candidate word vectors from a shared store.

(3) There's no notion of 'document IDs' in the Doc2Vec implementation, but using unique document IDs as tags is a common strategy. (That is, the tags can be anything, but a usual approach is to use a unique ID per document as its one and only tag.)

(4) Tags are just lookup-keys. A tag key retrieves a matched candidate vector. That vector gets training-nudged by text examples which reference its tag - this is fully analogous to the handling of word-vectors in pure Word2Vec modes. (The document-tags come from a separate namespace from words.)
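To make (1)-(4) concrete, a rough sketch (parameter names as in current gensim; older releases used size/iter instead of vector_size/epochs, and the toy corpus and hyperparameters are only illustrative):

    from gensim.models.doc2vec import Doc2Vec, TaggedDocument

    # Toy corpus: each document may have any number of tags, and a per-document
    # ID is simply used as one more tag.
    corpus = [
        TaggedDocument(words=["the", "cat", "sat", "on", "the", "mat"], tags=["DOC_0"]),
        TaggedDocument(words=["the", "dog", "barked", "at", "the", "cat"], tags=["DOC_1", "animals"]),
        TaggedDocument(words=["stocks", "fell", "sharply", "on", "monday"], tags=["DOC_2", "finance", "news"]),
    ]

    model = Doc2Vec(corpus, dm=1, vector_size=50, window=2, min_count=1, epochs=20)

    # Every tag - whether a unique ID or a shared label - is a key into the same
    # array of doc-vectors (model.docvecs here; model.dv in later releases).
    print(model.docvecs["DOC_1"].shape)     # (50,)
    print(model.docvecs["finance"].shape)   # (50,)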

Regarding the PV-DBOW question:

Pure PV-DBOW as described in the PV paper doesn't create (input/context) word-vectors at all – so by default, train_words is false, and no skip-gram training is occurring, just the doctag-training.

But, it's common to want word-vectors as well, and the PV followup paper ('Document Embedding with Paragraph Vectors' https://arxiv.org/abs/1507.07998) mentions doing the (highly-analogous) skip-gram training in parallel with doc-vector training. So with the non-default parameter dbow_words=1, gensim Doc2Vec allows this style of simultaneous word- and DBOW-vector training.
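As a rough sketch of the two configurations (toy corpus and hyperparameter values are only illustrative):

    from gensim.models.doc2vec import Doc2Vec, TaggedDocument

    corpus = [
        TaggedDocument(words=["the", "cat", "sat", "on", "the", "mat"], tags=["DOC_0"]),
        TaggedDocument(words=["the", "dog", "barked", "at", "the", "cat"], tags=["DOC_1"]),
    ]

    # Pure PV-DBOW as in the original paper: only doc-vectors are trained;
    # the (input/context) word-vectors stay at their random initialisation.
    pure_dbow = Doc2Vec(corpus, dm=0, dbow_words=0, vector_size=50, min_count=1, epochs=20)

    # PV-DBOW with interleaved skip-gram word training, as in the follow-up paper.
    dbow_with_words = Doc2Vec(corpus, dm=0, dbow_words=1, window=3,
                              vector_size=50, min_count=1, epochs=20)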

Make42 commented 6 years ago

@gojomo - Thanks for the clarifications. To put this into ML / math lingo: all the tags get one layer, all the words get one layer, and these two layers are aggregated after the hidden layer. Training pairs for words are Word -> C1, Word -> C2, Word -> C3 instead of one Word -> {C1, C2, C3, C4}, as established in https://github.com/RaRe-Technologies/gensim/issues/1920.

Thank you for the explanation regarding train_words and for pointing to the follow-up paper. Assuming train_words=False, there is still something strange to me: train_sg_pair expects word and context_index. From predict_word = model.wv.vocab[word]  # target word (NN output) I gather that word is actually the target, i.e. the context word for Skip-Gram, while context_index is the input, i.e. the central word for Skip-Gram. (They might be indexes into dictionaries rather than the words themselves, but I am interested in the principle here, not the implementation details.) This would make sense when I compare the code for Doc2Vec as well as for Word2Vec. If I got this correctly, the method signature would be a bit misleading, and might be a point for improving the readability of the code :-).

gojomo commented 6 years ago

You can certainly think of the mappings from single words (or single tags) to dense vectors as a "projection layer" – but in terms of implementation, it's still a discrete lookup-by-key-and-index (rather than forward-propagation via one-hot vector ops). And in the modes like PV-DM where doc-vec & word-vectors mix, they come from separate storage arrays but are essentially the same 'layer' of the NN (as they combine equally, simultaneously, the same number of hops back from the outputs – and sequentially 'before' the hidden-layer, to set its value, rather than 'after'.)
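A tiny illustration of that equivalence, with made-up dimensions:

    import numpy as np

    vocab_size, dim = 5, 3
    W = np.random.default_rng(0).normal(size=(vocab_size, dim))  # 'projection layer' weights

    i = 2                              # index of some word or tag
    one_hot = np.zeros(vocab_size)
    one_hot[i] = 1.0

    # Forward-propagating a one-hot vector through W is exactly a row lookup,
    # which is why the implementation can skip the matrix product entirely.
    assert np.allclose(one_hot @ W, W[i])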

As mentioned in my 1st comment on your skip-gram question #1920, whether you interpret skip-gram as predicting the central-word from each context-word, or each context-word from the central-word, results in the exact same set of (word->word) training pairs overall, just in a slightly different order. And, the original paper/word2vec.c authors switched their sense of direction-of-prediction, apparently for slight performance reasons, between the paper write-up and their word2vec.c release. So any attachment to the idea that the 'context' word, in skip-gram, is specifically always the NN input, or the NN target, is unnecessary - either way works, and the creators of the technique & reference implementation treated them interchangeably.
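A quick way to convince yourself of that, with a toy sentence and no subsampling or random window shrinking (a sketch, not gensim code):

    from collections import Counter

    sentence = ["the", "cat", "sat", "on", "the", "mat"]
    window = 2

    def training_pairs(center_is_input):
        """Enumerate (input_word, predicted_word) pairs for one prediction direction."""
        pairs = []
        for pos, center in enumerate(sentence):
            for off in range(-window, window + 1):
                ctx_pos = pos + off
                if off == 0 or ctx_pos < 0 or ctx_pos >= len(sentence):
                    continue
                context = sentence[ctx_pos]
                pairs.append((center, context) if center_is_input else (context, center))
        return pairs

    # The multiset of word->word examples is identical either way;
    # only the order in which they are visited differs.
    assert Counter(training_pairs(True)) == Counter(training_pairs(False))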

Make42 commented 6 years ago

Regarding your first paragraph: to clarify what I meant by "layer", look at Figure 2 in https://arxiv.org/pdf/1405.4053.pdf. There, ParagraphID, "the", "cat", and "sat" each get one layer - so by "layer" I mean a matrix multiplication / lookup-by-key-and-index. Because of how the learning works, the gensim implementation would not train "the", "cat", "sat" against "on" at the same time, but one after the other against "on", so only one layer (in the sense I used the word here) is required. This is the one lookup table. ParagraphID would still get its own layer / matrix multiplication / lookup-by-key-and-index. In any case, I understand now how it works :+1:

Regarding your second paragraph: point taken. Once you train with word->word pairs, instead of what was originally proposed by the paper, there is not much of a difference between context word and central word. But what, then, is the difference between Skip-Gram and CBOW in the first place?

gojomo commented 6 years ago

Skip-gram provides (word->word) training examples to the NN, CBOW provides (mean-of-words-in-window->word) examples. Thus one backprop updates one word in skip-gram, but all words-in-window in CBOW.
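Schematically, with made-up vectors, just to show the shape of the examples:

    import numpy as np

    rng = np.random.default_rng(0)
    vectors = {w: rng.normal(size=4) for w in ["the", "cat", "sat", "on", "mat"]}

    center = "sat"
    context = ["the", "cat", "on", "mat"]

    # Skip-gram: four separate (input-vector -> target-word) examples,
    # so each backprop nudges exactly one context word's vector.
    skip_gram_examples = [(vectors[w], center) for w in context]

    # CBOW: one (mean-of-window-vectors -> target-word) example,
    # so the single backprop error is fanned out to all four vectors.
    cbow_example = (np.mean([vectors[w] for w in context], axis=0), center)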