Implement a code vector for each document (i.e., each factor in this case) to be learned and concatenated with the codes for each word when learning the embeddings.
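As a rough sketch of what that concatenation looks like, here is a minimal PV-DM-style forward pass: the document (factor) code is concatenated with the context word codes and scored against the vocabulary. All sizes, initializations, and names here are illustrative assumptions, not taken from the paper or any particular implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes (assumptions, not from the original paper).
n_docs, n_words, dim, window = 4, 100, 8, 3

# One learnable code per document (factor) and per word (k-mer).
doc_codes = rng.normal(scale=0.1, size=(n_docs, dim))
word_codes = rng.normal(scale=0.1, size=(n_words, dim))

# Output weights over the vocabulary; input is the concatenated codes.
W_out = rng.normal(scale=0.1, size=(n_words, dim * (window + 1)))

def predict_target(doc_id, context_ids):
    """PV-DM-style forward pass: concatenate the document code with the
    context word codes, then softmax over the vocabulary."""
    h = np.concatenate([doc_codes[doc_id]] + [word_codes[w] for w in context_ids])
    logits = W_out @ h
    e = np.exp(logits - logits.max())
    return e / e.sum()

probs = predict_target(doc_id=0, context_ids=[5, 17, 42])
print(probs.shape)  # (100,) — a distribution over target words
```

Training would then backpropagate the prediction loss into `W_out`, the word codes, and the document code jointly; the point of the sketch is just that the document code enters the model exactly like an extra context word.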
Surprisingly, there isn't an equation in the original paper describing how to do this, but there are other implementations. This one seems readable, and this page actually has derivations, which will help augment the model I'm currently working with.
Without some probabilistic interpretation which would allow for the decoding of a window without an associated document, this extension seems unlikely to be useful. But it should be informative as to how much separation I can get just by including factor information in the generation of the code words.
I might still be able to use the code-words learned in this way as some empirical Bayes-style prior in a more principled model.
While the document codes of a doc2vec-style model are available at training time, they are not at test time; we aren't trying to retrieve similar documents. Given a sequence (possibly just a subsequence), we're trying to decide which TF is most likely to be bound according to our model.
At training time, we can include a random variable which codes for a given TF, and learn the codes for each TF (though probably not as they are learned in doc2vec).
But at test time, we will not have the code for the factor(s) which have high affinity for the test sequences. Instead, given the codes for the k-mers in the sequence, we want to try inferring the factor(s) involved.
This will probably only work if the codes for k-mers are influenced to be closer to the codes for the factor exemplars.