mediachain / mediachain-indexer

search, dedupe, and media ingestion for mediachain

Use Doc2Vec embedding for semantic search #38

Open rht opened 7 years ago

rht commented 7 years ago

From what I have read of the code, the input images are transformed into (a hundred?) sentences based on https://arxiv.org/abs/1511.06361, which are then indexed by ES and queried through the ES nearest-neighbor (NN) path. There is a WIP implementation that uses AnnoyIndexer for space efficiency.

WDYT of using doc2vec for the vectorization instead of vanilla tf-idf? Doc2vec is a well-established post-word2vec method and should boost the precision/recall scores. Meanwhile, it will take a while for even word2vec to make it into stable ES (though it is already in Spark).

For reference, gensim's document-similarities module already uses Annoy: https://github.com/RaRe-Technologies/gensim/blob/699773a78f32ff2c254cf16a586a5d26a08394fe/gensim/similarities/index.py#L22.

@autoencoder ?

parkan commented 7 years ago

I believe the aesthetics and relevance vectors already use sparse NN lookups, including annoy. See here: https://github.com/mediachain/mediachain-indexer/blob/05e065c240fc95833db95ab386da7105adef798b/mediachain/indexer/mc_neighbors.py#L547

rht commented 7 years ago

(NNs) Only https://github.com/mediachain/mediachain-indexer/blob/05e065c240fc95833db95ab386da7105adef798b/mediachain/indexer/mc_neighbors.py#L625 has been fully implemented. Both the sparse NN lookup (pysparnn) and the dense NN lookup (annoy) are still WIP.

(models)

WDYT of using doc2vec for the vectorization instead of the vanilla tf-idf?

In https://arxiv.org/abs/1511.06361, a word-embedding method (just skip-thoughts) is only tested on the textual entailment benchmark, not on caption-image retrieval against the COCO dataset (the current use case of mediachain-indexer).

parkan commented 7 years ago

Looks like you're right; I guess we didn't quite get to that part. This part of the project is not a very high priority right now, so I'm not sure when we'll finish it, but we also welcome any thoughts from @autoencoder.

autoencoder commented 7 years ago

Hi @rht!

The current diagram is accurate: tf-idf is only used for broad candidate pre-filtering, and those candidates are then reranked via the relevance model. The relevance model uses an RNN and VGG16 to embed the text and images respectively, followed by a custom layer for each, trained on click-log data to embed texts close to the images they describe.

Sure, doc2vec could be a decent substitute for the RNN in the relevance model if you want to train on a small training set, are training on large text documents, or if word order doesn't particularly matter for your search task.

Yes, Annoy is being used for reverse lookups. It's not currently used for the reranking distance calculations; we use exact distance computations there instead, since the number of candidates at that stage is quite small.
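The exact reranking step is cheap at that scale; a sketch of what it amounts to, assuming the query and candidate embeddings are available as NumPy arrays (function name and shapes are illustrative):

```python
import numpy as np


def rerank(query_vec, candidate_vecs, top_k=10):
    """Exact cosine-similarity reranking of a small candidate set.

    query_vec: (d,) query embedding; candidate_vecs: (n, d) candidate embeddings.
    Returns indices of the top_k candidates, most similar first.
    """
    q = query_vec / np.linalg.norm(query_vec)
    c = candidate_vecs / np.linalg.norm(candidate_vecs, axis=1, keepdims=True)
    sims = c @ q
    return np.argsort(-sims)[:top_k]


rng = np.random.default_rng(0)
cands = rng.normal(size=(50, 64))
query = cands[7] + 0.01 * rng.normal(size=64)  # near-duplicate of candidate 7
order = rerank(query, cands, top_k=5)
print(order[0])  # 7
```

With only a few hundred candidates surviving pre-filtering, this brute-force pass is negligible next to the approximate index lookups.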

We did try a brief experiment using Annoy and the relevance embeddings for candidate pre-filtering instead of ES, but the results were mixed.