Pipeline should work as follows:
1. Process each incoming document: create sentence-vector indices (see the indexing sketch after this list).
2. Store the indices so that they can be re-created if the process dies.
3. For each query: compute the query vector, find the k nearest matches irrespective of any threshold, and return the ranked result, which is a list of document ids with similarity scores (see the query sketch below).
4. Fetch the documents from ES and return them to the DIG UI.
5. If the user chooses a facet, add it as a filter on the list of documents for the query, re-rank the results, and return them to the DIG UI. So, if we originally had k documents, adding a facet will always return <= k documents; the facets act as a filter (see the facet sketch below).
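A minimal sketch of steps 1 and 2, assuming sentence-transformers for the sentence vectors and FAISS for the index; neither library, the model name, nor the file paths are specified above, so treat them as placeholders:

```python
# Sketch of steps 1-2: embed incoming documents and persist the index.
# sentence-transformers and FAISS are assumptions; the pipeline above does
# not name the embedding model or the index implementation.
import faiss
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # hypothetical model choice

def build_index(doc_ids, doc_texts, index_path="dig_sentence.index"):
    """Embed one text per document and add it to a flat inner-product index."""
    vectors = model.encode(doc_texts, normalize_embeddings=True)
    index = faiss.IndexFlatIP(vectors.shape[1])   # cosine similarity via normalized dot product
    index.add(np.asarray(vectors, dtype="float32"))
    # Persist the index and the row -> document id mapping so the process
    # can be restarted without re-embedding the corpus (step 2).
    faiss.write_index(index, index_path)
    np.save(index_path + ".ids.npy", np.asarray(doc_ids))
    return index
```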
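Step 3 could then look like the sketch below, under the same assumptions: the persisted index is reloaded, the query is embedded with the same model, and the top k hits are returned as (document id, similarity score) pairs with no score threshold applied.

```python
# Sketch of step 3: reload the persisted index, embed the query, and return
# the k nearest documents regardless of any score threshold.
import faiss
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # must match the indexing model

def search(query, k=20, index_path="dig_sentence.index"):
    index = faiss.read_index(index_path)          # re-create the index after a restart
    doc_ids = np.load(index_path + ".ids.npy")
    q = model.encode([query], normalize_embeddings=True).astype("float32")
    scores, rows = index.search(q, k)             # always top-k, no threshold
    # Ranked result: list of (document id, similarity score) pairs.
    return [(str(doc_ids[r]), float(s)) for r, s in zip(rows[0], scores[0]) if r != -1]
```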
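Steps 4 and 5 might be handled roughly as follows with the Elasticsearch Python client; the index name dig_documents, the connection details, and the single-valued facet field are illustrative assumptions, not part of the pipeline description:

```python
# Sketch of steps 4-5: fetch the ranked documents from ES, then filter and
# re-rank the same list when the user picks a facet.
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")       # placeholder connection details

def fetch_documents(ranked, es_index="dig_documents"):
    """ranked: list of (doc_id, score) pairs from the vector search (step 3)."""
    resp = es.mget(index=es_index, ids=[doc_id for doc_id, _ in ranked])
    found = {d["_id"]: d["_source"] for d in resp["docs"] if d.get("found")}
    return [(doc_id, score, found[doc_id]) for doc_id, score in ranked if doc_id in found]

def apply_facet(results, field, value):
    """Facets only narrow the original k results, so the output is always <= k documents."""
    kept = [r for r in results if r[2].get(field) == value]
    return sorted(kept, key=lambda r: r[1], reverse=True)  # re-rank by similarity score
```

Because apply_facet only ever filters the list already returned by fetch_documents, the <= k guarantee in step 5 falls out directly: facets never add new documents, they only narrow and re-rank the original k.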