thanks @andrewyates
but how do I generate the similarity matrix from scratch?
You need to preprocess your queries and document collection, and then compute the cosine similarity between the embeddings for every query term q_i and document term d_j. Each matrix should be of size |q| by |d|, where element `e_{i,j} = cos(q_i, d_j)`.
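A minimal sketch of that formula, assuming the per-term embeddings are already available as NumPy vectors (the function and variable names here are mine, not from this repo):

```python
import numpy as np

def cosine(u, v):
    """cos(u, v); returns 0.0 if either vector is all zeros."""
    nu, nv = np.linalg.norm(u), np.linalg.norm(v)
    return float(u @ v / (nu * nv)) if nu and nv else 0.0

def simmat(query_vecs, doc_vecs):
    """|q| x |d| matrix with e_{i,j} = cos(q_i, d_j)."""
    return np.array([[cosine(q, d) for d in doc_vecs] for q in query_vecs])
```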
Is there a concrete command to do that?
No, we haven't gotten that code ready to release yet. It also depends on the format of the corpus you're using. @khui
@loveJasmine Thank you for asking.
Preparing the similarity matrices involves the following steps:
1) get the content of the query q and the document d as term sequences;
2) get or train word embeddings for the terms that appear in the query and the document;
3) calculate the cosine similarity between every query term and every document term, ending up with a similarity matrix of shape |q| x |d| (see the sketch below).
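One way to put steps 1–3 together, assuming pretrained word2vec-style vectors loaded with gensim (the embedding file name, the whitespace tokenization, and the zero-vector handling of out-of-vocabulary terms are all my own assumptions):

```python
import numpy as np
from gensim.models import KeyedVectors

# Hypothetical path; any word2vec-format embedding file works here.
vectors = KeyedVectors.load_word2vec_format("GoogleNews-vectors-negative300.bin", binary=True)

def embed(terms, dim=300):
    """One embedding per term; out-of-vocabulary terms get a zero vector."""
    return np.stack([vectors[t] if t in vectors else np.zeros(dim) for t in terms])

def simmat(query_terms, doc_terms):
    """|q| x |d| matrix of cosine similarities between query and document terms."""
    q, d = embed(query_terms), embed(doc_terms)
    qn = q / np.maximum(np.linalg.norm(q, axis=1, keepdims=True), 1e-8)
    dn = d / np.maximum(np.linalg.norm(d, axis=1, keepdims=True), 1e-8)
    return qn @ dn.T

# Example usage with naive whitespace tokenization:
# m = simmat("hubble telescope achievements".split(), doc_text.split())
```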
The calculation of the similarity matrix itself is trivial. However, when dealing with a huge number of query-document pairs, efficiency becomes a concern, and one may want to parallelize the computation. How best to do that depends heavily on the available infrastructure, e.g., map-reduce or Spark. Therefore, we did not publish the code for this part and leave it to our users.
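For example, one simple single-machine option (not the authors' setup, which they leave open; a map-reduce or Spark job would follow the same per-pair pattern) is Python's multiprocessing:

```python
from multiprocessing import Pool

# simmat is the sketch above; in a real script it would live in an importable module.

def build_one(pair):
    """Compute (qid, docid, similarity matrix) for one query-document pair."""
    qid, query_terms, docid, doc_terms = pair
    return qid, docid, simmat(query_terms, doc_terms)

def build_all(pairs, workers=8):
    """Compute similarity matrices for many query-document pairs in parallel."""
    with Pool(workers) as pool:
        return pool.map(build_one, pairs)
```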
See the Getting Started section of the README. You need to download the similarity matrices from that link, configure the paths as described under Usage, and then run `bash bin/train_model.sh`.