thanks @andrewyates
but how do I generate the similarity matrix from scratch?
You need to preprocess your queries and document collection, and then compute the cosine similarity between the embeddings for every query term q_i and document term d_j. Each matrix should be of size |q| by |d|, where element `e_{i,j} = cos(q_i, d_j)`.
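A minimal sketch of that formula, assuming the per-term embeddings are already available as NumPy vectors (the function and variable names here are mine, not from this repo):

```python
import numpy as np

def cosine(u, v):
    """cos(u, v); returns 0.0 if either vector is all zeros."""
    nu, nv = np.linalg.norm(u), np.linalg.norm(v)
    return float(u @ v / (nu * nv)) if nu and nv else 0.0

def simmat(query_vecs, doc_vecs):
    """|q| x |d| matrix with e_{i,j} = cos(q_i, d_j)."""
    return np.array([[cosine(q, d) for d in doc_vecs] for q in query_vecs])
```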
Is there a concrete command to do that?
No, we haven't gotten that code ready to release yet. It also depends on the format of the corpus you're using. @khui
@loveJasmine Thank you for asking.
Preparing the similarity matrices involves the following steps:
1) get the content of the query q and the document d as term sequences;
2) get or train word embeddings for the terms that appear in the query and the document;
3) calculate the cosine similarity between every query term and every document term, ending up with a similarity matrix of shape |q| x |d| (see the sketch below).
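One way to put steps 1–3 together, assuming pretrained word2vec-style vectors loaded with gensim (the embedding file name, the whitespace tokenization, and the zero-vector handling of out-of-vocabulary terms are all my own assumptions):

```python
import numpy as np
from gensim.models import KeyedVectors

# Hypothetical path; any word2vec-format embedding file works here.
vectors = KeyedVectors.load_word2vec_format("GoogleNews-vectors-negative300.bin", binary=True)

def embed(terms, dim=300):
    """One embedding per term; out-of-vocabulary terms get a zero vector."""
    return np.stack([vectors[t] if t in vectors else np.zeros(dim) for t in terms])

def simmat(query_terms, doc_terms):
    """|q| x |d| matrix of cosine similarities between query and document terms."""
    q, d = embed(query_terms), embed(doc_terms)
    qn = q / np.maximum(np.linalg.norm(q, axis=1, keepdims=True), 1e-8)
    dn = d / np.maximum(np.linalg.norm(d, axis=1, keepdims=True), 1e-8)
    return qn @ dn.T

# Example usage with naive whitespace tokenization:
# m = simmat("hubble telescope achievements".split(), doc_text.split())
```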
The calculation of the similarity matrix itself is trivial. However, when dealing with a huge number of query-document pairs, efficiency becomes a concern, and one may want to parallelize the computation. How best to do that depends heavily on the available infrastructure, e.g., map-reduce or Spark. Therefore, we did not publish the code for this part and leave it to our users.
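For example, one simple single-machine option (not the authors' setup, which they leave open; a map-reduce or Spark job would follow the same per-pair pattern) is Python's multiprocessing:

```python
from multiprocessing import Pool

# simmat is the sketch above; in a real script it would live in an importable module.

def build_one(pair):
    """Compute (qid, docid, similarity matrix) for one query-document pair."""
    qid, query_terms, docid, doc_terms = pair
    return qid, docid, simmat(query_terms, doc_terms)

def build_all(pairs, workers=8):
    """Compute similarity matrices for many query-document pairs in parallel."""
    with Pool(workers) as pool:
        return pool.map(build_one, pairs)
```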
See the Getting Started section of the README. You need to download the similarity matrices from that link, configure the paths as described under Usage, and then run `bash bin/train_model.sh`.