Open Laksh47 opened 6 years ago
Hi Laksh,
Sorry for the slow response. I would suggest using something like this script, since it's less tied to intermediate data than the sim mat code I have: https://github.com/JoaoLages/TREC_WebTrack/blob/master/utils/construct_embedding_matrix.py
I think it only needs to be modified to calculate/output the docs' similarity matrices, since it stops after creating an embedding matrix from the vocabulary. Let me know if you run into trouble though, and I'll clean up something to release.
Andrew
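For anyone following along, here is a minimal sketch of the missing step described above: going from an embedding matrix to a query-document similarity matrix. It is not the PACRR authors' code and not part of the linked script. The function name `cosine_sim_matrix`, the toy vocabulary, and the random embeddings are all hypothetical, and it assumes the embedding matrix has one row per vocabulary term.

```python
import numpy as np


def cosine_sim_matrix(query_ids, doc_ids, emb):
    """Return a |q| x |d| matrix of cosine similarities between term embeddings.

    query_ids / doc_ids are lists of row indices into emb (hypothetical layout:
    one embedding row per vocabulary term).
    """
    q = emb[query_ids]
    d = emb[doc_ids]
    # L2-normalise each row; the small floor avoids division by zero
    # for all-zero (e.g. out-of-vocabulary) vectors.
    q = q / np.maximum(np.linalg.norm(q, axis=1, keepdims=True), 1e-12)
    d = d / np.maximum(np.linalg.norm(d, axis=1, keepdims=True), 1e-12)
    return q @ d.T


# Toy data, purely illustrative: a 5-term vocabulary with random 8-d embeddings.
vocab = {"cheap": 0, "flights": 1, "to": 2, "paris": 3, "airline": 4}
rng = np.random.default_rng(0)
emb = rng.normal(size=(len(vocab), 8))

query = ["cheap", "flights"]
doc = ["airline", "flights", "to", "paris"]
sim = cosine_sim_matrix([vocab[t] for t in query], [vocab[t] for t in doc], emb)
print(sim.shape)  # (2, 4): one row per query term, one column per document term
```

A real pipeline would load pretrained embeddings and tokenised documents instead of the toy data, and would write one matrix per query-document pair.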
Hi @andrewyates ,
Thanks a lot! I will make use of this and let you know if I need any clarification.
~Laksh
Hi @andrewyates, @khui
I'd like a clarification regarding the difference between desc_doc_mat and topic_doc_mat in simmat.tar.gz.
The similarity matrices archive has three folders inside:
- query_idf
- desc_doc_mat
- topic_doc_mat
Could you please explain the difference here?
If we are calculating cosine similarity matrices between queries (topic) and documents (description), shouldn't there be only one folder holding the matrices (each matrix with shape |q| X |d|)?
Or am I misinterpreting something here? Please clarify!
Thanks, Laksh
Hi @Laksh47,
The topic* files are the similarity matrices between each TREC topic (short query) and each document. The desc* files are between each TREC description (longer query) and each document. You can concatenate them or use either depending on the query type you want to use.
Andrew
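As a rough illustration of the two query types discussed in this thread (topic and description matrices for the same document), one possible way to combine them is to stack them along the query-term axis. This is only a sketch under assumptions: the on-disk format of simmat.tar.gz is not described here, the helper name `concat_query_views` is hypothetical, and it assumes each matrix has query terms on rows and document terms on columns.

```python
import numpy as np


def concat_query_views(topic_doc_mat, desc_doc_mat):
    """Stack a topic-doc and a description-doc similarity matrix row-wise.

    Assumes both matrices were computed against the same document, so they
    must have the same number of columns (document terms).
    """
    if topic_doc_mat.shape[1] != desc_doc_mat.shape[1]:
        raise ValueError("matrices must cover the same document terms")
    return np.concatenate([topic_doc_mat, desc_doc_mat], axis=0)


# Toy shapes, purely illustrative: a 3-term topic and a 12-term description
# compared against the same 50-term document.
topic = np.zeros((3, 50))
desc = np.ones((12, 50))
combined = concat_query_views(topic, desc)
print(combined.shape)  # (15, 50)
```

Using only one of the two matrices corresponds to picking a single query type.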
Hi, the link to the similarity matrices is broken. Do you have anything to share with me?
Hi, @andrewyates @khui
I understood from the paper and previous issue comments that the calculation of the similarity matrices differs based on the input corpus. But will the code for computing similarity matrices (for PACRR) be made publicly available?
Thanks, Laksh