khui / copacrr

The code for the COPACRR neural IR model.
Apache License 2.0

Code for similarity matrix? #8

Open Laksh47 opened 6 years ago

Laksh47 commented 6 years ago

Hi, @andrewyates @khui

I understood from the paper and previous issue comments that the way the similarity matrices are calculated differs based on the input corpus. But will the code for computing the similarity matrices (for PACRR) be made publicly available?

Thanks, Laksh

andrewyates commented 6 years ago

Hi Laksh,

Sorry for the slow response. I would suggest using something like this script, since it's less tied to intermediate data than the sim mat code I have: https://github.com/JoaoLages/TREC_WebTrack/blob/master/utils/construct_embedding_matrix.py

I think it only needs to be modified to calculate/output the docs' similarity matrices, since it stops after creating an embedding matrix from the vocabulary. Let me know if you run into trouble though, and I'll clean up something to release.

Andrew
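
For readers following this suggestion, here is a minimal sketch of the missing step: turning an embedding matrix into a query-document cosine similarity matrix of shape |q| × |d|. It is not the authors' released code. The function name `cosine_sim_matrix`, the toy embedding matrix, and the assumption that queries and documents are already mapped to vocabulary row indices are all hypothetical.

```python
import numpy as np


def cosine_sim_matrix(query_ids, doc_ids, emb):
    """Return a |q| x |d| matrix of cosine similarities between term embeddings.

    query_ids / doc_ids: sequences of vocabulary indices (assumed to be rows of emb).
    emb: 2-D array of shape (vocab_size, dim), e.g. the embedding matrix
         produced from the vocabulary.
    """
    q = emb[np.asarray(query_ids)]  # (|q|, dim)
    d = emb[np.asarray(doc_ids)]    # (|d|, dim)

    # L2-normalise each row; the floor avoids division by zero for
    # all-zero rows (e.g. padding or out-of-vocabulary terms).
    q = q / np.maximum(np.linalg.norm(q, axis=1, keepdims=True), 1e-12)
    d = d / np.maximum(np.linalg.norm(d, axis=1, keepdims=True), 1e-12)

    # The dot product of unit vectors is their cosine similarity.
    return q @ d.T


# Toy example: a 4-term vocabulary with 2-dimensional embeddings.
emb = np.array([[1.0, 0.0],
                [0.0, 1.0],
                [1.0, 1.0],
                [0.0, 0.0]])
sim = cosine_sim_matrix([0, 2], [0, 1, 2, 3], emb)
print(sim.shape)  # (2, 4)
```

Each row of `sim` corresponds to one query term and each column to one document term, so identical terms score 1 and all-zero rows score 0 against everything.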

Laksh47 commented 6 years ago

Hi @andrewyates,

Thanks a lot! I will make use of this and let you know if I need any clarification.

~Laksh

Laksh47 commented 6 years ago

Hi @andrewyates, @khui

Clarification regarding the difference between desc_doc_mat and topic_doc_mat in simmat.tar.gz

The similarity matrices archive has three folders inside:

  • query_idf
  • desc_doc_mat
  • topic_doc_mat

Could you please explain the difference here? @andrewyates @khui

If we are calculating cosine similarity matrices between queries (topic) and documents (description), shouldn't there be only one folder holding the matrices (each matrix with shape |q| × |d|)?

Or am I misinterpreting something here? Please clarify!

Thanks, Laksh

andrewyates commented 6 years ago

Hi @Laksh47,

The topic* files are the similarity matrices between each TREC topic (short query) and each document. The desc* files are between each TREC description (longer query) and each document. You can concatenate them or use either depending on the query type you want to use.

Andrew
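
One way to read "concatenate them" is to stack the two matrices for the same document along the query axis. The sketch below assumes that interpretation, and assumes both matrices are already loaded as NumPy arrays of shape (query length, document length). The on-disk format of the files in simmat.tar.gz is not stated in this thread, and the helper name `concat_query_views` is hypothetical.

```python
import numpy as np


def concat_query_views(topic_mat, desc_mat):
    """Stack a topic-doc and a description-doc similarity matrix row-wise.

    Both inputs are assumed to describe the same document, so they must
    have the same number of columns (document terms).
    """
    if topic_mat.shape[1] != desc_mat.shape[1]:
        raise ValueError("matrices must cover the same document length")
    # Rows = topic terms followed by description terms; columns = document terms.
    return np.concatenate([topic_mat, desc_mat], axis=0)


# Toy example: a 3-term topic and an 8-term description against a 5-term document.
topic_mat = np.ones((3, 5))
desc_mat = np.zeros((8, 5))
combined = concat_query_views(topic_mat, desc_mat)
print(combined.shape)  # (11, 5)
```

To use only one query type, pass `topic_mat` or `desc_mat` to the model on its own instead of `combined`.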

CelineChen95 commented 3 years ago

Hi, the link to the similarity matrices is broken. Do you have anything you could share with me?