naver / splade

SPLADE: sparse neural search (SIGIR21, SIGIR22)

Indexing a document corpus with Efficient SPLADE #41

Closed · saarthaks closed this 12 months ago

saarthaks commented 1 year ago

What is the process for indexing MS MARCO using Efficient SPLADE?

I see a Dropbox link to download a pre-built index for MS MARCO, and a command to use PISA's query evaluation to retrieve from that index. However, I'd like to reproduce the indexing stage for this and other IR datasets.

DRRV commented 1 year ago

Have a look here: https://github.com/naver/splade/tree/main#evaluating-a-pre-trained-model

saarthaks commented 12 months ago

Thanks for the pointer, but I'm not sure that addresses my question completely. This response to a previously closed issue gets closer to the heart of what I'm asking, which is how to create a PISA index for an arbitrary document/query corpus that is encoded into sparse vectors with a pre-trained efficient-SPLADE model.

The previous approach outlines a method to generate this PISA index by first creating an Anserini index from the SPLADE model's sparse vectors, exporting it to CIFF, converting the CIFF file to the PISA format, building the PISA index, and then mapping the queries to the expected format. Is this still the most direct way to create the PISA index with a pre-trained SPLADE model?

If so, I've run into an intermediate issue with the previous approach. How are the docs_anserini.jsonl and queries_anserini.tsv files used to create the Anserini index? The linked Anserini regression process does not explain how to ingest those files via its target/appassembler/bin/IndexCollection command; instead, it specifies a downloadable version of the MS MARCO Passage corpus that has already been processed with DistilSPLADE-max. As a result, it seems to ignore the dataset and the SPLADE model that were used to create docs_anserini.jsonl and queries_anserini.tsv.
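For reference, the docs_anserini.jsonl in that approach is assumed to be in Anserini's JsonVectorCollection format: one JSON object per line with an "id", an empty "contents" field, and a "vector" mapping tokens to integer impact weights. A minimal sketch of how such a file might be produced with a pre-trained efficient-SPLADE document encoder is below; the Hugging Face model name, the quantization scale of 100, and the corpus/output paths are illustrative assumptions, not anything prescribed by the repo.

```python
# Sketch: encode a corpus with an efficient-SPLADE document encoder into
# Anserini's JsonVectorCollection format. Model name, quantization scale,
# and file paths are assumptions for illustration only.
import json
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

MODEL = "naver/efficient-splade-V-large-doc"  # assumed document-side encoder
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForMaskedLM.from_pretrained(MODEL).eval()
id_to_token = {v: k for k, v in tokenizer.get_vocab().items()}

def splade_vector(text: str, scale: int = 100) -> dict:
    """Encode one passage into {token: integer impact weight}."""
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=256)
    with torch.no_grad():
        logits = model(**inputs).logits                      # (1, seq_len, vocab_size)
    # SPLADE pooling: max over positions of log(1 + ReLU(logit)), masked by attention.
    weights = torch.max(
        torch.log1p(torch.relu(logits)) * inputs["attention_mask"].unsqueeze(-1),
        dim=1,
    ).values.squeeze(0)                                      # (vocab_size,)
    vector = {}
    for idx in torch.nonzero(weights, as_tuple=True)[0]:
        impact = int(round(weights[idx].item() * scale))     # quantize for impact indexing
        if impact > 0:
            vector[id_to_token[idx.item()]] = impact
    return vector

# corpus.tsv is assumed to hold "docid<TAB>passage text" lines (e.g. MS MARCO passages).
with open("corpus.tsv") as fin, open("docs_anserini.jsonl", "w") as fout:
    for line in fin:
        docid, text = line.rstrip("\n").split("\t", 1)
        fout.write(json.dumps({"id": docid, "contents": "", "vector": splade_vector(text)}) + "\n")
```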

cadurosar commented 12 months ago

Hi Saarthak,

> Thanks for the pointer, but I'm not sure that addresses my question completely. This response to a previously closed issue gets closer to the heart of what I'm asking, which is how to create a PISA index for an arbitrary document/query corpus that is encoded into sparse vectors with a pre-trained efficient-SPLADE model.

> The previous approach outlines a method to generate this PISA index by first creating an Anserini index from the SPLADE model's sparse vectors, exporting it to CIFF, converting the CIFF file to the PISA format, building the PISA index, and then mapping the queries to the expected format. Is this still the most direct way to create the PISA index with a pre-trained SPLADE model?

On our side, that is pretty much still the most direct route. You can also look into https://github.com/terrierteam/pyterrier_pisa/tree/main, which builds a PISA index directly through PyTerrier and also lets you query that index; there is an example of using SPLADE at the very end of its README. We are still looking into integrating it, or even making it the default, but we also need to make sure it does not create a circular dependency (it would lead to: SPLADE depends on Pyterrier_Pisa, which depends on Pyterrier_SPLADE, which depends on SPLADE).
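For what it's worth, that pyterrier_pisa route looks roughly like the sketch below, modelled on the SPLADE example at the end of its README. The pyt_splade API (SpladeFactory, indexing(), query()) and the PisaIndex calls (toks_indexer(), quantized()) are assumptions from that example and may differ between versions, so treat this as illustrative rather than tested.

```python
# Rough sketch of the pyterrier_pisa route, following its README's SPLADE example.
# The pyt_splade and PisaIndex method names below are assumptions and may have
# changed between versions.
import pyterrier as pt
if not pt.started():
    pt.init()

from pyterrier_pisa import PisaIndex
import pyt_splade

splade = pyt_splade.SpladeFactory()                       # defaults to a SPLADE checkpoint
index = PisaIndex("./msmarco-passage-splade", stemmer="none")

# Indexing: encode documents into sparse token->weight dicts, then build the PISA index.
indexing_pipeline = splade.indexing() >> index.toks_indexer()
indexing_pipeline.index(pt.get_dataset("irds:msmarco-passage").get_corpus_iter())

# Retrieval: encode queries the same way and search the quantized impact index.
retrieval_pipeline = splade.query() >> index.quantized()
results = retrieval_pipeline.search("what is the capital of france")
```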

> If so, I've run into an intermediate issue with the previous approach. How are the docs_anserini.jsonl and queries_anserini.tsv files used to create the Anserini index? The linked Anserini regression process does not explain how to ingest those files via its target/appassembler/bin/IndexCollection command; instead, it specifies a downloadable version of the MS MARCO Passage corpus that has already been processed with DistilSPLADE-max. As a result, it seems to ignore the dataset and the SPLADE model that were used to create docs_anserini.jsonl and queries_anserini.tsv.

Ok, so for IndexCollection you just pass the path to the folder containing your docs_anserini.jsonl (which is your corpus) to the -input parameter, i.e. substitute that folder's path for -input /path/to/msmarco-passage-distill-splade-max. Then, for the retrieval portion, you point it at your queries_anserini.tsv, i.e. substitute its path for -topics tools/topics-and-qrels/topics.msmarco-passage.dev-subset.distill-splade-max.tsv.gz.
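On the query side, a hedged sketch of producing queries_anserini.tsv is below, under the assumption that it follows the same convention as the distill-splade-max topics file: one qid<TAB>tokens line per query, with each token repeated in proportion to its quantized weight so that impact-based retrieval over pretokenized queries can score it. The query-side model name and the scale of 100 are illustrative assumptions.

```python
# Sketch: encode queries with a query-side efficient-SPLADE model and write
# queries_anserini.tsv, repeating each token by its quantized weight.
# Model name, scale, and paths are assumptions for illustration only.
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

MODEL = "naver/efficient-splade-V-large-query"  # assumed query-side encoder
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForMaskedLM.from_pretrained(MODEL).eval()
id_to_token = {v: k for k, v in tokenizer.get_vocab().items()}

def expanded_query(text: str, scale: int = 100) -> str:
    """Return the query as a bag of tokens, each repeated by its quantized weight."""
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=256)
    with torch.no_grad():
        logits = model(**inputs).logits
    weights = torch.max(
        torch.log1p(torch.relu(logits)) * inputs["attention_mask"].unsqueeze(-1),
        dim=1,
    ).values.squeeze(0)
    tokens = []
    for idx in torch.nonzero(weights, as_tuple=True)[0]:
        reps = int(round(weights[idx].item() * scale))
        tokens.extend([id_to_token[idx.item()]] * reps)
    return " ".join(tokens)

# queries.tsv is assumed to hold "qid<TAB>query text" lines.
with open("queries.tsv") as fin, open("queries_anserini.tsv", "w") as fout:
    for line in fin:
        qid, text = line.rstrip("\n").split("\t", 1)
        fout.write(f"{qid}\t{expanded_query(text)}\n")
```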

Hope this helps, but feel free to ask for more clarification.

saarthaks commented 12 months ago

Hi Carlos,

That's very helpful, thank you! That worked perfectly!