naver / splade

SPLADE: sparse neural search (SIGIR21, SIGIR22)

Instructions on Using Pisa for Splade #24

Closed HansiZeng closed 1 year ago

HansiZeng commented 1 year ago

Firstly, thanks for your series of amazing papers and well-organized code implementations.

The two papers "Wacky Weights in Learned Sparse Representations and the Revenge of Score-at-a-Time Query Evaluation" and "From Distillation to Hard Negative Sampling: Making Sparse Neural IR Models More Effective" show that PISA makes query retrieval for SPLADE much faster than Anserini or the code in this repo.

The folder efficient_splade_pisa/ in the repo contains instructions for using PISA with SPLADE, but they only cover already-processed queries and indexes. If I only have a trained SPLADE model, how can I process its outputs (sparse vectors, or their quantized version for Anserini) to make them suitable for PISA? Could you provide more specific instructions?

Best wishes

cadurosar commented 1 year ago

Hello Zeng,

Thanks for your kind comments. Assuming that training was performed with the current code and that your config is set up for the index and retrieve parts, you can use the following command:

SPLADE_CONFIG_FULLPATH=/path/to/checkpoint/dir/config.yaml python3 -m splade.create_anserini +quantization_factor_document=100 +quantization_factor_query=100

This should generate two files: a document file (docs_anserini.jsonl) and a query file (queries_anserini.tsv). With those files you can follow this gist from Joel Mackenzie (author of Wacky Weights), which covers all the steps from indexing with Anserini to retrieving with PISA. Note that you should index with Anserini using the document file you just generated, and retrieve from Anserini/PISA with the query file. https://gist.github.com/JMMackenzie/49d7e837751501067cb16d9940d1ad67
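To illustrate what a quantization factor of 100 does to the sparse vectors, here is a minimal Python sketch. It is not the code of splade.create_anserini; the function names quantize and query_line are hypothetical, and the query layout (each term repeated as many times as its integer weight) is an assumption that should be checked against the generated queries_anserini.tsv.

```python
def quantize(weights, factor=100):
    """Scale float term weights to integer impacts; drop terms that round to zero."""
    quantized = {}
    for term, weight in weights.items():
        impact = int(round(weight * factor))
        if impact > 0:
            quantized[term] = impact
    return quantized


def query_line(qid, weights, factor=100):
    """Build one TSV line: query id, a tab, then each term repeated `impact` times.

    The repeated-term layout is an assumption, not a confirmed file format.
    """
    tokens = []
    for term, impact in quantize(weights, factor).items():
        tokens.extend([term] * impact)
    return f"{qid}\t{' '.join(tokens)}"


# Made-up weights, for illustration only.
example = {"sparse": 1.234, "search": 0.5, "the": 0.001}
print(quantize(example))
print(query_line("q1", {"a": 0.02, "b": 0.01}))
```

A larger factor keeps more precision in the integer weights at the cost of larger impact values in the index.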

Please let me know if this worked for you or if you had any trouble so that I can help you with this (and maybe add this to our documentation).

Best regards

paravecdesign commented 1 year ago

How can I pass data from a database to this? Something like:

$priorities = Priority::all()->pluck('name');

->selectFilter(key: 'language_code', label: 'Language', options: [ 'en' => 'English', 'nl' => 'Dutch', ])

cadurosar commented 1 year ago

Hi @paravecdesign, you need to pass your data through SPLADE to get the representation (a list of (word, score) pairs) for each entry of your database and then store this in Anserini. We have an example using MSMARCO with a tsv file, but it should be easy to adapt it to a database using Python. Unfortunately we do not have a SPLADE microservice, and I'm not sure if there's a way to use PyTorch with PHP.
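As a rough sketch of the database route described above, the following Python reads rows from a SQLite table and writes one JSON line per entry. Everything here is illustrative: the table entries(id, body), the encode function (a toy stand-in for a real SPLADE forward pass) and the record fields id/content/vector are assumptions, so compare the output with a docs_anserini.jsonl produced by the repo before indexing it.

```python
import json
import os
import sqlite3
import tempfile


def encode(text):
    """Hypothetical stand-in for SPLADE: returns {term: weight} for a text."""
    weights = {}
    for token in text.lower().split():
        weights[token] = weights.get(token, 0.0) + 0.5
    return weights


def export_rows(connection, out_path, factor=100):
    """Write one JSON record per database row; returns the number of rows written."""
    rows = connection.execute("SELECT id, body FROM entries ORDER BY id")
    count = 0
    with open(out_path, "w", encoding="utf-8") as out:
        for row_id, body in rows:
            scaled = {t: int(round(w * factor)) for t, w in encode(body).items()}
            vector = {t: v for t, v in scaled.items() if v > 0}
            # Field names are an assumption about the expected JSONL layout.
            record = {"id": str(row_id), "content": "", "vector": vector}
            out.write(json.dumps(record) + "\n")
            count += 1
    return count


# Demo with an in-memory database and made-up rows.
connection = sqlite3.connect(":memory:")
connection.execute("CREATE TABLE entries (id INTEGER PRIMARY KEY, body TEXT)")
connection.executemany(
    "INSERT INTO entries (id, body) VALUES (?, ?)",
    [(1, "sparse neural search"), (2, "neural neural ranking")],
)
out_path = os.path.join(tempfile.mkdtemp(), "docs_from_db.jsonl")
written = export_rows(connection, out_path)
with open(out_path, encoding="utf-8") as handle:
    records = [json.loads(line) for line in handle]
print(written, records)
```

In practice encode would be replaced by a call to the trained model, and the SQLite connection by whatever database driver holds the entries.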

@HansiZeng Did this help you? If yes can I close this issue?

cadurosar commented 1 year ago

I'm considering this solved; please feel free to reopen the issue if you are still having trouble with this.