oriyor / ret-robust

Implementation of the paper: "Making Retrieval-Augmented Language Models Robust to Irrelevant Context"
MIT License

ColBert Retrieval #3

Open kerkathy opened 6 months ago

kerkathy commented 6 months ago

Hello

Thank you again for your generous code release 😊 I followed your instructions here and downloaded the cached data. I would now like to run the ColBERT experiment, and I have a few basic questions :P

  1. Is the cached retrieval data the @1, @10, or random retrieval results? I noticed that there are many json files in the unzipped cached file. Does each json file correspond to the retrieval result to a question? Is the json file named with its question id or anything meaningful?
  2. If I would like to run the ColBERT experiment, do I only need to change the "retriever" field in config file from serp into sth like colbert?
  3. If I want to try out some new retrieval method and result, should I retrieve by myself somewhere else beforehand and probably generate a folder like the cached dir, store all retrieval results there, then change the decomposition.main_retriever_dir field in the config to be pointing to the new folder which contains my new retrieval result?

p.s. just a friendly reminder of a small typo at the bottom of the page: At the end of the sentence should it be randomize_retrieval and retrieve_at_10 instead of randomize_retrieval and andomize_retrieval? _... RetRobust experiments require configuring the following fields: ... To run a single setting, use the randomize_retrieval and retrieve_at_10 fields._

Tks again!

oriyor commented 5 months ago

Hi, thanks for the questions!

  1. Each cache file indeed corresponds to a question; the file name is simply a hash of the question (see the get_string_hash method). For @1 retrieval, set both the randomize_retrieval and retrieve_at_10 flags to false. For @10 retrieval, set retrieve_at_10 to true; this will cause the model to use the lower-ranked retrieval result. For random retrieval, set randomize_retrieval to true, and run_output_dir should be the dir with the retrieval results you want to randomly sample from. To run all three settings in one run, set the settings field to ["reg", "random", "@10"], and this will update those fields on the fly (see lines 554-571 in run.py). I hope this makes sense :)
  2. Yes, simply change the retriever to colbert! This will cause the retriever to use the get_question_wiki_snippet_colbert method. I hope the ColBERT server I was using is still available, otherwise plz let me know if there are any problems (you can always use SerpAPI, the first calls are free).
  3. Sure, that could work! You can take a look at lines 181-189 in serpapi.py to see how we currently differentiate between colbert and serp; it simply requires updating the methods to call the retriever and cache the results.
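
To illustrate point 1, here is a minimal sketch of how a settings list could be expanded into the two flags. The names SETTING_FLAGS and expand_settings are hypothetical and are not taken from run.py; only the setting names and the two flag names come from the answer above.

```python
from copy import deepcopy

# Hypothetical table: which flags each setting name would switch on.
SETTING_FLAGS = {
    "reg": {"randomize_retrieval": False, "retrieve_at_10": False},    # @1 retrieval
    "random": {"randomize_retrieval": True, "retrieve_at_10": False},  # random retrieval
    "@10": {"randomize_retrieval": False, "retrieve_at_10": True},     # lower-ranked result
}


def expand_settings(base_config, settings):
    """Return one config dict per setting, with the two flags overridden."""
    configs = []
    for name in settings:
        if name not in SETTING_FLAGS:
            raise ValueError(f"unknown setting: {name}")
        cfg = deepcopy(base_config)  # leave the caller's config untouched
        cfg.update(SETTING_FLAGS[name])
        configs.append(cfg)
    return configs


if __name__ == "__main__":
    base = {"retriever": "colbert", "randomize_retrieval": False, "retrieve_at_10": False}
    for cfg in expand_settings(base, ["reg", "random", "@10"]):
        print(cfg)
```

Running a single setting would then amount to setting the two flags by hand, as described above.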

Thx so much for catching the typo! And plz let me know if there are still issues :)
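
For anyone trying point 3 with their own retriever, below is a rough sketch of a per-question JSON cache keyed by a hash of the question. It is an assumption-laden illustration: question_cache_path and cached_retrieve are hypothetical helpers, and SHA-256 is only a stand-in, since the hash used by get_string_hash is not shown in this thread. Files written this way will only line up with the released cache if the hash and file naming match the repo's.

```python
import hashlib
import json
import os


def question_cache_path(cache_dir, question):
    """Build the cache file path for a question (SHA-256 is an assumed stand-in)."""
    digest = hashlib.sha256(question.encode("utf-8")).hexdigest()
    return os.path.join(cache_dir, f"{digest}.json")


def cached_retrieve(cache_dir, question, retrieve_fn):
    """Return cached results if present; otherwise call retrieve_fn and cache them."""
    path = question_cache_path(cache_dir, question)
    if os.path.exists(path):
        with open(path, encoding="utf-8") as f:
            return json.load(f)
    results = retrieve_fn(question)  # retrieve_fn must return JSON-serializable data
    os.makedirs(cache_dir, exist_ok=True)
    with open(path, "w", encoding="utf-8") as f:
        json.dump(results, f)
    return results
```

Here retrieve_fn is whatever callable wraps the new retrieval method; the second call for the same question reads from disk instead of hitting the retriever again.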