Source code of the paper: RetrievalQA: Assessing Adaptive Retrieval-Augmented Generation for Short-form Open-Domain Question Answering (Findings of ACL 2024)
https://arxiv.org/abs/2402.16457
MIT License

RetrievalQA: Assessing Adaptive Retrieval-Augmented Generation for Short-form Open-Domain Question Answering

This repository includes the dataset and code of the paper: RetrievalQA: Assessing Adaptive Retrieval-Augmented Generation for Short-form Open-Domain Question Answering (Findings of ACL 2024) by Zihan Zhang, Meng Fang, and Ling Chen.

⬇️ Download data: data/retrievalqa.jsonl or 🤗 HuggingFace Dataset


📖 Introduction

To evaluate how adaptive RAG performs, we collect questions whose required knowledge is absent from LLMs. LLMs must therefore truthfully decide whether to retrieve in order to answer the questions correctly.


Comparison between No, Adaptive, and Always retrieval on RetrievalQA


At least half of the time, GPT-3.5 is unaware that it needs retrieval (red)
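The adaptive decision illustrated above can be sketched as a simple loop in which the model is first asked whether it can answer on its own. Here `model_generate` and `retrieve` are hypothetical interfaces for illustration, and the prompts are not the paper's actual templates:

```python
def adaptive_answer(question, model_generate, retrieve):
    """Illustrative adaptive-retrieval loop.

    model_generate(prompt) -> str and retrieve(question) -> list[str]
    are assumed interfaces; the methods evaluated in the paper use
    their own prompts and models.
    """
    decision = model_generate(
        "Can you answer the question from your own knowledge alone? "
        f"Reply YES or NO.\nQuestion: {question}"
    )
    if decision.strip().upper().startswith("NO"):
        # Model believes it lacks the knowledge: retrieve, then answer.
        passages = "\n".join(retrieve(question))
        prompt = f"{passages}\n\nQuestion: {question}\nAnswer:"
    else:
        # Model believes parametric knowledge suffices: answer directly.
        prompt = f"Question: {question}\nAnswer:"
    return model_generate(prompt)
```

The failure mode measured in the figure corresponds to the model answering YES here when it should have answered NO.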

⚙️ Install Dependencies

The code has been tested under Python 3.9. The following are the steps to set up the environment.

Create a conda environment:

```bash
conda create -n retrievalqa python=3.9 -y
conda activate retrievalqa
```

Install PyTorch: we used PyTorch 2.1.2 and CUDA 12.1 in our experiments; other versions might also work.

```bash
conda install pytorch==2.1.2 torchvision==0.16.2 torchaudio==2.1.2 pytorch-cuda=12.1 -c pytorch -c nvidia
```

Install libraries:

```bash
pip install -r requirements.txt
```

📋 Data Download & Statistics

RetrievalQA is a short-form open-domain question answering (QA) dataset comprising 2,785 questions covering new world and long-tail knowledge. It contains 1,271 questions needing external knowledge retrieval and 1,514 questions that most LLMs can answer with internal parametric knowledge.

RetrievalQA is available at data/retrievalqa.jsonl; you can also download it from the 🤗 HuggingFace Dataset. data/retrievalqa_gpt4.jsonl contains a subset of 250 selected examples used to test GPT-4 to save API costs.
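To work with the local copy, the JSONL file can be read directly. A minimal sketch, with field names following the example instance shown later in this README:

```python
import json

def load_retrievalqa(path):
    """Read a RetrievalQA JSONL file into a list of dicts."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]

def split_by_answerability(examples):
    """Split examples on the param_knowledge_answerable flag:
    0 -> external retrieval needed, 1 -> parametric knowledge suffices."""
    needs_retrieval = [ex for ex in examples if ex["param_knowledge_answerable"] == 0]
    parametric = [ex for ex in examples if ex["param_knowledge_answerable"] == 1]
    return needs_retrieval, parametric
```

On the full data/retrievalqa.jsonl, the two splits should contain 1,271 and 1,514 questions respectively.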

| Category | Data Source | # Original | # After Filtering | # Avg. Q Tokens | # Avg. Ans Tokens |
|---|---|---|---|---|---|
| New world knowledge | RealTimeQA | 397 | 188 | 19 | 3.1 |
| New world knowledge | FreshQA | 127 | 54 | 13.8 | 3.9 |
| Long-tail knowledge | ToolQA | 100 | 75 | 21.7 | 3.5 |
| Long-tail knowledge | PopQA | 1,399 | 659 | 8.8 | 4 |
| Long-tail knowledge | TriviaQA | 7,313 | 295 | 17.3 | 5.9 |
| Total/Average | RetrievalQA | 9,336 | 1,271 | 13.2 | 4.3 |

Here is an example of a data instance:

```json
{
  "data_source": "realtimeqa", 
  "question_id": "realtimeqa_20231013_1", 
  "question": "What percentage of couples are 'sleep divorced', according to new research?", 
  "ground_truth": ["15%"], 
  "context": [
    {
      "title": "Do We Sleep Longer When We Share a Bed?", 
      "text": "1.4% of respondents have started a sleep divorce, or sleeping separately from their partner, and maintained it in the past year. Adults who have ..."
    }, ...
  ],
  "param_knowledge_answerable": 0
}
```

where:

- `data_source`: the dataset the question originates from
- `question_id`: a unique question identifier
- `question`: the question text
- `ground_truth`: a list of acceptable answers
- `context`: pre-retrieved documents (title and text) for the question
- `param_knowledge_answerable`: 1 if most LLMs can answer the question with internal parametric knowledge, 0 if external retrieval is needed

[!IMPORTANT] We have pre-retrieved relevant documents for each question, stored in the `context` field of the dataset. You can use these pre-retrieved documents for generation; however, note that some retrieved documents may lack the information needed to answer the question due to retriever errors.

In this paper, we focus on retrieval accuracy rather than retriever quality; that is, we are interested in how accurately adaptive retrieval methods decide when to retrieve. You can also retrieve documents yourself, as described in the Retriever section below.
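One way to use the pre-retrieved documents for generation is to concatenate the top passages into the prompt. A sketch assuming the `title`/`text` layout of the `context` field; the prompt wording is illustrative, not the repository's template:

```python
def format_with_context(question, context, top_k=3):
    """Build a simple reading-comprehension prompt from pre-retrieved
    passages. context is the list of {"title", "text"} dicts from the
    dataset; only the first top_k passages are used."""
    passages = "\n".join(
        f"[{i + 1}] {doc['title']}: {doc['text']}"
        for i, doc in enumerate(context[:top_k])
    )
    return (
        "Answer the question based on the passages below.\n\n"
        f"{passages}\n\nQuestion: {question}\nAnswer:"
    )
```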

📊 Reproduce the Results

We provide executable scripts to reproduce the results; refer to the .sh files for the different settings. If you wish to test GPT-3.5/4, you need to provide your OpenAI API key in openai_config.txt.

Run LLM baselines

```bash
bash run_lm.sh
```

[!NOTE]
You can choose any text generation model from HuggingFace; however, we recommend instruction fine-tuned models and the suggested prompt templates. Additionally, since we use vLLM for accelerated inference, you should choose models supported by vLLM; otherwise, you can fall back to the HF Pipeline for inference.

Run Self-RAG with specified threshold

```bash
bash run_selfrag.sh
```

Run Self-RAG without threshold

By default, `threshold=None`, which means Self-RAG retrieves only when it generates the [Retrieval] token.

```bash
bash run_selfrag_no_threshold.sh
```
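The threshold controls how eagerly Self-RAG retrieves: roughly, the probability mass the model places on its special [Retrieval] reflection token is weighed against [No Retrieval]. A sketch of that decision rule, under the assumption that the two token log-probabilities are available; see the Self-RAG code for the exact formulation:

```python
import math

def should_retrieve(logprobs, threshold=None):
    """Decide whether to retrieve from reflection-token log-probabilities.

    logprobs maps "[Retrieval]" and "[No Retrieval]" to log-probs.
    With threshold=None, retrieve only when [Retrieval] outweighs
    [No Retrieval]; otherwise retrieve when its normalized probability
    exceeds the threshold. Illustrative, not the repository's exact code.
    """
    p_ret = math.exp(logprobs["[Retrieval]"])
    p_no = math.exp(logprobs["[No Retrieval]"])
    if threshold is None:
        return p_ret > p_no
    return p_ret / (p_ret + p_no) > threshold
```

A higher threshold therefore makes the system retrieve less often.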

The code generates a score_*.json file containing all metrics and a predic_*.jsonl file containing all model predictions. We provide our results under the results/reproduce folder.
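For short-form QA, a common scoring rule, shown here only to illustrate how such score files can be computed (not necessarily this repository's exact metric), is SQuAD-style normalized matching against the `ground_truth` list:

```python
import re
import string

def normalize(text):
    """Lowercase, drop punctuation and articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in set(string.punctuation))
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def answer_match(prediction, ground_truths):
    """True if any normalized gold answer appears in the normalized prediction."""
    pred = normalize(prediction)
    return any(normalize(gt) in pred for gt in ground_truths)
```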

🕸️ Retriever

In the paper, we use different retrievers for questions from different sources.

RealTimeQA and FreshQA

For new world knowledge questions, we use the Google Search API provided by SerpApi. You need to set up a SerpApi API key; refer to google_search.py for the search code. We only use the title and snippet from the search results.
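A minimal sketch of such a search call, using only the Python standard library. The endpoint and the organic_results/title/snippet field names follow SerpApi's documented JSON response; google_search.py in this repo remains the authoritative version:

```python
import json
import urllib.parse
import urllib.request

def google_search(query, api_key, num=5):
    """Query Google via SerpApi (GET /search.json) and keep title + snippet only."""
    params = urllib.parse.urlencode(
        {"engine": "google", "q": query, "api_key": api_key, "num": num}
    )
    url = f"https://serpapi.com/search.json?{params}"
    with urllib.request.urlopen(url, timeout=30) as resp:
        return extract_title_snippets(json.load(resp))

def extract_title_snippets(result_json):
    """Reduce SerpApi organic results to {"title", "text"} dicts,
    matching the layout of the dataset's context field."""
    return [
        {"title": r.get("title", ""), "text": r.get("snippet", "")}
        for r in result_json.get("organic_results", [])
    ]
```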

ToolQA

The agenda corpus is synthesized with virtual names and events. We use the retriever provided by ToolQA to search for relevant documents in the Chroma vector database.

PopQA and TriviaQA

We use the pre-retrieved documents provided by Self-RAG. You can follow their retriever setup to retrieve documents from Wikipedia.

🌟Citation

If you find our code, data, or the paper useful, please cite the paper:

```bibtex
@misc{zhang2024retrievalqa,
      title={RetrievalQA: Assessing Adaptive Retrieval-Augmented Generation for Short-form Open-Domain Question Answering}, 
      author={Zihan Zhang and Meng Fang and Ling Chen},
      year={2024},
      eprint={2402.16457},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}
```

Acknowledgement

Our data and code build on previous works, including RealTimeQA, FreshQA, ToolQA, PopQA, TriviaQA, and Self-RAG.

🐞Questions?

If you have questions, please raise an issue.