This repository includes the dataset and code of the paper: RetrievalQA: Assessing Adaptive Retrieval-Augmented Generation for Short-form Open-Domain Question Answering (Findings of ACL 2024) by Zihan Zhang, Meng Fang, and Ling Chen.
⬇️ Download data:
data/retrievalqa.jsonl
or
🤗 HuggingFace Dataset
To evaluate how adaptive RAG performs, we collect questions for which the knowledge necessary to answer them is absent from LLMs. Therefore, LLMs must truthfully decide whether to retrieve in order to answer these questions correctly.
Comparison between No, Adaptive, and Always retrieval on RetrievalQA
At least half of the time, GPT-3.5 is unaware that it needs retrieval (red)
The code has been tested under Python 3.9. The following are the steps to set up the environment.
Create conda environment:
conda create -n retrievalqa python=3.9 -y
conda activate retrievalqa
Install PyTorch: we used PyTorch 2.1.2 and CUDA 12.1 in the experiments; however, other versions might also work.
conda install pytorch==2.1.2 torchvision==0.16.2 torchaudio==2.1.2 pytorch-cuda=12.1 -c pytorch -c nvidia
Install libraries:
pip install -r requirements.txt
RetrievalQA is a short-form open-domain question answering (QA) dataset comprising 2,785 questions covering new world and long-tail knowledge. It contains 1,271 questions needing external knowledge retrieval and 1,514 questions that most LLMs can answer with internal parametric knowledge.
RetrievalQA is available at data/retrievalqa.jsonl; you can also download it from the 🤗 HuggingFace Dataset. data/retrievalqa_gpt4.jsonl contains only the 250 selected examples that we used to test GPT-4, in order to save costs.
Category | Data Source | # Original | # After Filtering | # Avg. Q Tokens | # Avg. Ans Tokens |
---|---|---|---|---|---|
New world knowledge | RealTimeQA | 397 | 188 | 19 | 3.1 |
New world knowledge | FreshQA | 127 | 54 | 13.8 | 3.9 |
Long-tail knowledge | ToolQA | 100 | 75 | 21.7 | 3.5 |
Long-tail knowledge | PopQA | 1,399 | 659 | 8.8 | 4 |
Long-tail knowledge | TriviaQA | 7,313 | 295 | 17.3 | 5.9 |
Total/Average | RetrievalQA | 9,336 | 1,271 | 13.2 | 4.3 |
Here is an example of a data instance:
{
"data_source": "realtimeqa",
"question_id": "realtimeqa_20231013_1",
"question": "What percentage of couples are 'sleep divorced', according to new research?",
"ground_truth": ["15%"],
"context": [
{
"title": "Do We Sleep Longer When We Share a Bed?",
"text": "1.4% of respondents have started a sleep divorce, or sleeping separately from their partner, and maintained it in the past year. Adults who have ..."
}, ...
],
"param_knowledge_answerable": 0
}
where:
- data_source: the dataset the question originally comes from
- question: the question
- ground_truth: a list of possible answers
- context: a list of dictionaries of retrieved relevant evidence. Note that the title of a document might be empty
- param_knowledge_answerable: 0 indicates the question needs external retrieval; 1 indicates the question can be answered using the LLM's parametric knowledge

> [!IMPORTANT]
> We have pre-retrieved relevant documents for each question, as shown in the context field in the dataset. You can use these pre-retrieved documents for generation; however, please note that some retrieved documents might not contain the information necessary to answer the question, because the retriever is imperfect.
>
> In this paper, we focus on retrieval accuracy rather than the quality of the retriever. That is, we are more interested in how accurate adaptive retrieval methods are in deciding when to retrieve. You can also retrieve documents yourself, as shown in the Retriever section below.
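
As a rough illustration of the fields above, the following sketch loads the JSONL file and scores a set of retrieve/no-retrieve decisions against param_knowledge_answerable. The helper names (load_jsonl, retrieval_accuracy) and the decisions mapping are hypothetical and not part of this repository's code, and the metric definition used in the paper may differ in detail.

```python
import json


def load_jsonl(path):
    """Read one JSON object per non-empty line."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


def retrieval_accuracy(examples, decisions):
    """Fraction of questions where the retrieve/no-retrieve decision is right.

    `decisions` (hypothetical) maps question_id -> True if the method chose
    to retrieve. Retrieval is treated as needed when
    param_knowledge_answerable == 0.
    """
    if not examples:
        return 0.0
    correct = 0
    for ex in examples:
        needs_retrieval = ex["param_knowledge_answerable"] == 0
        if decisions[ex["question_id"]] == needs_retrieval:
            correct += 1
    return correct / len(examples)
```

For example, calling load_jsonl("data/retrievalqa.jsonl") and passing a decisions mapping that is True for every question gives the decision accuracy of an always-retrieve policy, which is correct only on the questions that need retrieval.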
We have provided executable scripts to reproduce the results. Refer to the .sh files for the different settings. If you wish to test GPT-3.5/4, you need to provide your OpenAI API key in openai_config.txt.
bash run_lm.sh
- prompt_method: choose from vanilla prompting or TAARE prompting
- retrieval_modes: choose from ["adaptive_retrieval", "always_retrieval", "no_retrieval"]
- model_names: choose LMs from HuggingFace or use the OpenAI API

> [!NOTE]
> You can choose any text generation model from HuggingFace; however, we recommend choosing instruction fine-tuned models and using the suggested prompt templates. Additionally, since we use vllm for accelerated inference, you should use models that are supported by vllm. Otherwise, you can use HF Pipeline for inference.
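
To make the three retrieval_modes concrete, here is a minimal sketch of how they could differ when a prompt is built. The prompt wording, the build_prompt function, and the wants_retrieval callback are illustrative assumptions; the actual templates are defined in the repository's code and may differ.

```python
from typing import Callable, Dict, List

RETRIEVAL_MODES = ("adaptive_retrieval", "always_retrieval", "no_retrieval")


def build_prompt(
    question: str,
    context: List[Dict[str, str]],
    mode: str,
    wants_retrieval: Callable[[str], bool],
) -> str:
    """Return a prompt with or without evidence depending on the mode.

    `wants_retrieval` is a hypothetical stand-in for asking the model
    whether it needs external information (used only in adaptive mode).
    """
    if mode not in RETRIEVAL_MODES:
        raise ValueError(f"unknown retrieval mode: {mode}")

    if mode == "always_retrieval":
        use_context = True
    elif mode == "no_retrieval":
        use_context = False
    else:  # adaptive_retrieval: let the model decide per question
        use_context = wants_retrieval(question)

    if not use_context:
        return f"Question: {question}\nAnswer:"

    # Titles may be empty in the dataset, so fall back to the text alone.
    passages = "\n".join(
        f"{doc['title']}: {doc['text']}" if doc.get("title") else doc["text"]
        for doc in context
    )
    return f"Context:\n{passages}\nQuestion: {question}\nAnswer:"
```

In this sketch, always_retrieval and no_retrieval ignore the callback entirely, so only the adaptive mode depends on the model's own judgement about whether it needs retrieval.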
bash run_selfrag.sh
- thresholds: set the retrieval threshold

By default, threshold=None, which means Self-RAG will only retrieve when generating [Retrieval] tokens.
bash run_selfrag_no_threshold.sh
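The two decision rules can be sketched as follows. This is a minimal illustration, assuming the thresholded rule compares the probability of the [Retrieval] token, normalized against [No Retrieval], with the threshold, as described in the Self-RAG paper. The function and argument names are hypothetical, and the scripts in this repository may implement the check differently.

```python
from typing import Optional


def should_retrieve(
    p_retrieval: float,
    p_no_retrieval: float,
    generated_retrieval_token: bool,
    threshold: Optional[float] = None,
) -> bool:
    """Decide whether to retrieve for one question.

    threshold=None: retrieve only if the model actually generated [Retrieval].
    Otherwise: retrieve if the normalized probability of [Retrieval]
    exceeds the threshold (an assumption based on the Self-RAG paper).
    """
    if threshold is None:
        return generated_retrieval_token
    total = p_retrieval + p_no_retrieval
    if total == 0.0:
        return False  # no probability mass on either token
    return p_retrieval / total > threshold
```

A lower threshold makes retrieval more frequent, while threshold=None leaves the decision entirely to the token the model generates.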
The code will generate a score_*.json file, which contains all metrics, and a predic_*.jsonl file, which contains all model predictions. We provide our results under the results/reproduce folder.
In the paper, we use different retrievers for questions from different sources.
RealTimeQA and FreshQA
For new world knowledge questions, we use the Google Search API provided by SerpApi. You need to set up a SerpApi API key and refer to google_search.py for searching. We only use the title and snippet from the search results.
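Turning search results into the context format used by the dataset could look like the sketch below. It assumes a SerpApi-style JSON response whose organic_results entries carry title and snippet keys. The function name, the top_k default, and the sample response are made up for illustration; google_search.py remains the reference for the actual implementation.

```python
from typing import Any, Dict, List


def results_to_context(response: Dict[str, Any], top_k: int = 5) -> List[Dict[str, str]]:
    """Keep only the title and snippet of the first top_k organic results."""
    context = []
    for result in response.get("organic_results", [])[:top_k]:
        snippet = result.get("snippet", "")
        if not snippet:
            continue  # skip results without usable text
        context.append({"title": result.get("title", ""), "text": snippet})
    return context


# Hypothetical response, trimmed to the fields used above.
sample_response = {
    "organic_results": [
        {"title": "Example page", "link": "https://example.com", "snippet": "Some text."},
        {"title": "No snippet here", "link": "https://example.org"},
    ]
}
print(results_to_context(sample_response))
# prints [{'title': 'Example page', 'text': 'Some text.'}]
```

The returned dictionaries use the same title and text keys as the context field of a data instance, so they can be dropped into the same generation code.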
ToolQA
The agenda corpus is synthesized with virtual names and events. We use the retriever provided by ToolQA and search for relevant documents in the Chroma vector database.
PopQA and TriviaQA
We use the pre-retrieved documents provided by Self-RAG. You can follow their retriever setup to retrieve documents from Wikipedia.
If you find our code, data, or the paper useful, please cite the paper:
@misc{zhang2024retrievalqa,
title={RetrievalQA: Assessing Adaptive Retrieval-Augmented Generation for Short-form Open-Domain Question Answering},
author={Zihan Zhang and Meng Fang and Ling Chen},
year={2024},
eprint={2402.16457},
archivePrefix={arXiv},
primaryClass={cs.CL}
}
Our data and code are based on previous works:
If you have questions, please raise an issue.