STaRK is a large-scale Semi-structured Retrieval Benchmark on Textual and Relational Knowledge bases, covering applications in product search, academic paper search, and biomedicine inquiries.
Featuring diverse, natural-sounding, and practical queries that require context-specific reasoning, STaRK sets a new standard for assessing real-world retrieval systems driven by LLMs and presents significant challenges for future research.
🔥 Check out our website for more overview!
With python >=3.8 and <3.12
pip install stark-qa
Create a conda env with python >=3.8 and <3.12 and install required packages in requirements.txt
.
conda create -n stark python=3.11
conda activate stark
pip install -r requirements.txt
from stark_qa import load_qa, load_skb
dataset_name = 'amazon'
# Load the retrieval dataset
qa_dataset = load_qa(dataset_name)
idx_split = qa_dataset.get_idx_split()
# Load the semi-structured knowledge base
skb = load_skb(dataset_name, download_processed=True, root=None)
The root argument for load_skb specifies the location to store SKB data. With default value None
, the data will be stored in huggingface cache.
Question answer pairs for the retrieval task will be automatically downloaded in data/{dataset}/stark_qa
by default. We provided official split in data/{dataset}/split
.
There are two ways to load the knowledge base data:
download_processed=True
. download_processed=False
. In this case, STaRK-PrimeKG takes around 5 minutes to download and load the processed data. STaRK-Amazon and STaRK-MAG may takes around an hour to process from the raw data.Our evaluation requires embed the node documents into candidate_emb_dict.pt
, which is a dictionary node_id -> torch.Tensor
. Query embeddings will be automatically generated if not available. You can either run the following the python script to download query embeddings and document embeddings generated by text-embedding-ada-002
. (We provide them so you can run on our benchmark right away.)
python emb_download.py --dataset amazon --emb_dir emb/
Or you can run the following code to generate the query or document embeddings by yourself. E.g.,
python emb_generate.py --dataset amazon --mode query --emb_dir emb/ --emb_model text-embedding-ada-002
dataset
: one of amazon
, mag
or prime
.mode
: the content to embed, one of query
or doc
(node documents).emb_dir
: the directory to store embeddings.emb_model
: the LLM name to generate embeddings, such as text-embedding-ada-002
, text-embedding-3-large
.emb_generate.py
for other arguments.Run the python script for evaluation. E.g.,
python eval.py --dataset amazon --model VSS --emb_dir emb/ --output_dir output/ --emb_model text-embedding-ada-002 --split test --save_pred
python eval.py --dataset amazon --model LLMReranker --emb_dir emb/ --output_dir output/ --emb_model text-embedding-ada-002 --split test --llm_model gpt-4-1106-preview --save_pred
dataset
: the dataset to evaluate on, one of amazon
, mag
or prime
.model
: the model to be evaluated, one of VSS
, MultiVSS
, LLMReranker
.
--emb_model
.LLMReranker
, please specify the LLM name with argument --llm_model
.export ANTHROPIC_API_KEY=YOUR_API_KEY
or
export OPENAI_API_KEY=YOUR_API_KEY
export OPENAI_ORG=YOUR_ORGANIZATION
or (deprecated) locally at config/openai_api_key.txt
or config/claude_api_key.txt
emb_dir
: the directory to store embeddings.split
: the split to evaluate on, one of train
, val
, test
, and human_generated_eval
(to be evaluated on the human generated query dataset).output_dir
: the directory to store evaluation outputs.Please consider citing our paper if you use our benchmark or code in your work:
@article{wu24stark,
title = {STaRK: Benchmarking LLM Retrieval on Textual and Relational Knowledge Bases},
author = {
Shirley Wu and Shiyu Zhao and
Michihiro Yasunaga and Kexin Huang and
Kaidi Cao and Qian Huang and
Vassilis N. Ioannidis and Karthik Subbian and
James Zou and Jure Leskovec
},
eprinttype = {arXiv},
eprint = {2404.13207},
year = {2024}
}