This repository provides a comprehensive benchmark for evaluating the performance of search engines. Search APIs are plug-and-play, and the benchmark works natively with Lumina, Exa, Semantic Scholar, and the SERP API. To begin, we compare research paper search engines: Lumina, Semantic Scholar, and Google Scholar (via SERP), focusing on two key metrics, Context Relevance and Context Precision. Using large language models (LLMs) as evaluators, we score the context relevancy and context precision of the top 10 search results returned by each provider. To keep the evaluation as fair as possible, we evaluate the results exactly as each provider returns them and use zero-shot search (no recursion or LLM improvement) as the default method.
Our most recent result is a comparison between Lumina Base, Lumina Recursive, Semantic Scholar, and Google Scholar.
We measured context relevancy for each search provider's top 10 search results.
Lumina consistently delivers 2-3 highly relevant results for every query, outperforming Google Scholar and Semantic Scholar, which provide one highly relevant result for only 50% and 30% of queries, respectively.
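To make the evaluation concrete, here is a minimal sketch of scoring context relevancy with an LLM judge. It assumes the OpenAI Python client and an illustrative 0-2 grading prompt; the benchmark's actual judge prompts, models, and scoring scale may differ.

```python
# Minimal sketch of scoring context relevancy with an LLM judge.
# Assumptions: the OpenAI Python client is installed and OPENAI_API_KEY is set;
# the grading prompt and 0-2 scale are illustrative, not the benchmark's exact setup.
from openai import OpenAI

client = OpenAI()

JUDGE_PROMPT = """You are grading a search result for a research question.
Question: {question}
Result: {result}
Reply with a single integer: 2 if the result is highly relevant,
1 if partially relevant, 0 if irrelevant."""

def context_relevancy(question: str, results: list[str]) -> float:
    """Average judge score over the top results for one query."""
    scores = []
    for result in results:
        response = client.chat.completions.create(
            model="gpt-4o-mini",  # any capable judge model works here
            messages=[{
                "role": "user",
                "content": JUDGE_PROMPT.format(question=question, result=result),
            }],
        )
        scores.append(int(response.choices[0].message.content.strip()))
    return sum(scores) / len(scores) if scores else 0.0
```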
This repo requires a .env
file with API keys for each of these services. To get a Lumina API_URL and gain access to our scientific search API, you can book a meeting at https://cal.com/ishaank99/lumina-api.
We set up a local postgres
instance to log the benchmark results, and a local redis
instance for communication between the benchmark and the services. To run the benchmark with recursion, you will also need to host a reranker
service; we use the BGE Large reranker. Recursion is turned off by default.
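For orientation, here is a hedged sketch of how a result might be logged to Postgres and a message pushed to Redis. The results table, the benchmark queue name, and the connection strings are hypothetical placeholders, not the schema or channels the benchmark actually uses.

```python
# Illustrative only: log one benchmark result to Postgres and notify other
# services via Redis. The "results" table, "benchmark" queue name, and
# connection settings are hypothetical placeholders.
import json
import psycopg2
import redis

pg = psycopg2.connect("postgresql://postgres:postgres@localhost:5432/benchmark")
r = redis.Redis(host="localhost", port=6379, db=0)

def log_result(provider: str, question: str, score: float) -> None:
    # Write the score to Postgres inside a transaction.
    with pg, pg.cursor() as cur:
        cur.execute(
            "INSERT INTO results (provider, question, ctx_relevancy) VALUES (%s, %s, %s)",
            (provider, question, score),
        )
    # Push a message so other services (e.g. the dashboard) can react.
    r.lpush("benchmark", json.dumps({"provider": provider, "score": score}))
```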
You can pull the benchmark image by running the following command from the root directory of the project:
docker pull index.docker.io/akhilesh99/benchmark:latest
Clone the repo and cd into it
git clone https://github.com/lumina-chat/benchmark.git
cd benchmark
Set environment variables in .env in the root of the project.
Pull the benchmark image from Docker Hub with:
docker pull index.docker.io/akhilesh99/benchmark:latest
Run docker compose up -d
to start the benchmark. This will start all of the services defined in the compose.yaml
file.
docker compose up -d
Run docker compose logs -f questions
. This will print a Streamlit link to the benchmark dashboard, where you can view progress.
To stop the benchmark, run docker compose down
.
.env
We set up API keys, Postgres, Redis, and configuration for the benchmark in this file. You should make a .env
file at the root of the repo with these variables. We use the config.py
file to access these variables and the .env
file to set them; the python-dotenv
package loads the environment variables from the .env file. These include:
NUM_Q=500
(If you want recursion, add "_recursive" to the end of the provider name, like lumina_recursive,google_scholar_recursive,semantic_scholar_recursive.)
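As a quick illustration of how these variables are read via python-dotenv, here is a minimal sketch in the spirit of config.py. NUM_Q comes from above; the PROVIDERS and QUESTION_TYPES names are illustrative placeholders, not necessarily the repo's actual variable names.

```python
# Minimal sketch of loading benchmark settings with python-dotenv.
# NUM_Q appears above; PROVIDERS and QUESTION_TYPES are hypothetical names
# used for illustration and may not match the repo's config.py.
import os
from dotenv import load_dotenv

load_dotenv()  # reads the .env file at the project root

NUM_Q = int(os.getenv("NUM_Q", "500"))
PROVIDERS = os.getenv("PROVIDERS", "lumina,google_scholar,semantic_scholar").split(",")
QUESTION_TYPES = os.getenv("QUESTION_TYPES", "generated_questions,user_queries").split(",")
```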
compose.yaml
The compose.yaml
file orchestrates the deployment of all services required for the benchmark. It defines the configuration for each service, including dependencies, environment variables, and the number of replicas to run. This setup enables communication between the benchmark and the various search providers, as well as logging and data storage through Redis and PostgreSQL.
benchmark.py
The benchmark.py
script is run separately and performs the actual benchmarking with the following parameters:
- Question types: generated_questions and user_queries
- Metric: ctx_relevancy
You can also create your own custom question datasets for benchmarking. Simply add your JSONL file to the search_benchmark/dataset
folder and use its name (without the .jsonl extension) as a question type when running the benchmark.
The script uses two question types: generated_questions
and user_queries
. These correspond to JSONL files located in the search_benchmark/dataset
folder. Each file contains a set of questions used for the benchmark.
- generated_questions: 9k AI-generated questions for benchmarking
- user_queries: 9k real user queries from SciSpace for more realistic testing

You don't need to run all of the questions; you can specify the number of questions in the benchmark.py
file.
You can modify these files or add new ones to customize the benchmark according to your needs.
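As an example, a custom dataset is just a JSONL file with one question per line. The sketch below assumes a single "question" field per record, which is an assumption; mirror the schema of the bundled generated_questions.jsonl and user_queries.jsonl files.

```python
# Create a custom question dataset as JSONL, one JSON object per line.
# The "question" key is an assumption; check the bundled dataset files
# in search_benchmark/dataset for the exact schema.
import json
from pathlib import Path

questions = [
    "What are the main limitations of transformer models for long documents?",
    "How does retrieval-augmented generation reduce hallucination?",
]

path = Path("search_benchmark/dataset/my_questions.jsonl")
with path.open("w") as f:
    for q in questions:
        f.write(json.dumps({"question": q}) + "\n")
# Then use "my_questions" as the question type when running the benchmark.
```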
The recursive search algorithm enhances search results by using an LLM to generate new questions based on initial search results. This process helps to fill gaps in the original results and provide more comprehensive coverage of the topic.
1. Initial Search: the original query is searched, returning page_size_per_recursion results.
2. Generate New Questions: based on the initial results, an LLM suggests refined queries using the following prompt:

   Based on the user's query: "{question}",
   the search result is: {result}
   Identify parts of the user's query that were unanswered or need further refinement, and suggest a refined search query to help find better search results.
   There should be variation in length, complexity, and specificity across the queries.
   The query must be based on the detailed concepts, key-terms, hard values and facts in the result you've been provided.
   Wrap it in tags.

3. Recursive Search: the refined queries are searched in the same way, repeating the process until recursion_depth
is reached.
4. Result Processing: the results from the initial and recursive searches are combined.
5. Reranking: the combined results are reranked down to the top page_size
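Putting these steps together, here is a hedged sketch of the recursive flow. The search(), refine_queries(), and rerank() helpers are hypothetical stand-ins for the provider API, the LLM refinement prompt above, and the BGE Large reranker service.

```python
# Sketch of the recursive search flow described above. The helper functions
# passed in (search, refine_queries, rerank) are hypothetical placeholders
# for the provider API, the LLM refinement prompt, and the reranker service.

def recursive_search(question, search, refine_queries, rerank,
                     page_size_per_recursion=10, recursion_depth=2, page_size=10):
    # 1. Initial search
    results = search(question, page_size_per_recursion)
    all_results = list(results)

    for _ in range(recursion_depth):
        # 2. Generate refined queries from the latest batch of results
        new_queries = refine_queries(question, results)
        # 3. Search each refined query
        results = [r for q in new_queries
                   for r in search(q, page_size_per_recursion)]
        all_results.extend(results)

    # 4. Combine results (simple order-preserving de-duplication)
    seen, combined = set(), []
    for r in all_results:
        if r not in seen:
            seen.add(r)
            combined.append(r)

    # 5. Rerank down to the top page_size results
    return rerank(question, combined, page_size)
```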
results.

To run Adminer, visit localhost:8080 and use the following credentials to log in: