VerifAI project

No more searches, just verifiably accurate answers.

Project Description

The VerifAI project aims to address the problem of hallucinations in generative large language models and generative search engines. We started with the biomedical domain, but have since expanded VerifAI to support indexing any documents in TXT, MD, DOCX, PPTX, or PDF format.

VerifAI is an AI system designed to answer users' questions by retrieving the most relevant documents, generating an answer with references to those documents, and verifying that the generated answer does not contain any hallucinations. At the core of the system is a generative search engine powered by open technologies. However, generative models may hallucinate, so VerifAI includes a second model that checks the sources cited by the generative model and flags any misinformation or misinterpretation of the source documents. This makes the answers created by the generative search engine completely verifiable.
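
The verification step described above can be illustrated with a toy example. This is only a sketch under loud assumptions: word-overlap (Jaccard) similarity stands in for VerifAI's fine-tuned verification model, and the function names, the 0.5 threshold, and the sample data are all hypothetical.

```python
import re


def split_sentences(text):
    """Naive sentence splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def tokens(text):
    """Lowercased set of alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard(a, b):
    """Word-overlap similarity in [0, 1]; a stand-in for a real verification model."""
    ta, tb = tokens(a), tokens(b)
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def closest_sentence(claim, document):
    """Return the sentence of `document` most similar to `claim`, with its score."""
    sentences = split_sentences(document)
    if not sentences:
        return "", 0.0
    best = max(sentences, key=lambda s: jaccard(claim, s))
    return best, jaccard(claim, best)


def verify_answer(claims, sources, threshold=0.5):
    """Check each (claim, source_id) pair against its cited source document."""
    report = []
    for claim, source_id in claims:
        sentence, score = closest_sentence(claim, sources.get(source_id, ""))
        report.append({
            "claim": claim,
            "source": source_id,
            "closest_sentence": sentence,
            "supported": score >= threshold,  # hypothetical cut-off
        })
    return report
```

Each entry of the report pairs a claim with the closest sentence in its cited document and marks claims that fall below the threshold as unsupported.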

The best part is that we are making it open source, so anyone can use it!

Check the article about the VerifAI project published on Towards Data Science.

Main features

Easy installation by running a single script
Easy indexing of local files in PDF, PPTX, DOCX, MD and TXT formats
Combination of lexical and semantic search to find the most relevant documents
Use of any HuggingFace-listed model for document embeddings
Use of any LLM that follows the OpenAI API standard (deployed using vLLM, Nvidia NIM, Ollama, or via commercial APIs, such as OpenAI, Azure)
Supports large amounts of indexed documents (tested with over 200GB of data and 30 million documents)
Shows the closest sentence in the document to the generated claim
User registration and log-in
Pleasant user interface developed in React.js
Verification that generated text does not contain hallucinations by a specially fine-tuned model
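
To illustrate what combining lexical and semantic search can look like, here is a minimal sketch of score fusion. It is an assumption for illustration, not VerifAI's actual ranking code: scores from each retriever are min-max normalized and merged with a weighted sum, and the function names and the default weight are hypothetical.

```python
def min_max(scores):
    """Rescale a {doc_id: score} mapping to the range [0, 1]."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        # All scores are equal, so treat every document as a full match.
        return {doc: 1.0 for doc in scores}
    return {doc: (s - lo) / (hi - lo) for doc, s in scores.items()}


def hybrid_rank(lexical, semantic, lexical_weight=0.5):
    """Merge two {doc_id: score} mappings into one ranked list of (doc_id, score)."""
    lex, sem = min_max(lexical), min_max(semantic)
    combined = {}
    for doc in set(lex) | set(sem):
        # A document missing from one retriever contributes 0 for that retriever.
        combined[doc] = (lexical_weight * lex.get(doc, 0.0)
                         + (1.0 - lexical_weight) * sem.get(doc, 0.0))
    # Highest combined score first; ties broken by document id.
    return sorted(combined.items(), key=lambda item: (-item[1], item[0]))
```

Normalizing first matters because lexical scores (for example BM25) and embedding similarities live on different scales, so adding them directly would let one retriever dominate.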

Installation and start-up

VerifAI Core

Clone the repository or download the latest release

Create a Python virtual environment by running:

python -m venv verifai
source verifai/bin/activate

Install the requirements by running pip install -r backend/requirements.txt
Run the install_datastore.py script. It requires Docker to be installed and the Docker daemon to be running. The script installs the necessary components, such as OpenSearch, Qdrant and PostgreSQL, and creates the database in PostgreSQL.
```
python install_datastore.py
```
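
Since the script above depends on a running Docker daemon, it can help to check for one first. The sketch below is not part of VerifAI; the function name is hypothetical, and it assumes that `docker info` exits with a non-zero status when the daemon cannot be reached.

```python
import shutil
import subprocess


def docker_daemon_running(timeout=10.0):
    """Return True if the docker CLI is on PATH and `docker info` succeeds."""
    if shutil.which("docker") is None:
        return False  # the docker CLI is not installed
    try:
        result = subprocess.run(
            ["docker", "info"],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout,
            check=False,
        )
    except (OSError, subprocess.TimeoutExpired):
        return False
    return result.returncode == 0


if __name__ == "__main__":
    if docker_daemon_running():
        print("Docker is running; install_datastore.py can be started.")
    else:
        print("Docker is not available; install it and start the daemon first.")
```
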
Configure the system by modifying .env.local.example in the backend folder and renaming it to .env. The configuration should look like this:
```
SECRET_KEY=6183db7b3c4f67439ad61d1b798224a035fe35c4113bf870
ALGORITHM=HS256

DBNAME=verifai_database
USER_DB=myuser
PASSWORD_DB=mypassword
HOST_DB=localhost

OPENSEARCH_IP=localhost
OPENSEARCH_USER=admin
OPENSEARCH_PASSWORD=admin
OPENSEARCH_PORT=9200
OPENSEARCH_USE_SSL=False
QDRANT_IP=localhost
QDRANT_PORT=6333
QDRANT_API=8da7725d78141e19a9bf3d878f4cb333fedb56eed9727904b46ce4b32e1ce085
QDRANT_USE_SSL=False

OPENAI_PATH=<path-to-openai/azure/vllm/nvidia_nim/ollama-interface>
OPENAI_KEY=
DEPLOYMENT_MODEL=GPT4o
MAX_CONTEXT_LENGTH=128000

EMBEDDING_MODEL="sentence-transformers/msmarco-bert-base-dot-v5"

INDEX_NAME_LEXICAL = 'myindex-lexical'
INDEX_NAME_SEMANTIC = "myindex-semantic"

USE_VERIFICATION=True
```
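
Before starting the backend, a small script can sanity-check the .env file and generate a fresh SECRET_KEY in the same 48-character hex format as the example. This is a sketch, not VerifAI's own loader: the simplified parser, the function names, and the choice of which keys count as required are assumptions.

```python
import secrets

# Hypothetical selection: the README does not state which keys are mandatory.
REQUIRED_KEYS = [
    "SECRET_KEY", "ALGORITHM", "DBNAME", "USER_DB", "PASSWORD_DB", "HOST_DB",
    "OPENSEARCH_IP", "OPENSEARCH_PORT", "QDRANT_IP", "QDRANT_PORT",
    "OPENAI_PATH", "DEPLOYMENT_MODEL", "EMBEDDING_MODEL",
    "INDEX_NAME_LEXICAL", "INDEX_NAME_SEMANTIC",
]


def parse_env(text):
    """Parse KEY=VALUE lines, skipping blank lines and '#' comments."""
    values = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Tolerate spaces around '=' and surrounding single or double quotes.
        values[key.strip()] = value.strip().strip("'\"")
    return values


def missing_keys(values, required=REQUIRED_KEYS):
    """Return the required keys that are absent or empty."""
    return [key for key in required if not values.get(key)]


def new_secret_key():
    """Generate a random 48-character hex string (24 random bytes)."""
    return secrets.token_hex(24)
```

A typical use would be to read the file with `parse_env(open("backend/.env").read())`, print the result of `missing_keys(...)`, and paste the output of `new_secret_key()` into SECRET_KEY instead of reusing the example value.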

Index your files by running index_files.py and pointing it to the directory with the files you would like to index. It will recursively index all files in the directory.
```shell
python index_files.py <path-to-directory-with-files>
```

As an example, we have created a folder with some example files in the folder test_data. You can index them by running:
```shell
python index_files.py test_data
```
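
To preview which files a recursive walk would pick up before indexing, something like the following can be used. It is a sketch and not the logic of index_files.py itself: the function name is hypothetical, and it assumes that only the five supported formats are relevant and that other files are skipped.

```python
import sys
from pathlib import Path

# Formats listed as supported in this README.
SUPPORTED_EXTENSIONS = {".txt", ".md", ".docx", ".pptx", ".pdf"}


def find_indexable_files(root):
    """Recursively collect files under `root` with a supported extension."""
    root_path = Path(root)
    return sorted(
        path
        for path in root_path.rglob("*")
        if path.is_file() and path.suffix.lower() in SUPPORTED_EXTENSIONS
    )


if __name__ == "__main__":
    # Usage: python preview_files.py <path-to-directory-with-files>
    directory = sys.argv[1] if len(sys.argv) > 1 else "test_data"
    for path in find_indexable_files(directory):
        print(path)
```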

Run the backend of VerifAI by running main.py in the backend folder.
```
python main.py
```
Install React by following this guide
Install the React requirements for the front end in the client-gui/verifai-ui folder and run the front end:
```
cd ..
cd client-gui/verifai-ui
npm install
npm start
```
Go to http://localhost:3000 to see VerifAI in action.

You can check a tutorial on deploying VerifAI published on Towards Data Science

VerifAI BioMed

This is the biomedical version of VerifAI. It is designed to answer questions from the biomedical domain.

One requirement for running it locally is to have PostgreSQL installed. On macOS, for example, you can install it by running brew install postgresql.

Clone the repository
Install the requirements by running pip install -r backend/requirements.txt
Download MEDLINE by executing download_medline_data.sh for the core files for the current year and download_medline_data_update.sh for the current MEDLINE update files.
Install Qdrant following the guide here
Run python medline2json.py to transform the MEDLINE XML files into JSON
Run python json2selected.py to select the fields that should be imported into the index
Run python abstarct_parser.py to concatenate article titles and abstracts and split the text into parts of length 512 that can be indexed using a transformer model
Run python embeddings_creation.py to create embeddings.
Run python scripts/indexing_qdrant.py to create the Qdrant index. Make sure to point it to the folder created in the previous step and to the Qdrant instance.
Install OpenSearch following the guide here
Create the OpenSearch index by running python scripts/indexing_lexical_pmid.py. Make sure to configure access to OpenSearch and point the path variable to the folder created by the json2selected script.
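
The concatenate-and-split step in the list above can be pictured with a short sketch. It is an illustration rather than the project's parser: the function name is hypothetical, and because the unit of "512" is not stated here, the sketch counts whitespace-separated words where a real pipeline would more likely count tokens from the embedding model's tokenizer.

```python
def chunk_record(title, abstract, max_len=512):
    """Join title and abstract, then split the text into parts of at most max_len words."""
    text = f"{title} {abstract}".strip()
    words = text.split()
    return [
        " ".join(words[start:start + max_len])
        for start in range(0, len(words), max_len)
    ]
```

Each returned part can then be embedded and indexed separately, which keeps every input within the length limit of the transformer model.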

Set up the environment variables needed for the project by creating a .env file with the following content:


OPENSEARCH_IP=open_search_ip
OPENSEARCH_USER=open_search_user
OPENSEARCH_PASSWORD=open_search_pass
OPENSEARCH_PORT=9200
QDRANT_IP=qdrant_ip
QDRANT_PORT=qdrant_port
QDRANT_API=qdrant_api_key
QDRANT_USE_SSL=False
OPENSEARCH_USE_SSL=False
MAX_CONTEXT_LENGTH=32000
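
Values such as OPENSEARCH_USE_SSL=False reach Python as strings, and the string "False" is truthy, so flags need explicit parsing. The sketch below shows one way to do that and to build a service URL from the settings. The helper names are hypothetical and this is not necessarily how the VerifAI backend reads its configuration.

```python
import os


def env_flag(name, default=False, environ=os.environ):
    """Interpret an environment variable as a boolean flag."""
    raw = environ.get(name)
    if raw is None:
        return default
    return raw.strip().lower() in {"1", "true", "yes", "on"}


def service_url(host, port, use_ssl):
    """Build a base URL, choosing the scheme from the SSL flag."""
    scheme = "https" if use_ssl else "http"
    return f"{scheme}://{host}:{port}"


if __name__ == "__main__":
    use_ssl = env_flag("OPENSEARCH_USE_SSL")
    host = os.environ.get("OPENSEARCH_IP", "localhost")
    port = os.environ.get("OPENSEARCH_PORT", "9200")
    print(service_url(host, port, use_ssl))
```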