rag-api-pipeline
is a Python-based data pipeline tool that lets you easily generate a vector knowledge base from any REST API data source. The resulting database snapshot can then be plugged into a Gaia node's LLM model together with a prompt to provide contextual responses to user queries using RAG (Retrieval Augmented Generation).
The following sections help you quickly set up and execute the pipeline on your REST API. If you're looking for more in-depth information about how to use this tool, its tech stack, and/or how it works under the hood, check the content menu on the left.
Git clone or download this repository to your local machine.
If using a custom virtual environment, activate it first; otherwise, poetry will handle the environment for you.
Navigate to the directory where this repository was cloned/downloaded and execute the following in a terminal:
```bash
poetry install
```
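To confirm the CLI was installed correctly, you can print its help output (assuming the CLI exposes a top-level `--help`, as its subcommands do; no data is processed by this call):

```bash
poetry run rag-api-pipeline --help
```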
Copy `config/.env/sample` into a `config/.env` file and set the environment variables accordingly. Check the environment variables section for details.
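For example, assuming you're in the repository root:

```bash
cp config/.env/sample config/.env
```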
Define the pipeline manifest for the REST API you want to extract data from. Check Defining an API Pipeline Manifest for details on how to write one, or take a look at the in-depth review of the sample manifests available in API Examples.
Set the REST API key in a `config/secrets/api_key` file, or specify it using the `--api-key` argument to the CLI.
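For instance, to store the key in the secrets file (`YOUR_API_KEY` is a placeholder for your actual key):

```bash
mkdir -p config/secrets
echo "YOUR_API_KEY" > config/secrets/api_key
```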
Get the base URL of your Qdrant vector DB, or deploy a local Qdrant (Docs) vector database instance using Docker:
```bash
# IMPORTANT: make sure you use `qdrant:v1.10.1` for compatibility with Gaianet node
docker run -p 6333:6333 -p 6334:6334 -v ./qdrant_dev:/qdrant/storage:z qdrant/qdrant:v1.10.1
```
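Once the container is up, you can verify the instance is reachable; the REST endpoint below is part of Qdrant's standard API:

```bash
curl http://localhost:6333/collections
# a fresh instance should return something like: {"result":{"collections":[]},"status":"ok",...}
```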
Get your Gaianet node running (Docs) or install an Ollama (Docs) provider locally. The latter is recommended if you're looking to run the pipeline on consumer hardware.
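If you go the Ollama route on Linux, the official install script can be used (check the Ollama docs for macOS/Windows installers):

```bash
curl -fsSL https://ollama.com/install.sh | sh
```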
Load the LLM embeddings model of your preference into the LLM provider you chose in the previous step, e.g. `nomic-embed-text-v1.5.f16.gguf` (Source). To load it into Ollama, create a `Modelfile` and paste the following (replace `<path/to/model>` with your local directory):

```
FROM <path/to/model>/nomic-embed-text-v1.5.f16.gguf
```

Then create the model and check it is available:

```bash
ollama create Nomic-embed-text-v1.5
ollama show Nomic-embed-text-v1.5
```
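As a quick smoke test, you can request an embedding through Ollama's REST API (the `/api/embeddings` endpoint is part of Ollama's standard API; the model name assumes the `ollama create` step above):

```bash
curl http://localhost:11434/api/embeddings \
  -d '{"model": "Nomic-embed-text-v1.5", "prompt": "hello world"}'
# expected: a JSON object with a 768-dimensional "embedding" array
```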
Now you're ready to use the `rag-api-pipeline` CLI commands to execute the different tasks of the RAG pipeline, from extracting data from an API source to generating vector embeddings and a database snapshot. If you need more details about the parameters available on each command, you can execute:
```bash
poetry run rag-api-pipeline <command> --help
```
Below you can find the default commands available and an in-depth review of both the functionality and the arguments that each command offers:
```bash
# run the entire pipeline
poetry run rag-api-pipeline run-all API_MANIFEST_FILE --openapi-spec-file <openapi-spec-yaml-file> [--full-refresh] [--llm-provider openai|ollama]

# or run using an already normalized dataset
poetry run rag-api-pipeline from-normalized API_MANIFEST_FILE --normalized-data-file <jsonl-file> [--llm-provider openai|ollama]

# or run using an already chunked dataset
poetry run rag-api-pipeline from-chunked API_MANIFEST_FILE --chunked-data-file <jsonl-file> [--llm-provider openai|ollama]
```
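For example, a full run might look like this (the manifest path below is a placeholder for your own manifest file; the spec path matches the documented default):

```bash
poetry run rag-api-pipeline run-all config/api_pipeline.yaml \
  --openapi-spec-file config/openapi.yaml \
  --llm-provider ollama \
  --full-refresh
```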
- `run-all`: executes the entire RAG data pipeline, including API endpoint data streams, data normalization, data chunking, vector embeddings, and database snapshot generation. You can specify the following arguments to the command:
  - `API_MANIFEST_FILE`: API pipeline manifest file (mandatory)
  - `--llm-provider [ollama|openai]`: backend embeddings model provider. Default: an openai-like backend (e.g. the gaia rag-api-server)
  - `--api-key`: API auth key. If not specified, the tool will try to read it from `config/secrets/api_key`
  - `--openapi-spec-file`: API OpenAPI YAML spec file. Defaults to `config/openapi.yaml`
  - `--source-manifest-file`: Airbyte API Connector YAML manifest. If specified, the API Connector manifest generation step is skipped
  - `--full-refresh`: clean up the cache and extract API data from scratch
  - `--normalized-only`: run the pipeline only up to the data normalization stage
  - `--chunked-only`: run the pipeline only up to the data chunking stage
- `from-normalized`: executes the RAG data pipeline using an already normalized JSONL dataset. You can specify the following arguments to the command:
  - `API_MANIFEST_FILE`: API pipeline manifest file (mandatory)
  - `--llm-provider [ollama|openai]`: backend embeddings model provider. Default: an openai-like backend (e.g. the gaia rag-api-server)
  - `--normalized-data-file`: path to the normalized dataset in JSONL format (mandatory). Check the Architecture section for details on the required data schema.
- `from-chunked`: executes the RAG data pipeline using an already chunked dataset in JSONL format (see the example below). You can specify the following arguments to the command:
  - `API_MANIFEST_FILE`: API pipeline manifest file (mandatory)
  - `--llm-provider [ollama|openai]`: backend embeddings model provider. Default: an openai-like backend (e.g. the gaia rag-api-server)
  - `--chunked-data-file`: path to the chunked dataset in JSONL format (mandatory). Check the Architecture section for details on the required data schema.
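For example, to regenerate embeddings and the snapshot from a previously chunked dataset without re-extracting API data (file paths below are illustrative and follow the output naming scheme described in the next section):

```bash
poetry run rag-api-pipeline from-chunked config/api_pipeline.yaml \
  --chunked-data-file output/my_api/my_api_chunked.jsonl \
  --llm-provider ollama
```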
Cached API stream data and results produced by any of the CLI commands are stored in `<OUTPUT_FOLDER>/<api_name>`. The following files and folders are created by the tool within this `baseDir` folder:
- `{baseDir}/{api_name}_source_generated.yaml`: generated Airbyte Stream connector manifest
- `{baseDir}/cache/{api_name}/*`: extracted API data is cached into a local DuckDB; database files are stored in this directory. If the `--full-refresh` argument is specified to the `run-all` command, the cache is cleared and API data is extracted from scratch
- `{baseDir}/{api_name}_stream_{x}_preprocessed.jsonl`: data streams from each API endpoint are preprocessed and stored in JSONL format
- `{baseDir}/{api_name}_normalized.jsonl`: preprocessed data streams from all API endpoints are joined together and stored in JSONL format
- `{baseDir}/{api_name}_chunked.jsonl`: normalized data that went through the data chunking stage is stored in JSONL format
- `{baseDir}/{api_name}_collection-xxxxxxxxxxxxxxxx-yyyy-mm-dd-hh-mm-ss.snapshot`: vector embeddings snapshot file exported from the Qdrant DB
- `{baseDir}/{api_name}_collection-xxxxxxxxxxxxxxxx-yyyy-mm-dd-hh-mm-ss.snapshot.tar.gz`: compressed knowledge base that contains the vector embeddings snapshot

The following environment variables can be adjusted in `config/.env` based on user needs:
Pipeline config parameters:

- `API_DATA_ENCODING` (="utf-8"): default data encoding used by the REST API
- `OUTPUT_FOLDER` (="./output"): output folder where cached data streams, intermediary stage results, and the generated knowledge base snapshot are stored

LLM provider settings:

- `LLM_API_BASE_URL` (="http://localhost:8080/v1"): LLM provider base URL (defaults to a local openai-based provider such as a gaia node)
- `LLM_API_KEY` (="empty-api-key"): API key to authenticate requests to the LLM provider
- `LLM_EMBEDDINGS_MODEL` (="Nomic-embed-text-v1.5"): name of the embeddings model to be consumed through the LLM provider
- `LLM_EMBEDDINGS_VECTOR_SIZE` (=768): embeddings vector size
- `LLM_PROVIDER` (="openai"): LLM provider backend to use. It can be either `openai` or `ollama` (gaianet offers an openai-compatible API)

Qdrant DB settings:

- `QDRANTDB_URL` (="http://localhost:6333"): Qdrant DB base URL
- `QDRANTDB_TIMEOUT` (=60): timeout for requests made to the Qdrant DB
- `QDRANTDB_DISTANCE_FN` (="COSINE"): score function to use during vector similarity search. Available functions: ['COSINE', 'EUCLID', 'DOT', 'MANHATTAN']

Pathway-related variables:

- `AUTOCOMMIT_DURATION_MS` (=1000): the maximum time between two commits. Every `autocommit_duration_ms` milliseconds, the updates received by the connector are committed automatically and pushed into Pathway's dataflow. More info can be found here
- `PATHWAY_RETRY_MAX_ATTEMPTS` (=10): max retries to be performed if a UDF async execution fails
- `PATHWAY_RETRY_DELAY_MS` (=1000): delay in milliseconds to wait before the next execution attempt
- `CHUNKING_BATCH_CAPACITY` (=0): max number of concurrent operations during data chunking
- `EMBEDDINGS_BATCH_CAPACITY` (=15): max number of concurrent operations during vector embeddings generation
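For reference, a minimal `config/.env` using the defaults above might look like this (values are illustrative, not mandatory):

```bash
# config/.env — illustrative values mirroring the defaults listed above
LLM_PROVIDER="openai"
LLM_API_BASE_URL="http://localhost:8080/v1"
LLM_API_KEY="empty-api-key"
LLM_EMBEDDINGS_MODEL="Nomic-embed-text-v1.5"
LLM_EMBEDDINGS_VECTOR_SIZE=768
QDRANTDB_URL="http://localhost:6333"
OUTPUT_FOLDER="./output"
```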
TBD

```bash
# build the local dev image
docker compose -f local.yml build

# build the prod image
docker compose -f prod.yml build

# remove any previously created containers, then start the prod stack
docker compose -f prod.yml rm -svf
docker compose -f prod.yml up
```
If you get a `pillow-heif` missing module error, set the following flag before re-installing dependencies:

```bash
export CFLAGS="-Wno-nullability-completeness"
```

If `libmagic` is missing:

```bash
brew install libmagic  # macOS (Homebrew)
# or, alternatively via pip
pip install python-magic-bin
```
This project uses the Vocs framework to generate the documentation site. If you want to run it locally and contribute, run the following commands:
```bash
pnpm install
pnpm run dev
```
To reflect any updates on https://raid-guild.github.io/gaianet-rag-api-pipeline/, you need to build and deploy the updated documentation to GitHub Pages by executing the following commands:
```bash
pnpm run build
pnpm run deploy
```
Presentation slides can be found here
🛠️ Built with ❤️ by RaidGuild