raid-guild / gaianet-rag-api-pipeline

Supercharge your Gaianet node by generating a vector knowledge base from any API. Demo slides: https://hackmd.io/@santteegt/ByoykY4nC#/ Link to Docs below
https://raid-guild.github.io/gaianet-rag-api-pipeline/
MIT License

GaiaNet x RAG API Pipeline

rag-api-pipeline is a Python-based data pipeline tool that lets you generate a vector knowledge base from any REST API data source. The resulting database snapshot can then be plugged into a Gaia node's LLM, together with a prompt, to provide contextual responses to user queries using RAG (Retrieval-Augmented Generation).

The following sections help you quickly set up and run the pipeline on your REST API. If you're looking for more in-depth information about how to use this tool, its tech stack, or how it works under the hood, check the content menu on the left.

System Requirements

Setup Instructions

1. Clone this repository

Git clone or download this repository to your local machine.

2. Activate your virtual environment

If you use a custom virtual environment, activate it now. Otherwise, Poetry will manage the environment for you.

3. Install project dependencies

Navigate to the directory where this repository was cloned or downloaded and run the following in a terminal:

poetry install

4. Set environment variables

Copy config/.env.sample to config/.env and set the environment variables accordingly. Check the environment variables section for details.

5. Define your API Pipeline manifest

Define the pipeline manifest for the REST API you want to extract data from. See Defining an API Pipeline Manifest for details on how to write one, or take a look at the in-depth review of the sample manifests available in API Examples.

6. Set the REST API Key

Set the REST API key in a config/secrets/api_key file, or pass it to the CLI using the --api-key argument.
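The two ways of supplying the key can be pictured with a small Python sketch. This is not the tool's actual implementation: the function name `resolve_api_key` and the precedence (CLI value first, then the secrets file) are assumptions made for illustration only.

```python
from pathlib import Path
from typing import Optional


def resolve_api_key(cli_value: Optional[str], secrets_file: Path) -> str:
    """Return the API key from the CLI value if given, else from the secrets file.

    Hypothetical helper: the precedence shown here is an assumption,
    not documented behavior of rag-api-pipeline.
    """
    if cli_value:
        return cli_value.strip()
    if secrets_file.is_file():
        # Strip the trailing newline that most editors add to the file.
        key = secrets_file.read_text(encoding="utf-8").strip()
        if key:
            return key
    raise ValueError(f"No API key found: pass --api-key or create {secrets_file}")
```

For example, `resolve_api_key(None, Path("config/secrets/api_key"))` would read the key from the file described above, while passing a non-empty first argument would skip the file entirely.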

7. Set up a Qdrant DB instance

Get the base URL of your Qdrant vector DB, or deploy a local Qdrant (Docs) vector database instance using Docker:

# IMPORTANT: make sure you use `qdrant:v1.10.1` for compatibility with Gaianet node
docker run -p 6333:6333 -p 6334:6334 -v ./qdrant_dev:/qdrant/storage:z qdrant/qdrant:v1.10.1
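Once the container is up, you may want to confirm that it is reachable and reports the pinned version. The following is a minimal Python sketch using only the standard library. It assumes the instance listens on http://localhost:6333 (the port mapped above) and that the Qdrant root URL returns a JSON document with a `version` field. The helper names are illustrative and not part of rag-api-pipeline.

```python
import json
import urllib.request

REQUIRED_VERSION = "1.10.1"  # version pinned by the docker command above


def parse_qdrant_version(payload: bytes) -> str:
    """Extract the version string from the JSON served at the Qdrant root URL."""
    info = json.loads(payload)
    return str(info["version"])


def check_qdrant(base_url: str = "http://localhost:6333", timeout: float = 5.0) -> bool:
    """Return True if the instance at base_url reports the pinned version.

    Assumption: a GET on the root URL answers with JSON containing "version".
    Raises urllib.error.URLError if the instance is not reachable.
    """
    with urllib.request.urlopen(base_url, timeout=timeout) as response:
        version = parse_qdrant_version(response.read())
    return version == REQUIRED_VERSION
```

Calling `check_qdrant()` after starting the container should return True when the image tag matches. A connection error usually means the container is not running or the port mapping differs.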

8. Select and set up an LLM provider

Get your Gaianet node running (Docs) or install the Ollama provider (Docs) locally. The latter is recommended if you want to run the pipeline on consumer hardware.

9. Load an LLM embeddings model

Load the LLM embeddings model of your preference into the LLM provider you chose in the previous step:

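# Ollama: put the following line in a file named Modelfile
FROM <path/to/model>/nomic-embed-text-v1.5.f16.gguf

# Then, from the directory that contains the Modelfile, create and inspect the model
ollama create Nomic-embed-text-v1.5 -f Modelfile
ollama show Nomic-embed-text-v1.5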
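To verify that the embeddings model responds, you can request a vector for a short test string. The sketch below targets Ollama's `/api/embeddings` endpoint and assumes Ollama is serving on its default address, http://localhost:11434. The helper names are illustrative, and the pipeline itself may call the provider differently.

```python
import json
import urllib.request
from typing import List


def build_embedding_request(
    model: str, text: str, base_url: str = "http://localhost:11434"
) -> urllib.request.Request:
    """Build a POST request for Ollama's embeddings endpoint."""
    body = json.dumps({"model": model, "prompt": text}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/api/embeddings",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def embed(model: str, text: str, timeout: float = 30.0) -> List[float]:
    """Send the request and return the embedding vector from the JSON reply."""
    request = build_embedding_request(model, text)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read())["embedding"]
```

With the model created as above, `len(embed("Nomic-embed-text-v1.5", "hello"))` gives the vector size that the model produces.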

Pipeline CLI

Now you're ready to use the rag-api-pipeline CLI commands to run the different stages of the RAG pipeline, from extracting data from an API source to generating vector embeddings and a database snapshot. For more details about the parameters available for each command, run:

poetry run rag-api-pipeline <command> --help

CLI available commands

Below you can find the default instructions available and an in-depth review of both the functionality and available arguments that each command offers:

# run the entire pipeline
poetry run rag-api-pipeline run-all API_MANIFEST_FILE --openapi-spec-file <openapi-spec-yaml-file> [--full-refresh] [--llm-provider openapi|ollama]
# or run using an already normalized dataset
poetry run rag-api-pipeline from-normalized API_MANIFEST_FILE --normalized-data-file <jsonl-file> [--llm-provider openapi|ollama]
# or run using an already chunked dataset
poetry run rag-api-pipeline from-chunked API_MANIFEST_FILE --chunked-data-file <jsonl-file> [--llm-provider openapi|ollama]
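If you prefer to drive the pipeline from a script, for example in a scheduled job, the run-all invocation can be assembled in Python. This is only a sketch: the wrapper functions are hypothetical, and the options are limited to the ones listed for run-all above.

```python
import subprocess
from typing import List, Optional


def build_run_all_command(
    manifest_file: str,
    openapi_spec_file: str,
    full_refresh: bool = False,
    llm_provider: Optional[str] = None,
) -> List[str]:
    """Assemble the argument list for the run-all command."""
    command = [
        "poetry", "run", "rag-api-pipeline", "run-all",
        manifest_file,
        "--openapi-spec-file", openapi_spec_file,
    ]
    if full_refresh:
        command.append("--full-refresh")
    if llm_provider is not None:
        command += ["--llm-provider", llm_provider]
    return command


def run_all(manifest_file: str, openapi_spec_file: str, **options) -> int:
    """Run the pipeline from the repository root and return its exit code."""
    command = build_run_all_command(manifest_file, openapi_spec_file, **options)
    return subprocess.run(command, check=False).returncode
```

Passing the arguments as a list avoids shell quoting issues with file paths that contain spaces.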

CLI Output

Cached API stream data and results produced from running any of the CLI commands are stored in <OUTPUT_FOLDER>/<api_name>. The following files and folders are created by the tool within this baseDir folder:

Environment variables

The following environment variables can be adjusted in config/.env based on user needs:

Using Docker Compose for local development or in production

TBD

Troubleshooting

Workaround in case of missing one of the following dependencies

Documentation

This project uses the Vocs framework to generate the documentation site. To run it locally and contribute, run the following commands:

pnpm install
pnpm run dev

To publish updates to https://raid-guild.github.io/gaianet-rag-api-pipeline/, build and deploy the updated documentation to GitHub Pages by running the following commands:

pnpm run build
pnpm run deploy

Demo

Presentation slides can be found at https://hackmd.io/@santteegt/ByoykY4nC#/

License

MIT

Authors

🛠️ Built 🛠️ with ❤️ by RaidGuild