tjmlabs / ColiVara-Eval


Colivara Evaluation Project

This repository evaluates the Colivara API for document management, search, and retrieval, using a Retrieval-Augmented Generation (RAG) model. It assesses how well Colivara manages document collections and serves search queries, and it computes relevance metrics to quantify retrieval quality on each benchmark.

| Benchmark | Colivara | vidore_colqwen2-v1.0 (Current Leader) | vidore_colpali-v1.3 | vidore_colpali |
|---|---|---|---|---|
| Average | 87.6 ↓ | 89.3 | 84.8 | 81.3 |
| TAT-DQA | 71.7 ↓ | 81.4 | 70.4 | 65.8 |
| Shift Project | 91.3 ↑ | 90.7 | 77.4 | 73.2 |
| Artificial Intelligence | 99.5 ↑ | 99.4 | 97.4 | 96.2 |
| Government Reports | 96.7 ↑ | 96.3 | 96.2 | 92.7 |
| ArxivQA | 88.1 ↑ | 88.1 | 83.0 | 79.1 |
| DocVQA | 56.1 ↓ | 60.6 | 58.5 | 54.4 |
| Healthcare Industry | 98.3 ↑ | 98.1 | 96.9 | 94.4 |
| InfoVQA | 91.4 ↓ | 92.6 | 85.7 | 81.8 |
| Energy | 96.3 ↑ | 95.9 | 95.4 | 91.0 |
| TabFQuad | 86.3 ↓ | 89.5 | 87.4 | 83.9 |

Project Overview

The goal of this project is to evaluate Colivara’s document retrieval and management features, particularly for applications that rely on high-performance data search and retrieval. This includes testing Colivara's collection and document management, assessing its suitability for various search and retrieval scenarios, and benchmarking the platform with a RAG model to evaluate relevance based on real-world queries.

Evaluation Results

The table below summarizes Colivara API performance on each benchmark. Scores are NDCG@5 values reported on a 0 to 100 scale, shown alongside the average latency and the number of documents:

| Benchmark | Colivara Score | Avg Latency (s) | Num Docs |
|---|---|---|---|
| Average | 87.6 | - | - |
| ArxivQA | 88.1 | 11.1 | 500 |
| DocVQA | 56.1 | 9.3 | 500 |
| InfoVQA | 91.4 | 8.6 | 500 |
| Shift Project | 91.3 | 16.8 | 1000 |
| Artificial Intelligence | 99.5 | 12.8 | 1000 |
| Energy | 96.3 | 14.1 | 1000 |
| Government Reports | 96.7 | 14.0 | 1000 |
| Healthcare Industry | 98.3 | 20.0 | 1000 |
| TabFQuad | 86.3 | 8.1 | 280 |
| TAT-DQA | 71.7 | 20.0 | 1663 |
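
The Average row appears to be the unweighted mean of the ten per-benchmark scores. The short check below recomputes it from the values in the table; the variable names are illustrative only.

```python
from statistics import mean

# Colivara scores copied from the table above, in row order
scores = [88.1, 56.1, 91.4, 91.3, 99.5, 96.3, 96.7, 98.3, 86.3, 71.7]

# Unweighted mean, rounded to one decimal place like the table
average = round(mean(scores), 1)
print(average)  # 87.6
```

Because the mean is unweighted, a 280-document benchmark counts as much as a 1663-document one.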

Features

Requirements

Dependencies

The required Python packages are listed in requirements.txt.

Installation

  1. Clone the repository:

    git clone https://github.com/tjmlabs/ColiVara-Eval.git
    cd ColiVara-Eval
  2. Install the dependencies:

    pip install -r requirements.txt
  3. Configure Environment Variables:

    • Create a .env file in the root directory.
    • Add the following variables:
      COLIVARA_API_KEY=your_api_key_here
      COLIVARA_BASE_URL=https://api.colivara.com

Usage

The project has two main entry points: main.py, which upserts documents into Colivara collections, and evaluate.py, which runs the relevance evaluation.

Document Upsert with main.py

The main.py script upserts documents into Colivara collections. It can process a single dataset or all available datasets in one batch.

Key Arguments

Example Commands

1. Upserting a Single Dataset

To upsert documents from a specific dataset, run:

    python main.py --specific_file arxivqa_test_subsampled.pkl --collection_name arxivqa_collection --upsert

This command upserts all documents from arxivqa_test_subsampled.pkl into arxivqa_collection, creating the collection if it doesn't already exist.

2. Upserting All Datasets

To upsert documents for all datasets:

    python main.py --all_files --upsert

This command will loop through all datasets in DOCUMENT_FILES, upserting documents into their corresponding collections.
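
Conceptually, the batch mode is a loop over a file-to-collection mapping. The sketch below is illustrative and is not the code in main.py: the shape of DOCUMENT_FILES, the `upsert_all` function, and the `upsert_file` callable are all assumptions made for the example.

```python
from typing import Callable, Dict, Tuple

# Hypothetical mapping of dataset file -> collection name; the repository's
# actual DOCUMENT_FILES may be structured differently.
DOCUMENT_FILES: Dict[str, str] = {
    "arxivqa_test_subsampled.pkl": "arxivqa_collection",
}


def upsert_all(
    document_files: Dict[str, str],
    upsert_file: Callable[[str, str], int],
) -> Tuple[Dict[str, int], Dict[str, str]]:
    """Upsert each dataset into its collection.

    Returns (documents upserted per collection, error message per failed collection).
    """
    upserted: Dict[str, int] = {}
    failed: Dict[str, str] = {}
    for file_name, collection_name in document_files.items():
        try:
            # upsert_file stands in for the single-dataset path of main.py
            upserted[collection_name] = upsert_file(file_name, collection_name)
        except Exception as exc:  # keep going so one bad dataset does not stop the batch
            failed[collection_name] = str(exc)
    return upserted, failed
```

Collecting failures instead of raising on the first one lets a long batch finish and report which collections need a retry.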

Relevance Evaluation with evaluate.py

The evaluate.py script measures how relevant Colivara's search results are for the queries in each document collection.

Key Arguments

Example Commands

1. Evaluating a Single Collection

To evaluate the relevance of a specific collection, run:

    python evaluate.py --api_key "your_api_key_here" --collection_name arxivqa_collection

This command will evaluate the specified collection and output the relevance metrics based on NDCG@5.

2. Evaluating All Collections

To evaluate the relevance of all collections:

    python evaluate.py --api_key "your_api_key_here" --all_files

This command performs a relevance evaluation (NDCG@5) on all datasets listed in DOCUMENT_FILES and saves the results in the out/ directory.

Collection Management with collection_manager.py

The collection_manager.py script provides utilities for listing and deleting collections within Colivara.

Commands

File Structure

Configuration

The project configuration relies on environment variables defined in a .env file, including COLIVARA_API_KEY and COLIVARA_BASE_URL.

Use dotenv to load these values automatically at startup, and keep the .env file out of version control so that the API key is never committed.
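
A minimal sketch of loading this configuration is shown below. It is not the project's actual loader: the `ColivaraConfig` class and `load_config` function are hypothetical names, and the fallback base URL is simply the value given in the installation steps. The python-dotenv import is guarded so the sketch still runs when the variables are exported in the shell instead.

```python
import os
from dataclasses import dataclass

try:
    # python-dotenv, if installed, copies entries from .env into os.environ
    from dotenv import load_dotenv
except ImportError:
    load_dotenv = None  # rely on variables already present in the environment


@dataclass(frozen=True)
class ColivaraConfig:
    api_key: str
    base_url: str


def load_config() -> ColivaraConfig:
    """Read Colivara settings from the environment and fail fast if the key is missing."""
    if load_dotenv is not None:
        load_dotenv()  # by default this does not override variables that are already set
    api_key = os.environ.get("COLIVARA_API_KEY")
    if not api_key:
        raise RuntimeError("COLIVARA_API_KEY is not set; add it to .env or export it")
    base_url = os.environ.get("COLIVARA_BASE_URL", "https://api.colivara.com")
    return ColivaraConfig(api_key=api_key, base_url=base_url)
```

Failing early on a missing key gives a clearer error than an authentication failure partway through a long upsert or evaluation run.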

Technical Details

Discounted Cumulative Gain (DCG)

DCG is a measure of relevance that considers the position of relevant results in the returned list. It assigns higher scores to results that appear earlier.
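
One common formulation is given below. It is shown as an assumption, since this README does not state which DCG variant the scripts use. Here rel_i is the relevance of the result at rank i and k is the cutoff:

```latex
\mathrm{DCG@}k = \sum_{i=1}^{k} \frac{rel_i}{\log_2(i + 1)}
```

For example, if the only relevant result (rel = 1) appears at rank 2, then DCG@5 = 1 / log2(3) ≈ 0.631, compared with 1.0 had it appeared at rank 1.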

Normalized Discounted Cumulative Gain (NDCG)

NDCG normalizes DCG by dividing it by the ideal DCG (IDCG) for a given query, providing a score between 0 and 1. In this project, we calculate NDCG@5 to evaluate the top 5 search results for each query.
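
The definition above can be sketched in a few lines. This is a minimal illustration, not the project's own implementation: the function names are hypothetical, the log2(rank + 1) discount is an assumed (though common) choice, and the ideal ordering is obtained by re-sorting the relevances of the returned results, which is a simplification when relevant documents are missing from the list.

```python
import math
from typing import Sequence


def dcg_at_k(relevances: Sequence[float], k: int = 5) -> float:
    """DCG over the first k results; the result at 1-based rank i is discounted by log2(i + 1)."""
    return sum(
        rel / math.log2(rank + 1)
        for rank, rel in enumerate(relevances[:k], start=1)
    )


def ndcg_at_k(relevances: Sequence[float], k: int = 5) -> float:
    """DCG divided by the DCG of the ideal (descending) ordering; 0.0 if nothing is relevant."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0


print(ndcg_at_k([1, 0, 0, 0, 0]))  # relevant document ranked first
print(ndcg_at_k([0, 1, 0, 0, 0]))  # relevant document ranked second
```

The first call yields a perfect score, while the second is penalized by the rank-2 discount.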

Search Query Evaluation

The evaluation process includes:

  1. Query Processing: Matching queries against document metadata.
  2. Relevance Scoring: Using true document IDs to calculate relevance scores.
  3. NDCG Calculation: Aggregating scores to calculate the average relevance.
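
The three steps above can be sketched as follows. This is a simplified illustration under stated assumptions, not the code in evaluate.py: each query is assumed to have exactly one true document ID, and `search` is a hypothetical callable that stands in for the Colivara search request and returns ranked document IDs.

```python
import math
from statistics import mean
from typing import Callable, Iterable, List, Tuple


def ndcg_single_relevant(ranked_ids: List[str], true_id: str, k: int = 5) -> float:
    """NDCG@k with one relevant document: 1 / log2(rank + 1), or 0.0 if it is not in the top k."""
    for rank, doc_id in enumerate(ranked_ids[:k], start=1):
        if doc_id == true_id:
            return 1.0 / math.log2(rank + 1)
    return 0.0


def evaluate_queries(
    queries: Iterable[Tuple[str, str]],
    search: Callable[[str, int], List[str]],
    k: int = 5,
) -> float:
    """Average NDCG@k over (query text, true document ID) pairs."""
    scores = [
        ndcg_single_relevant(search(query, k), true_id, k)  # steps 1 and 2
        for query, true_id in queries
    ]
    return mean(scores) if scores else 0.0  # step 3
```

With a single relevant document the ideal DCG is 1, so the per-query score depends only on the rank at which the true document is returned.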

Future Enhancements

  1. Parallel Processing: Optimize data loading and evaluation functions for concurrent processing.
  2. Extended Metrics: Add other evaluation metrics like Mean Reciprocal Rank (MRR).
  3. Benchmarking with Larger Datasets: Test Colivara's scalability with larger data volumes.
  4. Automated Testing: Integrate unit and integration tests for CI/CD compatibility.
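
As an illustration of the extended-metrics item, Mean Reciprocal Rank could be computed along these lines. The function names and the input shape (a ranked list of document IDs paired with one true ID per query) are assumptions for the sketch, not part of the current codebase.

```python
from typing import List, Sequence, Tuple


def reciprocal_rank(ranked_ids: Sequence[str], true_id: str) -> float:
    """1 / rank of the true document (1-based), or 0.0 if it was not retrieved."""
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id == true_id:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(results: List[Tuple[Sequence[str], str]]) -> float:
    """Average reciprocal rank over (ranked IDs, true ID) pairs; 0.0 for an empty list."""
    if not results:
        return 0.0
    return sum(reciprocal_rank(ranked, true_id) for ranked, true_id in results) / len(results)
```

Unlike NDCG, MRR discounts by 1 / rank rather than logarithmically, so it penalizes a true document at rank 2 or lower more heavily.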

License

This project is licensed under the MIT License - see the LICENSE file for details.