khive-ai / khive.d

Autonomous software engineering department with github/roo
Apache License 2.0

Add documentation and examples for Reader Microservice #30

Closed ohdearquant closed 3 weeks ago

ohdearquant commented 1 month ago

## Objective

Create comprehensive documentation and usage examples for the Reader Microservice.

## Description

This issue covers detailed documentation and examples for the Reader Microservice, so that users can understand how to use it effectively. The documentation will include an architecture overview, installation instructions, usage examples, and performance considerations.

## Implementation Details

### 1. Create main README for Reader Microservice

Create `docs/reader/README.md`:

# khive Reader Microservice

A modular document processing and semantic search engine built on pgvector, MinIO, and Pydapter.

## Overview

The Reader Microservice provides document ingestion, processing, and semantic search capabilities through a simple CLI interface. It's designed to start as a modular monolith for simplicity, with clear boundaries for potential microservice extraction as scale demands.

### Key Features

- **Document Ingestion**: Upload documents from URLs or local files
- **Text Extraction**: Extract text from PDFs, DOCXs, HTML, and more
- **Vector Search**: Find semantically similar content across documents
- **Embedding Generation**: Create embeddings using OpenAI or local models
- **Performance Monitoring**: Track system metrics and set thresholds

### Architecture

```ascii
┌─────────────────────────────────────────────────────────────┐
│                   Reader Service                            │
│                                                             │
│  ┌─────────────┐   ┌────────────┐   ┌───────────────────┐   │
│  │ Document    │   │ Processing │   │ Vector Search     │   │
│  │ Ingestion   │◄─►│ Pipeline   │◄─►│ & Retrieval       │   │
│  │ Module      │   │ Module     │   │ Module            │   │
│  └─────────────┘   └────────────┘   └───────────────────┘   │
│         │               │                     │              │
│         ▼               ▼                     ▼              │
│     ┌───────────────────────────────────────────────────┐   │
│     │           Data Access Layer                        │   │
│     │    (Pydapter repositories for data persistence)    │   │
│     └───────────────────────────────────────────────────┘   │
│         │               │                     │              │
└─────────┼───────────────┼─────────────────────┼──────────────┘
          │               │                     │
   ┌──────▼───────┐ ┌─────▼─────┐        ┌─────▼─────┐
   │  MinIO/S3    │ │ PostgreSQL│        │ Background │
   │  Object Store│ │ + pgvector│        │ Task Queue │
   └──────────────┘ └───────────┘        └───────────┘
```

## Installation

### Prerequisites

### Setup

  1. Install the package:
pip install khive
  2. Set up environment variables:
# PostgreSQL connection
export KHIVE_READER_DB_URL="postgresql+asyncpg://user:password@localhost/khive_reader"

# MinIO/S3 connection
export KHIVE_READER_STORAGE_ENDPOINT="http://localhost:9000"
export KHIVE_READER_STORAGE_ACCESS_KEY="minioadmin"
export KHIVE_READER_STORAGE_SECRET_KEY="minioadmin"
export KHIVE_READER_STORAGE_BUCKET="khive-reader"

# OpenAI API (optional)
export OPENAI_API_KEY="your-key-here"
  3. Initialize the database:
khive db init
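
As an illustration of how the variables above might be consumed, the sketch below gathers them into one settings object and fails fast when a required one is missing. It is an assumption about a possible consumer, not khive's actual configuration code: `ReaderSettings` and `load_settings` are hypothetical names, and only the environment variable names come from the setup steps.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional

# Variable names taken from the setup step; OPENAI_API_KEY is optional.
REQUIRED_VARS = (
    "KHIVE_READER_DB_URL",
    "KHIVE_READER_STORAGE_ENDPOINT",
    "KHIVE_READER_STORAGE_ACCESS_KEY",
    "KHIVE_READER_STORAGE_SECRET_KEY",
    "KHIVE_READER_STORAGE_BUCKET",
)


@dataclass(frozen=True)
class ReaderSettings:
    """Hypothetical container for the Reader environment settings."""

    db_url: str
    storage_endpoint: str
    storage_access_key: str
    storage_secret_key: str
    storage_bucket: str
    openai_api_key: Optional[str] = None


def load_settings(env: Mapping[str, str] = os.environ) -> ReaderSettings:
    """Build settings from `env`, reporting every missing variable at once."""
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return ReaderSettings(
        db_url=env["KHIVE_READER_DB_URL"],
        storage_endpoint=env["KHIVE_READER_STORAGE_ENDPOINT"],
        storage_access_key=env["KHIVE_READER_STORAGE_ACCESS_KEY"],
        storage_secret_key=env["KHIVE_READER_STORAGE_SECRET_KEY"],
        storage_bucket=env["KHIVE_READER_STORAGE_BUCKET"],
        openai_api_key=env.get("OPENAI_API_KEY") or None,
    )
```

Passing the mapping explicitly (instead of always reading `os.environ`) keeps the loader easy to exercise in tests.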

## Usage

### Ingesting Documents

# Ingest from URL
khive reader ingest --url https://example.com/document.pdf

# Ingest with metadata
khive reader ingest --url https://example.com/document.pdf --metadata metadata.json

# Get JSON output
khive reader ingest --url https://example.com/document.pdf --json

### Searching Documents

# Basic search
khive reader search "quantum computing applications"

# Limit results
khive reader search "quantum computing applications" --top-k 3

# Search specific document
khive reader search "quantum computing applications" --document-id 123e4567-e89b-12d3-a456-426614174000

# Get JSON output
khive reader search "quantum computing applications" --json

### Monitoring Performance

# Check performance thresholds
khive reader performance

# Get JSON output
khive reader performance --json

## Performance Considerations

## Development

### Running with Docker

docker-compose up -d

### Testing

pytest tests/

## License

Apache License 2.0


### 2. Create Quickstart Guide

Create `docs/reader/quickstart.md`:

# Reader Microservice Quickstart

This guide will help you get started with the khive Reader Microservice for document processing and semantic search.

## 5-Minute Setup

### 1. Install and Configure

```bash
# Install khive with reader components
pip install khive

# Start PostgreSQL (with pgvector) and MinIO with Docker
docker run -d --name postgres -p 5432:5432 \
  -e POSTGRES_PASSWORD=postgres -e POSTGRES_USER=postgres -e POSTGRES_DB=khive_reader \
  ankane/pgvector:latest
docker run -d --name minio -p 9000:9000 -p 9001:9001 \
  -e MINIO_ROOT_USER=minioadmin -e MINIO_ROOT_PASSWORD=minioadmin \
  minio/minio server /data --console-address ":9001"

# Set environment variables
export KHIVE_READER_DB_URL="postgresql+asyncpg://postgres:postgres@localhost/khive_reader"
export KHIVE_READER_STORAGE_ENDPOINT="http://localhost:9000"
export KHIVE_READER_STORAGE_ACCESS_KEY="minioadmin"
export KHIVE_READER_STORAGE_SECRET_KEY="minioadmin"
export KHIVE_READER_STORAGE_BUCKET="khive-reader"
```

### 2. Initialize Database

# Initialize the database schema and extensions
khive db ping  # Verify connection
alembic upgrade head

### 3. Ingest a Document

# Ingest a PDF from the web
khive reader ingest --url https://arxiv.org/pdf/2303.08774.pdf
# Sample output:
# ✓ Document ingested successfully
# Document ID: 123e4567-e89b-12d3-a456-426614174000

### 4. Search the Document

# Search across all documents
khive reader search "large language models"
# Sample output:
# Search results for: "large language models"
# Found 3 matches
# 
# 1. [Similarity: 92.5%]
#    Document ID: 123e4567-e89b-12d3-a456-426614174000
#    Text: Large language models (LLMs) have demonstrated remarkable capabilities in natural language understanding...

## Next Steps

## Troubleshooting

### Common Issues

PostgreSQL Connection Error:

Error: Could not connect to PostgreSQL database

Solution: Ensure PostgreSQL is running and the connection URL is correct.

MinIO Connection Error:

Error: Could not connect to MinIO endpoint

Solution: Verify MinIO is running and credentials are correct.

Embedding Generation Error:

Error: Could not generate embeddings

Solution: Set OPENAI_API_KEY, or verify connectivity to the OpenAI API.
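
When diagnosing either connection error, a quick first step is to confirm that the configured host and port accept TCP connections at all. The following is a minimal sketch using only the standard library; `endpoint_host_port` and `is_reachable` are hypothetical helper names that are not part of khive, and the default ports are assumptions applied only when the URL omits one.

```python
import socket
from typing import Tuple
from urllib.parse import urlsplit

# Assumed default ports, used only when the URL does not specify a port.
DEFAULT_PORTS = {
    "postgresql": 5432,
    "postgresql+asyncpg": 5432,
    "http": 80,
    "https": 443,
}


def endpoint_host_port(url: str) -> Tuple[str, int]:
    """Extract (host, port) from a connection URL such as KHIVE_READER_DB_URL."""
    parts = urlsplit(url)
    if not parts.hostname:
        raise ValueError(f"No host found in URL: {url!r}")
    port = parts.port or DEFAULT_PORTS.get(parts.scheme)
    if port is None:
        raise ValueError(f"No port given and no default for scheme {parts.scheme!r}")
    return parts.hostname, port


def is_reachable(url: str, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to the URL's host and port succeeds."""
    host, port = endpoint_host_port(url)
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

A successful TCP connection only shows that something is listening; wrong credentials or a missing database would still surface as errors from khive itself.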

### Getting Help

If you encounter issues, please:

  1. Check the FAQ
  2. Review the GitHub issues
  3. Join our Discord community

### 3. Create Example Notebook

Create `docs/reader/examples/basic_usage.ipynb`, a Jupyter notebook with examples of using the Reader Microservice.

### 4. Create Architecture Documentation

Create `docs/reader/architecture.md`:

# Reader Microservice Architecture

## Design Philosophy

The Reader Microservice follows several key design principles:

1. **Start Simple**: Begin with a modular monolith for simplicity during development
2. **Clear Boundaries**: Maintain clean service interfaces for future separation
3. **DRY with Pydapter**: Single source of truth for data models
4. **Performance Metrics**: Track key indicators to guide scaling decisions

## Components

### Domain Models

The system uses Pydantic models annotated with Pydapter's `@orm_model` decorator:

```python
from datetime import UTC, datetime  # datetime.UTC requires Python 3.11+
from typing import Any, Dict, Optional
from uuid import UUID, uuid4

from pydantic import BaseModel, Field

# NOTE: `orm_model`, `Vector`, and `DocumentStatus` are not defined in this
# snippet; their import paths are not specified in this issue.


@orm_model
class Document(BaseModel):
    """Primary document metadata model."""
    id: UUID = Field(default_factory=uuid4)
    source_uri: str
    title: Optional[str] = None
    raw_file_uri: Optional[str] = None
    text_file_uri: Optional[str] = None
    status: DocumentStatus = DocumentStatus.PENDING
    error: Optional[str] = None
    metadata: Dict[str, Any] = Field(default_factory=dict)
    created_at: datetime = Field(default_factory=lambda: datetime.now(UTC))
    updated_at: datetime = Field(default_factory=lambda: datetime.now(UTC))


@orm_model
class DocumentChunk(BaseModel):
    """Chunk of text from a document with an embedding vector."""
    id: UUID = Field(default_factory=uuid4)
    document_id: UUID
    text: str
    sequence: int
    metadata: Dict[str, Any] = Field(default_factory=dict)
    embedding: Optional[Vector[1536]] = None
```

These models are automatically converted to SQLAlchemy tables for PostgreSQL.

### Document Ingestion Flow

1. User calls `khive reader ingest --url <url>`
2. `DocumentIngestionService`:
   - Creates a `Document` record in the database
   - Downloads content from the URL
   - Stores raw content in MinIO
   - Queues the document for processing
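
The four ingestion steps can be sketched with in-memory stand-ins for the database, object store, and task queue. This is an illustrative outline only, not the real `DocumentIngestionService`: `DocumentRecord` and `IngestionSketch` are hypothetical names, the `raw/<id>` key layout is an assumption, and the download function is injected so the example runs without network access.

```python
import uuid
from collections import deque
from dataclasses import dataclass, field
from typing import Callable, Deque, Dict, Optional


@dataclass
class DocumentRecord:
    """Simplified, hypothetical stand-in for the Document model."""

    source_uri: str
    id: uuid.UUID = field(default_factory=uuid.uuid4)
    status: str = "PENDING"
    raw_file_uri: Optional[str] = None


class IngestionSketch:
    """Mirrors the ingestion steps using dicts and a deque instead of real services."""

    def __init__(self, download: Callable[[str], bytes]) -> None:
        self.download = download
        self.documents: Dict[uuid.UUID, DocumentRecord] = {}  # stands in for PostgreSQL
        self.object_store: Dict[str, bytes] = {}  # stands in for MinIO
        self.queue: Deque[uuid.UUID] = deque()  # stands in for the task queue

    def ingest(self, url: str) -> DocumentRecord:
        doc = DocumentRecord(source_uri=url)
        self.documents[doc.id] = doc  # 1. create the Document record
        content = self.download(url)  # 2. download content from the URL
        key = f"raw/{doc.id}"  # assumed key layout
        self.object_store[key] = content  # 3. store the raw content
        doc.raw_file_uri = key
        self.queue.append(doc.id)  # 4. queue the document for processing
        return doc
```

Creating the record before downloading means a failed download can still be recorded against a document ID, which matches the `status` and `error` fields on the `Document` model.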

### Document Processing Flow

1. `AsyncTaskQueue` picks up the processing task
2. `DocumentProcessingService`:
   - Extracts text from the document using the appropriate parser
   - Chunks text into smaller segments
   - Generates embeddings for each chunk
   - Stores chunks with embeddings in PostgreSQL
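
The chunking step is not specified in this issue. One common approach, shown here purely as an assumption, is a fixed-size word window with overlap so that sentences straddling a boundary appear in both neighbouring chunks. `chunk_text` and its default sizes are hypothetical.

```python
from typing import List


def chunk_text(text: str, chunk_size: int = 200, overlap: int = 50) -> List[str]:
    """Split text into chunks of up to `chunk_size` words.

    Consecutive chunks share `overlap` words.
    """
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks: List[str] = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # this window already reached the end of the text
    return chunks
```

Each returned string would then correspond to one `DocumentChunk`, with its position in the list serving as the `sequence` value.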

### Search Flow

1. User calls `khive reader search "<query>"`
2. `DocumentSearchService`:
   - Generates an embedding for the query text
   - Performs vector similarity search using pgvector
   - Returns ranked results with context
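
To make the ranking step concrete, the sketch below scores stored chunks against a query embedding in plain Python. In the real service this work would be done inside PostgreSQL by pgvector; the issue does not state which distance metric is used, so cosine similarity is an assumption here, and `cosine_similarity` and `top_k` are hypothetical helpers.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(
    query: Sequence[float],
    chunks: Sequence[Tuple[str, Sequence[float]]],
    k: int = 3,
) -> List[Tuple[str, float]]:
    """Return the k (text, similarity) pairs most similar to the query."""
    scored = [(text, cosine_similarity(query, emb)) for text, emb in chunks]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:k]
```

The `k` parameter plays the role of the `--top-k` CLI option, and the similarity score is the kind of value the sample output reports as a percentage.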

### Data Access Layer

The data access layer uses Pydapter repositories:

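from typing import Generic, Optional, Type, TypeVar
from uuid import UUID

# AsyncSession is assumed to be SQLAlchemy's async session class.
from sqlalchemy.ext.asyncio import AsyncSession

# AsyncPostgresAdapter comes from Pydapter; its import path is not given in this issue.

T = TypeVar("T")


class Repository(Generic[T]):
    """Base repository using Pydapter's AsyncPostgresAdapter."""

    def __init__(self, model_cls: Type[T], db_session: AsyncSession):
        self.model_cls = model_cls
        self.db_session = db_session
        self.adapter = AsyncPostgresAdapter(self.model_cls, self.db_session)

    async def get(self, id: UUID) -> Optional[T]:
        """Fetch a single record by its ID."""
        return await self.adapter.from_obj(id)

    # More methods...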
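
For unit tests that should not need a database, a repository with the same async `get` shape can be backed by a dictionary. This is a hypothetical test double, not part of Pydapter or khive: `InMemoryRepository`, its `add` method, and the `Note` model are assumptions made for illustration.

```python
import asyncio
from dataclasses import dataclass, field
from typing import Dict, Generic, Optional, TypeVar
from uuid import UUID, uuid4

T = TypeVar("T")


class InMemoryRepository(Generic[T]):
    """Dictionary-backed stand-in exposing an async `get` like the repository above."""

    def __init__(self) -> None:
        self._items: Dict[UUID, T] = {}

    async def add(self, item: T) -> T:
        # Assumes every stored model exposes an `id` attribute.
        self._items[item.id] = item
        return item

    async def get(self, id: UUID) -> Optional[T]:
        return self._items.get(id)


@dataclass
class Note:
    """Tiny example model used only for this demonstration."""

    text: str
    id: UUID = field(default_factory=uuid4)


async def demo() -> bool:
    repo: InMemoryRepository[Note] = InMemoryRepository()
    note = await repo.add(Note(text="hello"))
    found = await repo.get(note.id)
    missing = await repo.get(uuid4())
    return found is note and missing is None


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

Because services depend only on the repository interface, swapping this double in for the PostgreSQL-backed version requires no changes to calling code.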

## Monitoring and Observability

The system uses Prometheus metrics and defined thresholds to monitor:

## Evolution Path

As scale increases, the system can evolve:

  1. Initial State: Modular monolith with PostgreSQL+pgvector
  2. First Extraction: Move vector search to dedicated Qdrant service
  3. Second Extraction: Separate document processing into async workers
  4. Final State: Full microservice architecture with optimized components
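
One way to keep the first extraction cheap is to have search code depend on a narrow interface, not on pgvector directly. The sketch below illustrates that idea under stated assumptions: `VectorIndex`, `InMemoryIndex`, and `nearest_chunk_ids` are hypothetical names, and the toy backend uses squared Euclidean distance only to stay self-contained.

```python
from typing import Dict, List, Protocol, Sequence, Tuple


class VectorIndex(Protocol):
    """Hypothetical boundary between the search service and its vector store."""

    def upsert(self, chunk_id: str, embedding: Sequence[float]) -> None: ...

    def query(self, embedding: Sequence[float], k: int) -> List[Tuple[str, float]]: ...


class InMemoryIndex:
    """Toy backend; a pgvector- or Qdrant-backed class would expose the same methods."""

    def __init__(self) -> None:
        self._vectors: Dict[str, Tuple[float, ...]] = {}

    def upsert(self, chunk_id: str, embedding: Sequence[float]) -> None:
        self._vectors[chunk_id] = tuple(embedding)

    def query(self, embedding: Sequence[float], k: int) -> List[Tuple[str, float]]:
        def distance(vec: Sequence[float]) -> float:
            # Squared Euclidean distance: smaller means closer.
            return sum((a - b) ** 2 for a, b in zip(vec, embedding))

        ranked = sorted(self._vectors.items(), key=lambda kv: distance(kv[1]))
        return [(chunk_id, distance(vec)) for chunk_id, vec in ranked[:k]]


def nearest_chunk_ids(index: VectorIndex, embedding: Sequence[float], k: int = 3) -> List[str]:
    """Search logic that depends only on the protocol, not on a concrete backend."""
    return [chunk_id for chunk_id, _ in index.query(embedding, k)]
```

With this shape, moving vector search to a dedicated service means writing one new class that satisfies `VectorIndex`, while callers such as `nearest_chunk_ids` stay unchanged.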

### 5. Update Main khive README

Update the main `README.md` to include information about the Reader Microservice:

# Add to the existing README.md under "Command Catalogue"

| Command         | What it does                                                                                                     |
|-----------------|------------------------------------------------------------------------------------------------------------------|
| `khive reader`  | Document processing and semantic search engine with vector embeddings. Ingest documents and find similar content. |

# Add to "Usage Examples" section

# ingest a document from URL
khive reader ingest --url https://example.com/document.pdf

# search across all documents
khive reader search "quantum computing applications" 

# check reader performance thresholds
khive reader performance

## Testing

Ensure documentation is accurate by:

  1. Running all example commands to verify they work as described
  2. Validating architectural diagrams match the implemented system
  3. Testing installation instructions on a fresh environment

## Acceptance Criteria

## Dependencies

## Estimated Effort

Medium (2 days)

ohdearquant commented 1 month ago

This issue is currently out of scope for today's focus on core async, connections, endpoints, and their documentation. It will be addressed in a future iteration focusing on the Reader Microservice.