Create comprehensive documentation and usage examples for the Reader Microservice.
## Description
This issue involves creating detailed documentation and examples for the Reader Microservice to help users understand how to use it effectively. Documentation will include architecture overview, installation instructions, usage examples, and performance considerations.
## Implementation Details
### 1. Create main README for Reader Microservice
Create `docs/reader/README.md`:
# khive Reader Microservice
A modular document processing and semantic search engine built on pgvector, MinIO, and Pydapter.
## Overview
The Reader Microservice provides document ingestion, processing, and semantic search capabilities through a simple CLI interface. It's designed to start as a modular monolith for simplicity, with clear boundaries for potential microservice extraction as scale demands.
### Key Features
- **Document Ingestion**: Upload documents from URLs or local files
- **Text Extraction**: Extract text from PDFs, DOCXs, HTML, and more
- **Vector Search**: Find semantically similar content across documents
- **Embedding Generation**: Create embeddings using OpenAI or local models
- **Performance Monitoring**: Track system metrics and set thresholds
### Architecture
```ascii
┌─────────────────────────────────────────────────────────────┐
│                       Reader Service                        │
│                                                             │
│  ┌─────────────┐   ┌────────────┐   ┌───────────────────┐   │
│  │  Document   │   │ Processing │   │   Vector Search   │   │
│  │  Ingestion  │◄─►│  Pipeline  │◄─►│    & Retrieval    │   │
│  │   Module    │   │   Module   │   │      Module       │   │
│  └─────────────┘   └────────────┘   └───────────────────┘   │
│         │                │                    │             │
│         ▼                ▼                    ▼             │
│  ┌──────────────────────────────────────────────────────┐   │
│  │                  Data Access Layer                   │   │
│  │     (Pydapter repositories for data persistence)     │   │
│  └──────────────────────────────────────────────────────┘   │
│         │                │                    │             │
└─────────┼────────────────┼────────────────────┼─────────────┘
          │                │                    │
   ┌──────▼───────┐ ┌──────▼─────┐        ┌─────▼──────┐
   │   MinIO/S3   │ │ PostgreSQL │        │ Background │
   │ Object Store │ │ + pgvector │        │ Task Queue │
   └──────────────┘ └────────────┘        └────────────┘
```
## Installation
### Prerequisites
- Python 3.11+
- PostgreSQL with the pgvector extension installed
- MinIO (or another S3-compatible object store)
- OpenAI API key (optional, for embedding generation)
## Performance Considerations
- **Vector Count**: The system is optimized for up to 5 million vectors with pgvector
- **Search Latency**: Typical p95 latency should be under 100ms
- **Scaling**: Consider extracting components when performance thresholds are exceeded
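
To make the latency target concrete, here is a minimal sketch of checking a p95 figure against the 100ms target. It assumes latency samples are available as a plain list of milliseconds and uses the nearest-rank percentile definition; `p95`, `within_latency_target`, and `LATENCY_TARGET_MS` are illustrative names and not part of khive.

```python
LATENCY_TARGET_MS = 100.0  # p95 target quoted in the list above


def p95(samples_ms: list[float]) -> float:
    """Return the 95th percentile of the samples (nearest-rank method)."""
    if not samples_ms:
        raise ValueError("need at least one latency sample")
    ordered = sorted(samples_ms)
    # Integer ceiling of 0.95 * n, giving a 1-based rank into the sorted list.
    rank = (95 * len(ordered) + 99) // 100
    return ordered[rank - 1]


def within_latency_target(samples_ms: list[float]) -> bool:
    """True when the p95 of the samples is at or below the target."""
    return p95(samples_ms) <= LATENCY_TARGET_MS
```

In a real deployment the percentile would more likely come from the metrics backend than from raw samples; the sketch only shows what the number means.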
## Development
### Running with Docker

    docker-compose up -d

### Testing

    pytest tests/

## License
MIT
### 2. Create Quickstart Guide
Create `docs/reader/quickstart.md`:
````markdown
# Reader Microservice Quickstart
This guide will help you get started with the khive Reader Microservice for document processing and semantic search.
## 5-Minute Setup
### 1. Install and Configure
```bash
# Install khive with reader components
pip install khive
# Start PostgreSQL and MinIO with Docker
docker run -d --name postgres -p 5432:5432 \
  -e POSTGRES_PASSWORD=postgres -e POSTGRES_USER=postgres -e POSTGRES_DB=khive_reader \
  ankane/pgvector:latest
docker run -d --name minio -p 9000:9000 -p 9001:9001 \
  -e MINIO_ROOT_USER=minioadmin -e MINIO_ROOT_PASSWORD=minioadmin \
  minio/minio server /data --console-address ":9001"
# Set environment variables
export KHIVE_READER_DB_URL="postgresql+asyncpg://postgres:postgres@localhost/khive_reader"
export KHIVE_READER_STORAGE_ENDPOINT="http://localhost:9000"
export KHIVE_READER_STORAGE_ACCESS_KEY="minioadmin"
export KHIVE_READER_STORAGE_SECRET_KEY="minioadmin"
export KHIVE_READER_STORAGE_BUCKET="khive-reader"
```
### 2. Initialize Database
```bash
# Initialize the database schema and extensions
khive db ping  # Verify connection
alembic upgrade head
```
### 3. Ingest a Document
```bash
# Ingest a PDF from the web
khive reader ingest --url https://arxiv.org/pdf/2303.08774.pdf
# Sample output:
# ✓ Document ingested successfully
# Document ID: 123e4567-e89b-12d3-a456-426614174000
```
### 4. Search the Document
```bash
# Search across all documents
khive reader search "large language models"
# Sample output:
# Search results for: "large language models"
# Found 3 matches
#
# 1. [Similarity: 92.5%]
# Document ID: 123e4567-e89b-12d3-a456-426614174000
# Text: Large language models (LLMs) have demonstrated remarkable capabilities in natural language understanding...
```
````
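
The sample search output reports a similarity percentage, but this issue does not say how it is derived. As a hedged illustration, the sketch below assumes cosine similarity between embeddings, scaled to a percentage. pgvector's `<=>` operator returns cosine distance (1 minus cosine similarity), so a stored distance could be converted the same way. The function names are hypothetical and not khive APIs.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two equal-length, non-zero vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def similarity_percent(cosine_distance: float) -> float:
    """Convert a cosine distance (1 - similarity) into a percentage."""
    return round((1.0 - cosine_distance) * 100.0, 1)
```

Under this assumption, a cosine distance of 0.075 would be displayed as a similarity of 92.5%.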
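### 3. Create Example Notebook
Create `docs/reader/examples/basic_usage.ipynb` as a Jupyter notebook with examples of using the Reader Microservice.
### 4. Create Architecture Documentation
Create `docs/reader/architecture.md`:
## Monitoring and Observability
The system uses Prometheus metrics and defined thresholds to monitor:
- Vector count (threshold: 5M vectors)
- Search latency (threshold: 100ms p95)
- Task queue depth (threshold: 1000 tasks)
- Database connection utilization (threshold: 90%)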
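
The thresholds above can be sketched as a small lookup plus a check. This is only an illustration: the metric keys, the `Threshold` dataclass, and the `exceeded` helper are hypothetical names, and treating "exceeded" as strictly greater than the limit is an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Threshold:
    label: str
    limit: float
    unit: str


# Limits copied from the list above; the dictionary keys are made up for this sketch.
THRESHOLDS = {
    "vector_count": Threshold("Vector count", 5_000_000, "vectors"),
    "search_latency_p95_ms": Threshold("Search latency (p95)", 100, "ms"),
    "task_queue_depth": Threshold("Task queue depth", 1000, "tasks"),
    "db_connection_utilization_pct": Threshold("Database connection utilization", 90, "%"),
}


def exceeded(metrics: dict[str, float]) -> list[str]:
    """Return the keys whose current value is above its threshold.

    Metrics missing from the input are treated as zero, i.e. not exceeded.
    """
    return [
        key
        for key, threshold in THRESHOLDS.items()
        if metrics.get(key, 0) > threshold.limit
    ]
```

For example, `exceeded({"vector_count": 6_000_000, "search_latency_p95_ms": 80})` flags only the vector count.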
## Evolution Path
As scale increases, the system can evolve:
1. **Initial State**: Modular monolith with PostgreSQL+pgvector
2. **First Extraction**: Move vector search to dedicated Qdrant service
3. **Second Extraction**: Separate document processing into async workers
4. **Final State**: Full microservice architecture with optimized components
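
One way such an evolution path can stay cheap is to have callers depend on a narrow interface rather than on a concrete store. The sketch below is an assumption about how that boundary might look, not the Reader's actual design: `VectorSearchBackend`, `InMemoryBackend`, and `find_similar` are hypothetical names, and the in-memory class stands in for a pgvector-backed or Qdrant-backed implementation.

```python
from typing import Protocol


class VectorSearchBackend(Protocol):
    """Boundary the rest of the service would depend on (hypothetical)."""

    def search(self, query_vector: list[float], top_k: int) -> list[tuple[str, float]]:
        """Return (document_id, score) pairs, best match first."""
        ...


class InMemoryBackend:
    """Toy backend that scores by dot product; a stand-in for a real store."""

    def __init__(self, vectors: dict[str, list[float]]) -> None:
        self._vectors = vectors

    def search(self, query_vector: list[float], top_k: int) -> list[tuple[str, float]]:
        scored = [
            (doc_id, sum(q * v for q, v in zip(query_vector, vector)))
            for doc_id, vector in self._vectors.items()
        ]
        scored.sort(key=lambda pair: pair[1], reverse=True)
        return scored[:top_k]


def find_similar(backend: VectorSearchBackend, query_vector: list[float], top_k: int = 3) -> list[str]:
    """Caller code only sees the protocol, so the backend can be swapped later."""
    return [doc_id for doc_id, _ in backend.search(query_vector, top_k)]
```

Moving vector search to a separate service would then mean adding another class that satisfies the same protocol, leaving `find_similar` unchanged.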
### 5. Update Main khive README
Update the main `README.md` to include information about the Reader Microservice.

Add to the existing `README.md` under "Command Catalogue":

| Command        | What it does                                                                                                       |
|----------------|--------------------------------------------------------------------------------------------------------------------|
| `khive reader` | Document processing and semantic search engine with vector embeddings. Ingest documents and find similar content. |

Add to the "Usage Examples" section:

    # ingest a document from URL
    khive reader ingest --url https://example.com/document.pdf
    # search across all documents
    khive reader search "quantum computing applications"
    # check reader performance thresholds
    khive reader performance

## Testing
Ensure documentation is accurate by:
- Running all example commands to verify they work as described
- Validating that architectural diagrams match the implemented system
- Testing installation instructions on a fresh environment
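
Running every example command by hand is easy to skip, so part of the first check could be automated. The following is a minimal sketch, assuming the docs keep their commands in fenced `bash` blocks: it collects the non-comment lines so they can be reviewed or executed in a fresh environment. `extract_commands` is a hypothetical helper, not an existing khive tool.

```python
import re

FENCE = "`" * 3  # built this way so the example can sit inside a fenced block
BASH_BLOCK_RE = re.compile(FENCE + r"bash\n(.*?)" + FENCE, re.DOTALL)


def extract_commands(markdown: str) -> list[str]:
    """Collect shell commands from fenced bash blocks, skipping comment lines."""
    commands: list[str] = []
    for block in BASH_BLOCK_RE.findall(markdown):
        # Join backslash-continued lines into a single command.
        block = re.sub(r"\s*\\\n\s*", " ", block)
        for line in block.splitlines():
            line = line.strip()
            if line and not line.startswith("#"):
                commands.append(line)
    return commands
```

The returned list could be fed to a test that runs each command against a disposable environment; trailing inline comments are left in place, which a shell ignores.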
## Acceptance Criteria
- [x] Main README for Reader Microservice
- [x] Quickstart guide with step-by-step instructions
- [x] Example notebook for interactive usage
- [x] Architecture documentation with diagrams
- [x] Update to main khive README
- [x] All examples tested and verified to work
## Dependencies
- #23 Add Pydapter core & pgvector plugin
- #24 Define canonical domain models for Reader Microservice
- #25 Bootstrap persistence layer with Pydapter repositories
- #26 Implement khive reader ingest command
- #27 Implement background worker and document processing pipeline
This issue is currently out of scope for today's focus on core async, connections, endpoints, and their documentation. It will be addressed in a future iteration focusing on the Reader Microservice.
## Implementation Details
### 1. Create main README for Reader Microservice
Create `docs/reader/README.md` with the following sections:
- Installation
  - Prerequisites
  - Setup
- Usage
  - Ingesting Documents
  - Searching Documents
  - Monitoring Performance
- Performance Considerations
- Development
  - Running with Docker
  - Testing
- License: MIT
### 2. Create Quickstart Guide
Create `docs/reader/quickstart.md` with these steps and sections:
1. Install and Configure
2. Initialize Database
3. Ingest a Document
4. Search the Document

## Next Steps
## Troubleshooting
### Common Issues
- **PostgreSQL Connection Error**: Ensure PostgreSQL is running and the connection URL is correct.
- **MinIO Connection Error**: Verify that MinIO is running and the credentials are correct.
- **Embedding Generation Error**: Set `OPENAI_API_KEY`, or verify connectivity to the OpenAI API.
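
For the two connection errors, a quick way to tell "service not running" apart from "wrong credentials" is to test plain TCP reachability first. A minimal sketch using only the standard library follows; `can_connect` is a hypothetical helper, and the ports are the ones the quickstart containers publish (5432 for PostgreSQL, 9000 for MinIO).

```python
import socket


def can_connect(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts, and name resolution failures.
        return False
```

Calling `can_connect("localhost", 5432)` and `can_connect("localhost", 9000)` before retrying a command shows whether the containers are reachable at all; if both succeed, the connection URL or credentials are the more likely cause.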
### Getting Help
If you encounter issues, please:
### 3. Create Example Notebook
Create `docs/reader/examples/basic_usage.ipynb` as a Jupyter notebook with examples of using the Reader Microservice.
### 4. Create Architecture Documentation
Create `docs/reader/architecture.md`:
These models are automatically converted to SQLAlchemy tables for PostgreSQL.
## Document Ingestion Flow
`khive reader ingest --url <url>`
## Document Processing Flow
## Search Flow
`khive reader search "<query>"`
## Data Access Layer
The data access layer uses Pydapter repositories:
## Monitoring and Observability
The system uses Prometheus metrics and defined thresholds to monitor:
## Evolution Path
As scale increases, the system can evolve:
### 5. Update Main khive README
Update the main `README.md` to include information about the Reader Microservice:
## Testing
Ensure documentation is accurate by:
## Acceptance Criteria
## Dependencies
- #23 Add Pydapter core & pgvector plugin
- #24 Define canonical domain models for Reader Microservice
- #25 Bootstrap persistence layer with Pydapter repositories
- #26 Implement khive reader ingest command
- #27 Implement background worker and document processing pipeline
- #28 Implement khive reader search command
- #29 Add observability and performance thresholds
## Estimated Effort
Medium (2 days)