Indexify simplifies building and serving durable, multi-stage workflows as interconnected Python functions, and automagically deploys them as APIs.
A workflow encodes data ingestion and transformation stages that can be implemented using Python functions. Each of these functions is a logical compute unit that can be retried upon failure or assigned to specific hardware.
To give you a taste of the project, the video above shows Indexify running PDF extraction on a cluster of 3 machines: top left, a GPU-accelerated machine running document layout and OCR models on a PDF; bottom left, chunking text and embedding images and text using CLIP and a text embedding model; top right, a function writing the image and text embeddings to ChromaDB. All three functions of the workflow run in parallel, coordinated by the Indexify server.
> [!NOTE]
> Indexify is the open-source core compute engine that powers Tensorlake's Serverless Workflow Engine for processing unstructured data.
Indexify is a versatile data processing framework that supports many kinds of unstructured data use cases.
Install Indexify's SDK and CLI into your development environment:

```bash
pip install indexify
```
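To confirm the install, you can query pip for the package. This is a generic pip check, not an Indexify-specific command:

```bash
# Print the installed indexify package's version and metadata
pip show indexify
```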
Define a workflow by implementing its data transformations as composable Python functions decorated with `@indexify_function()`. These functions are connected into a `Graph`, which is the representation of a compute graph.
Functions serve as discrete units within a Graph, defining the boundaries for retry attempts and resource allocation; this lets you separate computationally heavy tasks like LLM inference from lightweight ones like database writes.
The example below is a document ingestion pipeline: it parses a PDF, chunks the extracted text, embeds the chunks, and writes the embeddings to a vector database.
```python
from typing import List

from pydantic import BaseModel

# `File` (the SDK's file input type) also needs to be imported; see the SDK
# docs for its import path.
from indexify import Graph, indexify_function

class Document(BaseModel):
    pages: List[str]

# Parse a PDF and extract its pages as text
@indexify_function()
def process_document(file: File) -> Document:
    # Run document layout and OCR models over the PDF
    ...

class TextChunk(BaseModel):
    chunk: str
    page_number: int

# Chunk the pages for embedding and retrieval
@indexify_function()
def chunk_document(document: Document) -> List[TextChunk]:
    # Split each page into smaller chunks
    ...

# Output type for embed_and_write: the chunk plus its embedding vector
class ChunkEmbedding(BaseModel):
    chunk: TextChunk
    embedding: List[float]

# Embed a single chunk and write it to the vector database.
# Note (Automatic Map): Indexify automatically parallelizes a function when it
# consumes one element from an upstream function that produces a List.
@indexify_function()
def embed_and_write(chunk: TextChunk) -> ChunkEmbedding:
    # Run an embedding model on the chunk, then write the result to the database
    ...

# Construct a compute graph connecting the three functions above into a
# workflow that runs them as a pipeline
graph = Graph(name="document_ingestion_pipeline", start_node=process_document, description="...")
graph.add_edge(process_document, chunk_document)
graph.add_edge(chunk_document, embed_and_write)
```
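To try the pipeline locally, a run might look like the sketch below. The `graph.run` call, its `block_until_done` flag and `file` argument name, `graph.output`, and the `pdf_file` placeholder are assumptions based on the SDK's documented local-run flow; check the docs for the exact signatures:

```python
# `pdf_file` stands in for however the SDK's File type is constructed from a
# local PDF. Run the whole pipeline in-process, then fetch the outputs of the
# final function by name.
invocation_id = graph.run(block_until_done=True, file=pdf_file)
embeddings = graph.output(invocation_id, "embed_and_write")
print(f"Computed {len(embeddings)} chunk embeddings")
```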
Read the Docs to learn more about how to test, deploy and create API endpoints for Workflows.
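As a rough deployment sketch, deploying turns the same graph into a remotely invocable API. The `RemoteGraph` import and its `deploy` method are assumptions drawn from the SDK docs; verify them before use:

```python
from indexify import RemoteGraph  # assumed import; verify against the docs

# Register the graph with a running Indexify server so it can be invoked
# remotely as an API endpoint (pdf_file as in the local-run sketch above).
remote_graph = RemoteGraph.deploy(graph)
invocation_id = remote_graph.run(block_until_done=True, file=pdf_file)
```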