DocETL: Powering Complex Document Processing Pipelines

DocETL Figure

DocETL is a tool for creating and executing data processing pipelines, especially suited for complex document processing tasks. It offers a low-code, declarative YAML interface to define LLM-powered operations on complex data.

When to Use DocETL

DocETL is the ideal choice when you're looking to maximize correctness and output quality for complex tasks over a collection of documents or unstructured datasets. You should consider using DocETL if:

You want to perform semantic processing on a collection of data
You have complex tasks that you want to represent via map-reduce
You're unsure how to best express your task to maximize LLM accuracy
You're working with long documents that don't fit into a single prompt
You have validation criteria and want tasks to automatically retry when validation fails

Community Projects

Educational Resources

Installation

Prerequisites

Python 3.10 or later
OpenAI API key

Quick Start

Install from PyPI:
```
pip install docetl
```

To see examples of how to use DocETL, check out the tutorial.

Running the UI Locally

We offer a simple UI for building pipelines. We recommend building up complex pipelines one operation at a time, so you can see the results of each operation as you go and iterate on your pipeline. To run it locally, follow these steps:

Playground Screenshot

Clone the repository:

git clone https://github.com/ucbepic/docetl.git
cd docetl

Set up environment variables in .env in the root/top-level directory:

OPENAI_API_KEY=your_api_key_here
BACKEND_ALLOW_ORIGINS=
BACKEND_HOST=localhost
BACKEND_PORT=8000
BACKEND_RELOAD=True
FRONTEND_HOST=0.0.0.0
FRONTEND_PORT=3000

And create an .env.local file in the website directory with the following:

OPENAI_API_KEY=sk-xxx
OPENAI_API_BASE=https://api.openai.com/v1
MODEL_NAME=gpt-4o-mini

NEXT_PUBLIC_BACKEND_HOST=localhost
NEXT_PUBLIC_BACKEND_PORT=8000

Install dependencies:

make install      # Install Python package
make install-ui   # Install UI dependencies

Note that the openai api key, base, and model name are for the UI assistant only; not the DocETL pipeline execution engine.

Start the development server:
```
make run-ui-dev
```
Visit http://localhost:3000/playground

Development Setup

If you're planning to contribute or modify DocETL, you can verify your setup by running the test suite:

make tests-basic  # Runs basic test suite (costs < $0.01 with OpenAI)

For detailed documentation and tutorials, visit our documentation.

ucbepic / docetl

readme