DocETL is a tool for creating and executing data processing pipelines, especially suited for complex document processing tasks. It offers a low-code, declarative YAML interface to define LLM-powered operations on complex data.
DocETL is the ideal choice when you're looking to maximize correctness and output quality for complex tasks over a collection of documents or unstructured datasets. You should consider using DocETL if:
pip install docetl
To see examples of how to use DocETL, check out the tutorial.
We offer a simple UI for building pipelines. We recommend building up complex pipelines one operation at a time, so you can see the results of each operation as you go and iterate on your pipeline. To run it locally, follow these steps:
Clone the repository:
git clone https://github.com/ucbepic/docetl.git
cd docetl
Set up environment variables in a .env file in the root (top-level) directory:
OPENAI_API_KEY=your_api_key_here
BACKEND_ALLOW_ORIGINS=
BACKEND_HOST=localhost
BACKEND_PORT=8000
BACKEND_RELOAD=True
FRONTEND_HOST=0.0.0.0
FRONTEND_PORT=3000
Then create a .env.local file in the website directory with the following:
OPENAI_API_KEY=sk-xxx
OPENAI_API_BASE=https://api.openai.com/v1
MODEL_NAME=gpt-4o-mini
NEXT_PUBLIC_BACKEND_HOST=localhost
NEXT_PUBLIC_BACKEND_PORT=8000
Install dependencies:
make install # Install Python package
make install-ui # Install UI dependencies
Note that the OpenAI API key, API base, and model name in .env.local are used only by the UI assistant, not by the DocETL pipeline execution engine.
Start the development server:
make run-ui-dev
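If the UI does not come up as expected, one thing worth ruling out is a missing entry in the two env files from the steps above. The following is a minimal sketch, not part of DocETL: the helper name missing_env_keys is hypothetical, and it only checks that a KEY= line exists (an empty value such as BACKEND_ALLOW_ORIGINS= still counts as present).

```shell
#!/usr/bin/env bash
# Hypothetical helper (not shipped with DocETL): report keys that have
# no "KEY=" line in an env file.
# Usage: missing_env_keys FILE KEY...
missing_env_keys() {
  local file="$1"; shift
  local key status=0
  for key in "$@"; do
    # A key counts as present if some line starts with "KEY=".
    if ! grep -q "^${key}=" "$file" 2>/dev/null; then
      echo "missing: ${key} (${file})"
      status=1
    fi
  done
  # Non-zero exit status if at least one key was missing.
  return "$status"
}
```

For example, from the repository root you could run missing_env_keys .env OPENAI_API_KEY BACKEND_HOST BACKEND_PORT FRONTEND_HOST FRONTEND_PORT, and then missing_env_keys website/.env.local NEXT_PUBLIC_BACKEND_HOST NEXT_PUBLIC_BACKEND_PORT. No output and a zero exit status mean every listed key has an entry.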
If you're planning to contribute or modify DocETL, you can verify your setup by running the test suite:
make tests-basic # Runs basic test suite (costs < $0.01 with OpenAI)
For detailed guides and tutorials, visit our documentation.