ucbepic / docetl

A system for agentic LLM-powered data processing and ETL
https://docetl.org
MIT License

Support batching in map operations #7

Closed · shreyashankar closed 2 weeks ago

shreyashankar commented 2 months ago

Support Batching in Map Operations

Background

Currently, map operations execute one LLM call per input document. For very small documents, it may be more efficient to process multiple documents in a single call. We want to introduce batching capabilities to our map operations.

Goal

Implement batching support in map operations to potentially improve performance and reduce costs when dealing with small documents.

To-Do List

  1. Modify the map operation interface to include batch-related parameters:

    • Add batch_size parameter
    • Add clustering_method parameter (options: 'random', 'sem_cluster')
  2. Implement batching logic in map operations:

    • Group documents based on batch_size and clustering_method
    • Modify LLM call to handle batched inputs
    • Ensure output is correctly mapped back to individual documents
  3. Implement clustering methods:

    • Random
    • Semantic clustering (consider using embeddings)
  4. Update the optimizer (builder.py) to handle batch size optimization:

    • Add logic to find the ideal batch size that doesn't compromise accuracy
    • Implement a method to evaluate accuracy vs. batch size
  5. Modify the YAML config format to support batch-related parameters:

    • Add fields for batch_size and clustering_method
    • Ensure backwards compatibility (e.g., existing configs that use validation and gleaning should still work unchanged)
  6. Write unit tests for batching functionality:

    • Test different batch sizes
    • Test clustering methods
    • Test accuracy preservation
  7. Update documentation:

    • Add explanation of batching in map operations
    • Provide examples of how to use and configure batching
    • Document the trade-offs and considerations for batch size selection
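The grouping step in items 2 and 3 could be sketched as follows. This is a minimal illustration, not docetl's implementation: `batch_documents` and `embed` are hypothetical names, and sorting by a 1-D embedding projection is a crude stand-in for real semantic clustering.

```python
import random


def batch_documents(docs, batch_size, clustering_method="random", embed=None):
    """Group docs into batches of at most batch_size.

    clustering_method:
      - "random": shuffle, then chunk.
      - "sem_cluster": sort by a scalar projection of each document's
        embedding so that similar documents land in the same batch
        (a toy stand-in for real embedding-based clustering).
    """
    if clustering_method == "random":
        ordered = docs[:]
        random.shuffle(ordered)
    elif clustering_method == "sem_cluster":
        if embed is None:
            raise ValueError("sem_cluster requires an embed function")
        ordered = sorted(docs, key=embed)
    else:
        raise ValueError(f"unknown clustering_method: {clustering_method}")
    return [ordered[i:i + batch_size] for i in range(0, len(ordered), batch_size)]
```

With `batch_size: 1` this degenerates to the current one-call-per-document behavior, which is one way to keep the default path unchanged.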

(Proposed) Config Example

operations:
  - type: map
    name: classify_sentiment
    batch_size: 10
    clustering_method: sem_cluster
    model: gpt-4o-mini
    prompt: "Classify the sentiment of the following text: {{ input.text }}"
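For item 5's backwards-compatibility requirement, one approach is to default an omitted `batch_size` to 1 (the current per-document behavior) so existing configs parse unchanged. A hedged sketch, with `validate_batching_config` as a hypothetical helper name:

```python
def validate_batching_config(op: dict) -> dict:
    """Fill in defaults so existing configs keep working unchanged."""
    op = dict(op)  # don't mutate the caller's config
    batch_size = op.setdefault("batch_size", 1)  # 1 == current per-doc behavior
    method = op.setdefault("clustering_method", "random")
    if not isinstance(batch_size, int) or batch_size < 1:
        raise ValueError("batch_size must be a positive integer")
    if method not in ("random", "sem_cluster"):
        raise ValueError("clustering_method must be 'random' or 'sem_cluster'")
    return op
```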


shreyashankar commented 1 month ago

We currently have max_batch_size as a cap on parallelism, but we don't yet combine multiple map-operation LLM calls into a single call. Doing so requires prompt engineering to fit all the operations into one prompt, plus extra validation to ensure an output exists for every document in the batch.

This is something the optimizer may have to do.
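The prompt-engineering and validation steps described above could look roughly like this. A sketch only: the function names are hypothetical, docetl prompts are Jinja templates while a plain format string stands in here, and the JSON response contract (doc number → answer) is an assumption, not the project's actual protocol.

```python
import json


def build_batched_prompt(per_doc_prompt_template, docs):
    """Concatenate per-document prompts under numbered ids and ask for
    a JSON object mapping each id to its answer, so that results can
    be mapped back to individual documents."""
    parts = [
        f"[doc {i}] " + per_doc_prompt_template.format(text=doc["text"])
        for i, doc in enumerate(docs)
    ]
    return (
        "Answer for each document below. Respond with a JSON object "
        "mapping the doc number to its answer.\n" + "\n".join(parts)
    )


def parse_batched_response(raw, num_docs):
    """Validate that every document in the batch got an output; a
    caller could retry missing documents individually."""
    outputs = json.loads(raw)
    missing = [i for i in range(num_docs) if str(i) not in outputs]
    if missing:
        raise ValueError(f"missing outputs for docs: {missing}")
    return [outputs[str(i)] for i in range(num_docs)]
```

The per-document validation is the key piece: a batched call that silently drops one document's output would corrupt the mapping back to inputs.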