ucbepic / docetl

A system for agentic LLM-powered data processing and ETL
https://docetl.org
MIT License

Support batching in map operations #7

Closed · shreyashankar closed 2 weeks ago

shreyashankar commented 2 months ago

Support Batching in Map Operations

Background

Currently, map operations execute one LLM call per input document. For very small documents, it may be more efficient to process multiple documents in a single call. We want to introduce batching capabilities to our map operations.

Goal

Implement batching support in map operations to potentially improve performance and reduce costs when dealing with small documents.

To-Do List

  1. Modify the map operation interface to include batch-related parameters:

    • Add batch_size parameter
    • Add clustering_method parameter (options: 'random', 'sem_cluster')
  2. Implement batching logic in map operations:

    • Group documents based on batch_size and clustering_method
    • Modify LLM call to handle batched inputs
    • Ensure output is correctly mapped back to individual documents
  3. Implement clustering methods:

    • Random
    • Semantic clustering (consider using embeddings)
  4. Update the optimizer (builder.py) to handle batch size optimization:

    • Add logic to find the ideal batch size that doesn't compromise accuracy
    • Implement a method to evaluate accuracy vs. batch size
  5. Modify the YAML config format to support batch-related parameters:

    • Add fields for batch_size and clustering_method
    • Ensure backwards compatibility (e.g., existing configs that use validation and gleaning should still work unchanged)
  6. Write unit tests for batching functionality:

    • Test different batch sizes
    • Test clustering methods
    • Test accuracy preservation
  7. Update documentation:

    • Add explanation of batching in map operations
    • Provide examples of how to use and configure batching
    • Document the trade-offs and considerations for batch size selection
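The grouping step in items 2 and 3 could be sketched as follows. This is a minimal illustration, not docetl's implementation: `batch_documents` and `embed` are hypothetical names, and sorting by a 1-D embedding projection is a crude stand-in for real semantic clustering.

```python
import random


def batch_documents(docs, batch_size, clustering_method="random", embed=None):
    """Group docs into batches of at most batch_size.

    clustering_method:
      - "random": shuffle, then chunk.
      - "sem_cluster": sort by a scalar projection of each document's
        embedding so that similar documents land in the same batch
        (a toy stand-in for real embedding-based clustering).
    """
    if clustering_method == "random":
        ordered = docs[:]
        random.shuffle(ordered)
    elif clustering_method == "sem_cluster":
        if embed is None:
            raise ValueError("sem_cluster requires an embed function")
        ordered = sorted(docs, key=embed)
    else:
        raise ValueError(f"unknown clustering_method: {clustering_method}")
    return [ordered[i:i + batch_size] for i in range(0, len(ordered), batch_size)]
```

With `batch_size: 1` this degenerates to the current one-call-per-document behavior, which is one way to keep the default path unchanged.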

(Proposed) Config Example

operations:
  - type: map
    name: classify_sentiment
    batch_size: 10
    clustering_method: sem_cluster
    model: gpt-4o-mini
    prompt: "Classify the sentiment of the following text: {{ input.text }}"
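For item 5's backwards-compatibility requirement, one approach is to default an omitted `batch_size` to 1 (the current per-document behavior) so existing configs parse unchanged. A hedged sketch, with `validate_batching_config` as a hypothetical helper name:

```python
def validate_batching_config(op: dict) -> dict:
    """Fill in defaults so existing configs keep working unchanged."""
    op = dict(op)  # don't mutate the caller's config
    batch_size = op.setdefault("batch_size", 1)  # 1 == current per-doc behavior
    method = op.setdefault("clustering_method", "random")
    if not isinstance(batch_size, int) or batch_size < 1:
        raise ValueError("batch_size must be a positive integer")
    if method not in ("random", "sem_cluster"):
        raise ValueError("clustering_method must be 'random' or 'sem_cluster'")
    return op
```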


shreyashankar commented 1 month ago

We currently have max_batch_size as a cap on parallelism, but we don't yet combine multiple map-operation LLM calls into a single call. Doing so requires prompt engineering to fit all the operations into one prompt, plus extra validation to ensure an output exists for every document in the batch.

This is something the optimizer may have to do.
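The prompt-engineering and validation steps described above could look roughly like this. A sketch only: the function names are hypothetical, docetl prompts are Jinja templates while a plain format string stands in here, and the JSON response contract (doc number → answer) is an assumption, not the project's actual protocol.

```python
import json


def build_batched_prompt(per_doc_prompt_template, docs):
    """Concatenate per-document prompts under numbered ids and ask for
    a JSON object mapping each id to its answer, so that results can
    be mapped back to individual documents."""
    parts = [
        f"[doc {i}] " + per_doc_prompt_template.format(text=doc["text"])
        for i, doc in enumerate(docs)
    ]
    return (
        "Answer for each document below. Respond with a JSON object "
        "mapping the doc number to its answer.\n" + "\n".join(parts)
    )


def parse_batched_response(raw, num_docs):
    """Validate that every document in the batch got an output; a
    caller could retry missing documents individually."""
    outputs = json.loads(raw)
    missing = [i for i in range(num_docs) if str(i) not in outputs]
    if missing:
        raise ValueError(f"missing outputs for docs: {missing}")
    return [outputs[str(i)] for i in range(num_docs)]
```

The per-document validation is the key piece: a batched call that silently drops one document's output would corrupt the mapping back to inputs.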