ucbepic / docetl

A system for agentic LLM-powered data processing and ETL
https://docetl.org
MIT License

Implement LLM-based Document Splitting #78

Open shreyashankar opened 15 hours ago

shreyashankar commented 15 hours ago

As requested by a member of the community, it would be cool to implement a new feature for splitting documents using an LLM instead of our current token- or delimiter-based methods. This would allow for more intelligent and context-aware splitting of documents.

Proposed Idea

  1. Implement an LLM "scan" operation that can process a document and determine contiguous splits based on specified criteria.
  2. Allow users to provide a split_criteria_prompt that describes how to split the document (e.g., by topic).
  3. Use a scratchpad technique (similar to our reduce operation) to manage internal state/memory while splitting.
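To make the idea concrete, here is a sketch of what one scan call's structured output might look like. The schema (`split_phrases`, `scratchpad`) and field names are assumptions for illustration, not a settled design: the model returns verbatim phrases to search for in the document, plus a scratchpad string carried into the next call, mirroring the internal-state technique of our reduce operation.

```python
import json

# Hypothetical structured output from one LLM "scan" call (assumed schema).
example_output = json.loads("""
{
  "split_phrases": ["In the next section, we turn to"],
  "scratchpad": "Current topic: pricing; 2 splits emitted so far"
}
""")

def validate_scan_output(output):
    """Check the fields the splitter would rely on under this assumed schema."""
    assert isinstance(output.get("split_phrases"), list)
    assert all(isinstance(p, str) and p for p in output["split_phrases"])
    assert isinstance(output.get("scratchpad"), str)
    return output
```

A validation step like this would let the operation reject malformed model output before searching the document for split points.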

Technical Approach

  1. Feed as much text as possible into the LLM.
  2. Ask the LLM to output:
    • As many split points as it's confident in (phrases of 5-10 tokens that we can search for in the document to locate the splits)
    • Any memory/state it wants to keep track of for splitting the next part of the document
  3. Remove processed chunks from the document.
  4. Repeat the process until the entire document is processed.
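The loop above can be sketched roughly as follows. Everything here is hypothetical: `call_llm` stands in for the real model call, the chars-per-token estimate is a placeholder, and the returned schema (`split_phrases`, `scratchpad`) is assumed, not part of any existing docetl API.

```python
def llm_split(document, split_criteria_prompt, call_llm, window_tokens=4000):
    """Iteratively scan `document`, asking the LLM for split phrases.

    `call_llm(window, criteria, scratchpad)` is a placeholder for the real
    model call; it is assumed to return a dict with `split_phrases` (short
    verbatim phrases from the window) and `scratchpad` (carried-over state).
    """
    chunks, scratchpad, remaining = [], "", document
    while remaining:
        # 1. Feed as much text as fits (rough 4-chars-per-token estimate).
        window = remaining[: window_tokens * 4]
        result = call_llm(window, split_criteria_prompt, scratchpad)
        scratchpad = result.get("scratchpad", "")
        progressed = False
        # 2. Locate each returned phrase and cut a chunk at its position.
        for phrase in result.get("split_phrases", []):
            idx = remaining.find(phrase)
            if idx > 0:
                chunks.append(remaining[:idx])
                remaining = remaining[idx:]
                progressed = True
        # 3. If no usable split was found, emit the window as one chunk
        #    so the loop always terminates.
        if not progressed:
            chunks.append(remaining[: len(window)])
            remaining = remaining[len(window):]
    return chunks
```

Searching for the phrase with `find` keeps splits anchored to the source text even if the model paraphrases elsewhere; a phrase that cannot be located is simply skipped.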

Considerations

  1. Splitting strategy:
    • All splits in one call
    • One split at a time
    • K splits at a time
    • As many splits as the LLM can confidently provide
  2. Balancing split quality with processing efficiency
  3. Handling very large documents that exceed LLM context limits
  4. Ensuring consistency in splitting criteria across multiple LLM calls

Proposed Interface Design

```yaml
operations:
  - name: llm_split
    type: split
    split_criteria: "split by theme discussed"
```
staru09 commented 2 hours ago

How about going with this (https://python.langchain.com/docs/how_to/#text-splitters) and then using it with some LLM for splitting?