rgarfield11 / text_chunker_ex

A library for semantically coherent text chunking
MIT License

Engineer - Text Chunker - Integrate Splitting Strategies #1

Open rgarfield11 opened 4 months ago

rgarfield11 commented 4 months ago

Background

Currently, our text-splitting library 'Text Chunker' uses a single strategy for splitting text: Recursive Split. This method was adapted from LangChain, and while it is effective, it does not cover the text-splitting needs of every use case. To close this gap and give users more flexibility, we plan to introduce alternative splitting strategies that accommodate a wider range of text types and splitting requirements.

Acceptance Criteria

Scenario: Implement new text-splitting strategies

Given the “Text Chunker” library currently supports only "Recursive Split"

rgarfield11 commented 4 months ago

To integrate multiple splitting strategies into the Text Chunker library, we should define additional modules that implement the Chunker.SplitterBehaviour. These new modules will be SimpleDelimiterSplit, RegularExpressionSplit, and FixedLengthSplit. Let's first define these modules with basic implementations, then update the TextChunker module to support strategy selection.
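The behaviour itself is not shown in this issue. As a rough sketch of the contract the new modules would have to satisfy, assuming it consists of a single `split/2` callback that returns chunk structs, it could look like the following. All `Sketch.*` names are hypothetical stand-ins, chosen so they do not clash with the library's own `Chunker.Chunk` and `Chunker.SplitterBehaviour`.

```elixir
# Hypothetical stand-in for the library's chunk struct (assumption: it has a :text field).
defmodule Sketch.Chunk do
  defstruct [:text]
  @type t :: %__MODULE__{text: String.t()}
end

# Hypothetical stand-in for Chunker.SplitterBehaviour.
defmodule Sketch.SplitterBehaviour do
  # Every strategy takes the text plus a keyword list of options
  # and returns a list of chunk structs.
  @callback split(text :: binary(), opts :: keyword()) :: [Sketch.Chunk.t()]
end

# A trivial strategy that satisfies the contract: one chunk per non-empty line.
defmodule Sketch.LineSplit do
  @behaviour Sketch.SplitterBehaviour

  @impl true
  def split(text, _opts) do
    text
    |> String.split("\n", trim: true)
    |> Enum.map(&%Sketch.Chunk{text: &1})
  end
end
```

Marking each implementation with `@impl true` lets the compiler warn when a strategy's function name or arity drifts away from the callback.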

First, we define the new splitting strategies.

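defmodule Chunker.Splitters.SimpleDelimiterSplit do
  @behaviour Chunker.SplitterBehaviour

  @impl true
  def split(text, opts) do
    delimiter = opts[:delimiter] || "\n"
    # Here :chunk_size is the number of delimited pieces joined into each chunk.
    chunk_size = opts[:chunk_size]

    text
    |> String.split(delimiter)
    |> Enum.map(&String.trim/1) # Trim surrounding whitespace from each piece
    |> Enum.reject(&(&1 == "")) # Drop empty pieces
    |> Enum.chunk_every(chunk_size) # Group pieces; the last group may be shorter
    |> Enum.map(&Enum.join(&1, delimiter)) # Rebuild each group into one string
    |> Enum.map(&%Chunker.Chunk{text: &1})
  end
end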

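defmodule Chunker.Splitters.RegularExpressionSplit do
  @behaviour Chunker.SplitterBehaviour

  @impl true
  def split(text, opts) do
    regex = opts[:regex] || ~r/\s+/
    # Here :chunk_size is the number of matched pieces joined into each chunk.
    chunk_size = opts[:chunk_size]

    # Regex.split/2 takes the regex first, so pipe from the regex, not the text.
    regex
    |> Regex.split(text)
    |> Enum.map(&String.trim/1)
    |> Enum.reject(&(&1 == "")) # Drop empty pieces
    |> Enum.chunk_every(chunk_size) # Group pieces; the last group may be shorter
    |> Enum.map(&Enum.join(&1, " ")) # Rebuild each group into one string
    |> Enum.map(&%Chunker.Chunk{text: &1})
  end
end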

defmodule Chunker.Splitters.FixedLengthSplit do
  @behaviour Chunker.SplitterBehaviour

  @impl true
  def split(text, opts) do
    chunk_size = opts[:chunk_size]

    # Cut the text into runs of at most chunk_size code points. Use
    # String.graphemes/1 instead if multi-codepoint characters must stay intact.
    text
    |> String.codepoints()
    |> Enum.chunk_every(chunk_size) # The last run may be shorter
    |> Enum.map(&Enum.join/1)
    |> Enum.map(&%Chunker.Chunk{text: &1})
  end
end

Next, we update Chunker.TextChunker to allow specification of the splitting strategy:

defmodule Chunker.TextChunker do
  # Only the default strategy is referenced here; the others are passed in by callers.
  alias Chunker.Splitters.RecursiveSplit
  alias Chunker.Chunk

  # :strategy names the module (implementing Chunker.SplitterBehaviour) that does the splitting
  @default_opts [
    chunk_size: 2000,
    chunk_overlap: 200,
    strategy: RecursiveSplit,
    format: :plaintext
  ]

  @spec split(binary(), keyword()) :: [Chunk.t()]
  def split(text, opts \\ []) do
    opts = Keyword.merge(@default_opts, opts)
    strategy_module = opts[:strategy]
    # Call the split function of the given strategy module
    strategy_module.split(text, opts)
  end
end
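To make the dispatch concrete, here is a minimal, self-contained sketch of the same pattern with an added check that the chosen module can actually act as a strategy. The `Sketch.*` modules are hypothetical and return plain strings rather than `%Chunk{}` structs to keep the example short; the check itself is a suggestion, not something the proposal above includes.

```elixir
# Hypothetical strategies used only for illustration.
defmodule Sketch.WordSplit do
  # One piece per whitespace-separated word.
  def split(text, _opts), do: String.split(text)
end

defmodule Sketch.SentenceSplit do
  # Split after sentence-ending punctuation followed by whitespace.
  def split(text, _opts), do: String.split(text, ~r/(?<=[.!?])\s+/, trim: true)
end

defmodule Sketch.Dispatcher do
  @default_opts [chunk_size: 2000, strategy: Sketch.WordSplit]

  def split(text, opts \\ []) do
    # Caller-supplied options win over the defaults.
    opts = Keyword.merge(@default_opts, opts)
    strategy = Keyword.fetch!(opts, :strategy)

    # Fail early with a clear message if the module cannot act as a strategy.
    if is_atom(strategy) and Code.ensure_loaded?(strategy) and
         function_exported?(strategy, :split, 2) do
      strategy.split(text, opts)
    else
      raise ArgumentError, "#{inspect(strategy)} does not export split/2"
    end
  end
end
```

With these definitions, `Sketch.Dispatcher.split("one two")` uses the default strategy and returns `["one", "two"]`, while passing `strategy: Sketch.SentenceSplit` switches the behaviour without touching the dispatcher.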

Now, let's update the documentation of Chunker.TextChunker to instruct users on how to specify the splitting strategy:

  @doc """
  Splits the provided text into a list of `%Chunk{}` structs.

  ## Options
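  * `:chunk_size` - Maximum size in code point length for each chunk. For `SimpleDelimiterSplit` and `RegularExpressionSplit`, the number of split pieces joined into each chunk.
  * `:chunk_overlap` - Number of overlapping code points between consecutive chunks to preserve context. Not used by the basic `SimpleDelimiterSplit`, `RegularExpressionSplit`, and `FixedLengthSplit` implementations.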
  * `:strategy` - A module that implements the `Chunker.SplitterBehaviour`. Use `Chunker.Splitters.RecursiveSplit`, `Chunker.Splitters.SimpleDelimiterSplit`, `Chunker.Splitters.RegularExpressionSplit`, or `Chunker.Splitters.FixedLengthSplit`.
  * `:format` - The format of the input text. Used to determine where to split the text in some strategies.
  * `:delimiter` - Used by `SimpleDelimiterSplit`: the string on which to split the text. Defaults to a newline.
  * `:regex` - Used by `RegularExpressionSplit`: the regular expression on which to split the text. Defaults to splitting on runs of whitespace.

  ## Examples

      long_text = "This is a very long text that needs to be split into smaller pieces for easier handling."

      # Default recursive splitting
      Chunker.TextChunker.split(long_text)

      # Simple delimiter-based splitting
      Chunker.TextChunker.split(long_text, strategy: Chunker.Splitters.SimpleDelimiterSplit, delimiter: ".")

      # Regular expression-based splitting
      Chunker.TextChunker.split(long_text, strategy: Chunker.Splitters.RegularExpressionSplit, regex: ~r/[.!?]/)

      # Fixed length splitting
      Chunker.TextChunker.split(long_text, strategy: Chunker.Splitters.FixedLengthSplit, chunk_size: 10)

      # Generates many smaller chunks with significant overlap
      Chunker.TextChunker.split(long_text, chunk_size: 10, chunk_overlap: 3, strategy: Chunker.Splitters.RecursiveSplit)

  """

Please note that, depending on the complexity of the split logic required, it may be necessary to further develop each new strategy module, making sure it handles edge cases, encoding issues, and any other requirements that fit the use cases your library is intended to support.
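As one illustration of the kind of edge-case handling meant here, the sketch below hardens a fixed-length splitter against a missing or non-positive `:chunk_size`, invalid UTF-8 input, and empty text. `Sketch.SafeFixedLength` is a hypothetical name, it returns plain strings instead of `%Chunk{}` structs, and it splits on graphemes rather than code points so that combined characters are never cut in half; treat it as a starting point rather than the intended implementation.

```elixir
defmodule Sketch.SafeFixedLength do
  # Hypothetical helper illustrating edge-case handling for a fixed-length split.
  def split(text, opts) when is_binary(text) do
    chunk_size = Keyword.get(opts, :chunk_size)

    cond do
      # A missing, zero, or negative size would otherwise crash Enum.chunk_every/2.
      not (is_integer(chunk_size) and chunk_size > 0) ->
        raise ArgumentError,
              ":chunk_size must be a positive integer, got: #{inspect(chunk_size)}"

      # Reject binaries that are not valid UTF-8 instead of chunking garbage.
      not String.valid?(text) ->
        raise ArgumentError, "text is not valid UTF-8"

      true ->
        text
        |> String.graphemes()
        |> Enum.chunk_every(chunk_size)
        |> Enum.map(&Enum.join/1)
    end
  end
end
```

Empty input needs no special case: it produces an empty list of graphemes and therefore an empty list of chunks.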

created by ross.garfield+demo@revelry.co using Prodops