To integrate multiple splitting strategies into the Text Chunker library, we should define additional modules that implement the Chunker.SplitterBehaviour. These new modules will be SimpleDelimiterSplit, RegularExpressionSplit, and FixedLengthSplit. Let's first define these modules with basic implementations, then update the TextChunker module to support strategy selection.
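The strategy modules in this section rely on `Chunker.SplitterBehaviour` and `%Chunker.Chunk{}`, neither of which is shown here. The following is a minimal sketch of what they might look like, assuming the behaviour exposes a single `split/2` callback and the struct carries at least a `:text` field. `WholeTextSplit` is a hypothetical strategy included only to show the contract in use; the real definitions in the library may differ.

```elixir
defmodule Chunker.Chunk do
  # Assumed shape: the strategies in this section only ever set :text.
  defstruct [:text]

  @type t :: %__MODULE__{text: String.t() | nil}
end

defmodule Chunker.SplitterBehaviour do
  # Assumed contract: receive the text and the merged options, return a list of chunks.
  @callback split(text :: binary(), opts :: keyword()) :: [Chunker.Chunk.t()]
end

defmodule Chunker.Splitters.WholeTextSplit do
  # Hypothetical strategy, for illustration only: returns the input as a single chunk.
  @behaviour Chunker.SplitterBehaviour

  @impl true
  def split(text, _opts), do: [%Chunker.Chunk{text: text}]
end
```

Declaring `@behaviour` lets the compiler warn when a strategy module forgets to define `split/2`.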

First, we define the new splitting strategies.

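defmodule Chunker.Splitters.SimpleDelimiterSplit do
  @behaviour Chunker.SplitterBehaviour

  def split(text, opts) do
    delimiter = opts[:delimiter] || "\n"
    chunk_size = opts[:chunk_size]

    text
    |> String.split(delimiter)
    |> Enum.map(&String.trim/1) # Trim the pieces
    |> Enum.reject(&(&1 == ""))
    # Pack pieces greedily so each joined chunk stays within chunk_size
    # (String.length/1 counts graphemes). An oversized piece becomes its own chunk.
    |> Enum.chunk_while(
      nil,
      fn
        piece, nil -> {:cont, piece}
        piece, acc ->
          joined = acc <> delimiter <> piece
          if String.length(joined) <= chunk_size, do: {:cont, joined}, else: {:cont, acc, piece}
      end,
      fn acc -> if acc, do: {:cont, acc, nil}, else: {:cont, nil} end
    )
    |> Enum.map(&%Chunker.Chunk{text: &1})
  end
end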

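defmodule Chunker.Splitters.RegularExpressionSplit do
  @behaviour Chunker.SplitterBehaviour

  def split(text, opts) do
    regex = opts[:regex] || ~r/\s+/
    chunk_size = opts[:chunk_size]

    text
    |> String.split(regex, trim: true) # String.split/3 accepts a regex as the pattern
    |> Enum.map(&String.trim/1)
    |> Enum.reject(&(&1 == ""))
    # Pack pieces greedily, joined by a space, so each chunk stays within chunk_size
    # (String.length/1 counts graphemes). An oversized piece becomes its own chunk.
    |> Enum.chunk_while(
      nil,
      fn
        piece, nil -> {:cont, piece}
        piece, acc ->
          joined = acc <> " " <> piece
          if String.length(joined) <= chunk_size, do: {:cont, joined}, else: {:cont, acc, piece}
      end,
      fn acc -> if acc, do: {:cont, acc, nil}, else: {:cont, nil} end
    )
    |> Enum.map(&%Chunker.Chunk{text: &1})
  end
end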

defmodule Chunker.Splitters.FixedLengthSplit do
  @behaviour Chunker.SplitterBehaviour

  def split(text, opts) do
    chunk_size = opts[:chunk_size]

    text
    # Work on graphemes so a multi-byte character is never cut in half
    |> String.graphemes()
    # The final chunk may be shorter than chunk_size
    |> Enum.chunk_every(chunk_size)
    |> Enum.map(&%Chunker.Chunk{text: Enum.join(&1)})
  end
end
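Fixed-length splitting can be tried in isolation from the library. The sketch below uses a hypothetical `FixedLengthSketch` module (not part of Text Chunker) that returns plain strings rather than `%Chunker.Chunk{}` structs, so it runs without any other definitions.

```elixir
defmodule FixedLengthSketch do
  # Splits text into pieces of at most `size` graphemes.
  # The last piece may be shorter than `size`.
  def pieces(text, size) when is_integer(size) and size > 0 do
    text
    |> String.graphemes()
    |> Enum.chunk_every(size)
    |> Enum.map(&Enum.join/1)
  end
end

FixedLengthSketch.pieces("abcdefghij", 4)
# => ["abcd", "efgh", "ij"]
```

An empty input yields an empty list, since there are no graphemes to group.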

Next, we update Chunker.TextChunker to allow specification of the splitting strategy:

defmodule Chunker.TextChunker do
  # Only the default strategy needs an alias; the others are passed in by the caller
  alias Chunker.Splitters.RecursiveSplit
  alias Chunker.Chunk

  # Default options; :strategy accepts any module implementing Chunker.SplitterBehaviour
  @default_opts [
    chunk_size: 2000,
    chunk_overlap: 200,
    strategy: RecursiveSplit,
    format: :plaintext
  ]

  @spec split(binary(), keyword()) :: [Chunk.t()]
  def split(text, opts \\ []) do
    opts = Keyword.merge(@default_opts, opts)
    strategy_module = opts[:strategy]
    # Call the split function of the given strategy module
    strategy_module.split(text, opts)
  end
end
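Because `split/2` calls whichever module is passed as `:strategy`, a misspelled module name only surfaces at call time as an `UndefinedFunctionError`. One possible guard, sketched here with hypothetical `DispatchSketch` and `UpcaseStrategy` modules that are not part of the library, checks that the module is loaded and exports `split/2` before dispatching.

```elixir
defmodule DispatchSketch do
  # Hypothetical helper: validates the strategy module before calling it.
  def split(text, opts) do
    strategy = Keyword.fetch!(opts, :strategy)

    if Code.ensure_loaded?(strategy) and function_exported?(strategy, :split, 2) do
      {:ok, strategy.split(text, opts)}
    else
      {:error, {:invalid_strategy, strategy}}
    end
  end
end

defmodule UpcaseStrategy do
  # Toy strategy used only to exercise the dispatcher.
  def split(text, _opts), do: [String.upcase(text)]
end

DispatchSketch.split("hi", strategy: UpcaseStrategy)
# => {:ok, ["HI"]}
```

Returning a tagged tuple is a design choice for this sketch; raising an `ArgumentError` with a descriptive message would work equally well if the library prefers to fail fast.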

Now, let's update the documentation of Chunker.TextChunker to instruct users on how to specify the splitting strategy:

  @doc ~S"""
  Splits the provided text into a list of `%Chunk{}` structs.

  ## Options
  * `:chunk_size` - Maximum size in code point length for each chunk.
  * `:chunk_overlap` - Number of overlapping code points between consecutive chunks to preserve context. The three strategies added here do not apply it.
  * `:strategy` - A module that implements the `Chunker.SplitterBehaviour`. Use `Chunker.Splitters.RecursiveSplit` (the default), `Chunker.Splitters.SimpleDelimiterSplit`, `Chunker.Splitters.RegularExpressionSplit`, or `Chunker.Splitters.FixedLengthSplit`.
  * `:format` - The format of the input text. Used to determine where to split the text in some strategies.
  * `:delimiter` - Used by `SimpleDelimiterSplit`: the string on which to split the text. Defaults to `"\n"` (newline).
  * `:regex` - Used by `RegularExpressionSplit`: the regular expression on which to split the text. Defaults to `~r/\s+/`.

  ## Examples

      long_text = "This is a very long text that needs to be split into smaller pieces for easier handling."

      # Default recursive splitting
      Chunker.TextChunker.split(long_text)

      # Simple delimiter-based splitting
      Chunker.TextChunker.split(long_text, strategy: Chunker.Splitters.SimpleDelimiterSplit, delimiter: ".")

      # Regular expression-based splitting
      Chunker.TextChunker.split(long_text, strategy: Chunker.Splitters.RegularExpressionSplit, regex: ~r/[.!?]/)

      # Fixed length splitting
      Chunker.TextChunker.split(long_text, strategy: Chunker.Splitters.FixedLengthSplit, chunk_size: 10)

      # Generates many smaller chunks with significant overlap
      Chunker.TextChunker.split(long_text, chunk_size: 10, chunk_overlap: 3, strategy: Chunker.Splitters.RecursiveSplit)

"""

Please note that, depending on the complexity of the split logic required, it may be necessary to further develop each new strategy module, making sure it handles edge cases, encoding issues, and any other requirements that fit the use cases your library is intended to support.
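As one concrete instance of the encoding issues mentioned above, the "length" of a string depends on what is being counted. The sketch below uses a hypothetical `LengthSketch` module to compare bytes, code points, and graphemes for an "é" written as a plain "e" followed by a combining acute accent (U+0301).

```elixir
defmodule LengthSketch do
  # Reports three different notions of length for the same string.
  def measure(text) do
    %{
      bytes: byte_size(text),
      codepoints: length(String.codepoints(text)),
      graphemes: String.length(text)
    }
  end
end

LengthSketch.measure("e\u0301")
# => %{bytes: 3, codepoints: 2, graphemes: 1}
```

A strategy that slices by bytes or code points could separate the accent from its base letter, whereas `String.graphemes/1` and `String.slice/3` keep the pair together.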

created by ross.garfield+demo@revelry.co using Prodops

rgarfield11 / text_chunker_ex

Engineer - Text Chunker - Integrate Splitting Strategies #1
