Open rgarfield11 opened 4 months ago
To integrate multiple splitting strategies into the Text Chunker library, we should define additional modules that implement Chunker.SplitterBehaviour. These new modules will be SimpleDelimiterSplit, RegularExpressionSplit, and FixedLengthSplit. Let's first define these modules with basic implementations, then update the TextChunker module to support strategy selection.
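For orientation, here is a minimal sketch of what the behaviour contract might look like. The real Chunker.SplitterBehaviour and Chunker.Chunk are not shown in this issue, so everything under the hypothetical BehaviourSketch namespace is an assumption: a single split/2 callback (inferred from how the strategies are called) and a struct with a :text field (inferred from how chunks are built).

```elixir
# Hypothetical stand-ins; the library's real modules may differ.
defmodule BehaviourSketch.Chunk do
  # Assumed shape: the strategies in this issue only ever set :text.
  defstruct [:text]
  @type t :: %__MODULE__{text: String.t()}
end

defmodule BehaviourSketch.SplitterBehaviour do
  # Assumed contract: every strategy exposes split/2 and returns a list of chunks.
  @callback split(text :: binary(), opts :: keyword()) :: [BehaviourSketch.Chunk.t()]
end

defmodule BehaviourSketch.WholeTextSplit do
  # Trivial strategy that returns the input as a single chunk.
  @behaviour BehaviourSketch.SplitterBehaviour

  @impl true
  def split(text, _opts), do: [%BehaviourSketch.Chunk{text: text}]
end
```

Any module that adopts the behaviour and defines split/2 can then be passed around as a strategy.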
First, we define the new splitting strategies.
defmodule Chunker.Splitters.SimpleDelimiterSplit do
  @behaviour Chunker.SplitterBehaviour

  @impl true
  def split(text, opts) do
    delimiter = opts[:delimiter] || "\n"
    # Note: :chunk_size here counts delimited pieces per chunk, not code points.
    chunk_size = opts[:chunk_size]

    text
    |> String.split(delimiter)
    # Trim the pieces
    |> Enum.map(&String.trim/1)
    # Group the pieces, then join each group back into a single string
    |> Enum.chunk_every(chunk_size)
    |> Enum.map(&Enum.join(&1, delimiter))
    |> Enum.map(&%Chunker.Chunk{text: &1})
  end
end
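To make the split, trim, group, join pipeline concrete, here is a small sketch that works on plain strings instead of Chunker.Chunk structs. DelimiterSketch.group/3 is a hypothetical helper, not part of the library; it also drops empty pieces, which the module above does not do.

```elixir
defmodule DelimiterSketch do
  # Hypothetical helper: groups delimited pieces into joined strings.
  def group(text, delimiter, pieces_per_chunk) do
    text
    |> String.split(delimiter)
    |> Enum.map(&String.trim/1)
    # Assumption: empty pieces (e.g. from a trailing delimiter) are not wanted.
    |> Enum.reject(&(&1 == ""))
    |> Enum.chunk_every(pieces_per_chunk)
    |> Enum.map(&Enum.join(&1, delimiter))
  end
end

DelimiterSketch.group("a\nb\nc\nd\ne", "\n", 2)
# => ["a\nb", "c\nd", "e"]
```

Note that the size argument counts pieces rather than characters, so with the library default of 2000 a short text ends up in a single chunk.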
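defmodule Chunker.Splitters.RegularExpressionSplit do
  @behaviour Chunker.SplitterBehaviour

  @impl true
  def split(text, opts) do
    regex = opts[:regex] || ~r/\s+/
    # Note: :chunk_size here counts matched pieces per chunk, not code points.
    chunk_size = opts[:chunk_size]

    # Regex.split/2 takes the regex first, then the string
    regex
    |> Regex.split(text)
    |> Enum.map(&String.trim/1)
    # The original separators are lost, so pieces are rejoined with a single space
    |> Enum.chunk_every(chunk_size)
    |> Enum.map(&Enum.join(&1, " "))
    |> Enum.map(&%Chunker.Chunk{text: &1})
  end
end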
defmodule Chunker.Splitters.FixedLengthSplit do
  @behaviour Chunker.SplitterBehaviour

  @impl true
  def split(text, opts) do
    chunk_size = opts[:chunk_size]

    text
    # Split on graphemes so multi-code-point characters are never cut in half
    |> String.graphemes()
    |> Enum.chunk_every(chunk_size)
    |> Enum.map(&Enum.join/1)
    |> Enum.map(&%Chunker.Chunk{text: &1})
  end
end
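As a rough illustration of fixed-length splitting, the sketch below slices a string into pieces of a given number of graphemes. FixedLengthSketch.slices/2 is a hypothetical helper that returns plain strings; counting graphemes rather than code points or bytes is an assumption about what "length" should mean here.

```elixir
defmodule FixedLengthSketch do
  # Hypothetical helper: cuts text into pieces of `size` graphemes each.
  def slices(text, size) when is_integer(size) and size > 0 do
    text
    |> String.graphemes()
    |> Enum.chunk_every(size)
    |> Enum.map(&Enum.join/1)
  end
end

FixedLengthSketch.slices("abcdefgh", 3)
# => ["abc", "def", "gh"]
```

The last piece is simply shorter when the text length is not a multiple of the size. Working on graphemes means a base letter followed by a combining accent stays together in one piece.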
Next, we update Chunker.TextChunker to allow specification of the splitting strategy:
defmodule Chunker.TextChunker do
  alias Chunker.Chunk
  # Only the default strategy needs an alias here; callers pass the others in via :strategy
  alias Chunker.Splitters.RecursiveSplit

  # :strategy accepts any module that implements Chunker.SplitterBehaviour
  @default_opts [
    chunk_size: 2000,
    chunk_overlap: 200,
    strategy: RecursiveSplit,
    format: :plaintext
  ]

  @spec split(binary(), keyword()) :: [Chunk.t()]
  def split(text, opts \\ []) do
    opts = Keyword.merge(@default_opts, opts)
    strategy_module = opts[:strategy]

    # Call the split function of the given strategy module
    strategy_module.split(text, opts)
  end
end
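Because the strategy is just a module passed in by the caller, it may be worth checking it before dispatching. The following is a minimal sketch under assumed names (DispatchSketch and its toy UpcaseSplit strategy are hypothetical, and the tagged-tuple return is a design choice not present in the proposal above).

```elixir
defmodule DispatchSketch.UpcaseSplit do
  # Toy strategy used only to exercise the dispatcher.
  def split(text, _opts), do: [String.upcase(text)]
end

defmodule DispatchSketch do
  @default_opts [chunk_size: 2000, strategy: DispatchSketch.UpcaseSplit]

  def split(text, opts \\ []) do
    opts = Keyword.merge(@default_opts, opts)
    strategy = Keyword.fetch!(opts, :strategy)

    # Only dispatch to a loadable module that exports split/2.
    if is_atom(strategy) and Code.ensure_loaded?(strategy) and
         function_exported?(strategy, :split, 2) do
      {:ok, strategy.split(text, opts)}
    else
      {:error, {:invalid_strategy, strategy}}
    end
  end
end

DispatchSketch.split("hi")
# => {:ok, ["HI"]}
DispatchSketch.split("hi", strategy: NoSuchStrategy)
# => {:error, {:invalid_strategy, NoSuchStrategy}}
```

This check only looks at the shape of the module: anything exporting a split/2 function passes, whether or not it declares the behaviour, so it is a guard against typos rather than a full contract check.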
Now, let's update the documentation of Chunker.TextChunker to instruct users on how to specify the splitting strategy:
@doc """
Splits the provided text into a list of `%Chunk{}` structs.

## Options

  * `:chunk_size` - Maximum size in code point length for each chunk. With `SimpleDelimiterSplit` and `RegularExpressionSplit`, it is the number of split pieces grouped into each chunk.
  * `:chunk_overlap` - Number of overlapping code points between consecutive chunks to preserve context. Not used by `SimpleDelimiterSplit`, `RegularExpressionSplit`, or `FixedLengthSplit`.
  * `:strategy` - A module that implements the `Chunker.SplitterBehaviour`. Use `Chunker.Splitters.RecursiveSplit`, `Chunker.Splitters.SimpleDelimiterSplit`, `Chunker.Splitters.RegularExpressionSplit`, or `Chunker.Splitters.FixedLengthSplit`.
  * `:format` - The format of the input text. Used to determine where to split the text in some strategies.
  * `:delimiter` - Used by `SimpleDelimiterSplit`: the string on which to split the text. Defaults to a newline.
  * `:regex` - Used by `RegularExpressionSplit`: the regular expression on which to split the text. Defaults to runs of whitespace.

## Examples

    long_text = "This is a very long text that needs to be split into smaller pieces for easier handling."
    Chunker.TextChunker.split(long_text)
    Chunker.TextChunker.split(long_text, strategy: Chunker.Splitters.SimpleDelimiterSplit, delimiter: ".")
    Chunker.TextChunker.split(long_text, strategy: Chunker.Splitters.RegularExpressionSplit, regex: ~r/[.!?]/)
    Chunker.TextChunker.split(long_text, strategy: Chunker.Splitters.FixedLengthSplit, chunk_size: 10)

    # Generates many smaller chunks with significant overlap
    Chunker.TextChunker.split(long_text, chunk_size: 10, chunk_overlap: 3, strategy: Chunker.Splitters.RecursiveSplit)
"""
Please note that, depending on the complexity of the split logic required, it may be necessary to further develop each new strategy module, making sure it handles edge cases, encoding issues, and any other requirements that fit the use cases your library is intended to support.
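One such edge case is invalid options: a missing or non-positive chunk size makes Enum.chunk_every raise, and an overlap that is not smaller than the chunk size is unlikely to be meaningful. A minimal validation sketch, assuming a hypothetical OptsSketch.validate/1 helper and error atoms that are not part of the library:

```elixir
defmodule OptsSketch do
  # Hypothetical helper: checks the size-related options before splitting.
  def validate(opts) do
    size = Keyword.get(opts, :chunk_size)
    overlap = Keyword.get(opts, :chunk_overlap, 0)

    cond do
      not (is_integer(size) and size > 0) -> {:error, :invalid_chunk_size}
      not (is_integer(overlap) and overlap >= 0) -> {:error, :invalid_chunk_overlap}
      overlap >= size -> {:error, :overlap_not_smaller_than_chunk_size}
      true -> :ok
    end
  end
end

OptsSketch.validate(chunk_size: 10, chunk_overlap: 3)
# => :ok
OptsSketch.validate(chunk_size: 5, chunk_overlap: 5)
# => {:error, :overlap_not_smaller_than_chunk_size}
```

Whether to return tagged tuples or raise is a design choice left open here; the sketch only shows which conditions might be worth checking.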
created by ross.garfield+demo@revelry.co using Prodops
Background
Currently, our text-splitting library 'Text Chunker' utilizes a single strategy for splitting text: Recursive Split. This method was adapted from LangChain, and while it's effective, it doesn't cover the various text-splitting needs for different use cases. To address this gap and provide more flexibility to our users, we plan to introduce alternative strategies for text splitting that can accommodate a wider range of text types and splitting requirements.
Acceptance Criteria
Scenario: Implement new text-splitting strategies
Given the “Text Chunker” library currently supports only "Recursive Split"
[ ] And documentation should be updated to instruct users on how to specify the splitting strategy and understand the implications of each strategy on their text-splitting needs