Closed oxaroky02 closed 1 month ago
Converting to draft ... I don't like how a chunker is being instantiated. Hold on.
Done. Simpler now too.
Update: added chunker tests, which revealed a bug in how `#chunk` is called in the recursive splitter chunker. +1 for writing tests.
Update: added parser tests and data files. (OK, I'm done now.)
LGTM, but I'll wait a bit in case you want to change `chonker` back.
Background
Prior to this PR, each document parser was tied to one specific chunker: the PDF parser was based on the basic chunker, while the text/md/docx parsers were based on the recursive splitter. Neither could really use a different chunker.
Changes
- `Chonker` is gone.
- `Chunkers` module with ...
  - `CHUNKING_METHODS` as an array of methods, each with an `id` and `name`.
  - `InputType` enumeration (not Rails)
  - `#chunker_for(chunking_profile)` to find and create a chunker based on the profile (and its method)
- `Chunkers` module in the `services/chunkers/` folder.
  - `BasicCharacterChunker`, which is the original regex-based splitter
  - `RecursiveTextChunker`, which is the recently added chunker that uses separator sets to assist with prioritized recursive splitting using the `baran` gem.
- `#text(text, text_type)` method so the parser can specify the input type when chunking
- `Text` parser supports instantiating the appropriate chunker based on the document's chunking profile.
Outcome
When ingesting a document into a collection, the user can select either chunking method for any of the supported document types.
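
For readers skimming the thread, here is a rough sketch of how a `CHUNKING_METHODS` registry and `chunker_for` lookup could fit together. It is not the code from this PR. The `klass` key, the method ids, the `ChunkingProfile` struct with its `method_id` field, and the `chunk(text)` bodies are all assumptions made for illustration. In particular, the stand-in `RecursiveTextChunker` below does not call the `baran` gem.

```ruby
module Chunkers
  class BasicCharacterChunker
    # Stand-in for the regex-based splitter: split on blank lines.
    def chunk(text)
      text.split(/\n{2,}/).map(&:strip).reject(&:empty?)
    end
  end

  class RecursiveTextChunker
    # Stand-in only; the chunker in the PR delegates to the baran gem.
    def chunk(text)
      text.split(/\n+/).map(&:strip).reject(&:empty?)
    end
  end

  # Each method has an id and a name; the klass key is a hypothetical
  # addition so the lookup can instantiate the right chunker.
  CHUNKING_METHODS = [
    { id: "basic", name: "Basic character", klass: BasicCharacterChunker },
    { id: "recursive_text", name: "Recursive text", klass: RecursiveTextChunker }
  ].freeze

  # Find the method named by the profile and build its chunker.
  def self.chunker_for(chunking_profile)
    entry = CHUNKING_METHODS.find { |m| m[:id] == chunking_profile.method_id }
    unless entry
      raise ArgumentError, "unknown chunking method: #{chunking_profile.method_id.inspect}"
    end
    entry[:klass].new
  end
end

# Hypothetical minimal profile; the real one is presumably a model.
ChunkingProfile = Struct.new(:method_id)

chunker = Chunkers.chunker_for(ChunkingProfile.new("recursive_text"))
chunks = chunker.chunk("first line\nsecond line")
```

With a lookup like this, a parser only needs the document's chunking profile and never refers to a concrete chunker class, which is what lets any supported document type use either method.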