Refactor to separate chunkers from parsers so any parser can use any chunker

oxaroky02 commented 1 month ago

Note: This looks like a lot, but the actual change isn't that complicated; many files are affected because I moved things around.

Background

Prior to this PR, each document parser was tied to one specific chunker: the PDF parser was based on the basic chunker, while the text/md/docx parsers were based on the recursive splitter, and neither could really use a different chunker.

Changes

Chonker is gone. 😄
Defined Chunkers module with ...
- Tweaked CHUNKING_METHODS as an array of methods each with an id and name.
- Added InputType enumeration (not Rails)
- Added #chunker_for(chunking_profile) to find and create a chunker based on the profile (and its method)
Separated chunkers under the Chunkers module in the services/chunkers/ folder.
Two chunkers implemented as collaborators rather than mixins:
- BasicCharacterChunker which is the original regex-based splitter
- RecursiveTextChunker which is the recently added chunker that uses separator sets to assist with prioritized recursive splitting using the baran gem.
Each chunker supports a #text(text, text_type) method so the parser can specify the input type when chunking
The parsers no longer mix in the chunking method; instead, the base Text parser instantiates the appropriate chunker based on the document's chunking profile.
Added tests
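
For readers skimming the list above, here is a minimal sketch of how the pieces might fit together. It is not the code in this PR. Only `Chunkers`, `CHUNKING_METHODS`, `InputType`, `chunker_for`, `BasicCharacterChunker`, `RecursiveTextChunker` and the `#text(text, text_type)` signature come from the description. The `ChunkingProfile` struct, the method ids, the `CHUNKERS_BY_ID` lookup table, the separator sets and the error class are all assumptions made for illustration, and the recursive splitter is a plain-Ruby stand-in rather than the baran gem.

```ruby
# Hypothetical profile object; the real chunking profile is presumably a model.
ChunkingProfile = Struct.new(:chunking_method, :size, keyword_init: true)

module Chunkers
  # Each method has an id and a name, as described; these values are assumed.
  CHUNKING_METHODS = [
    { id: :basic, name: "Basic" },
    { id: :recursive_split, name: "Recursive Split" }
  ].freeze

  # Plain-Ruby enumeration (not a Rails enum); members are assumed.
  module InputType
    PLAIN_TEXT = :plain_text
    MARKDOWN = :markdown
  end

  class UnknownChunkingMethod < StandardError; end

  # Stand-in for the regex-based splitter: fixed-size character windows.
  class BasicCharacterChunker
    def initialize(profile)
      @size = profile.size
    end

    def text(text, _text_type = InputType::PLAIN_TEXT)
      text.scan(/.{1,#{@size}}/m)
    end
  end

  # Stand-in for the baran-backed chunker: tries separators in priority
  # order and only falls back to fixed-size windows when none are left.
  class RecursiveTextChunker
    SEPARATORS = {
      InputType::PLAIN_TEXT => ["\n\n", "\n", " "],
      InputType::MARKDOWN => ["\n## ", "\n\n", "\n", " "]
    }.freeze

    def initialize(profile)
      @size = profile.size
    end

    def text(text, text_type = InputType::PLAIN_TEXT)
      split(text, SEPARATORS.fetch(text_type))
    end

    private

    def split(text, separators)
      return [text] if text.length <= @size

      separator, *rest = separators
      return text.scan(/.{1,#{@size}}/m) if separator.nil?

      text.split(separator).reject(&:empty?).flat_map { |part| split(part, rest) }
    end
  end

  # Assumed lookup table from method id to chunker class.
  CHUNKERS_BY_ID = {
    basic: BasicCharacterChunker,
    recursive_split: RecursiveTextChunker
  }.freeze

  # Finds and creates a chunker based on the profile's method.
  def self.chunker_for(chunking_profile)
    klass = CHUNKERS_BY_ID.fetch(chunking_profile.chunking_method) do |id|
      raise UnknownChunkingMethod, "no chunker for #{id.inspect}"
    end
    klass.new(chunking_profile)
  end
end

profile = ChunkingProfile.new(chunking_method: :recursive_split, size: 10)
chunker = Chunkers.chunker_for(profile)
chunks = chunker.text("# A\n\npara one\n\npara two", Chunkers::InputType::MARKDOWN)
```

Because the chunker is a collaborator handed back by a factory, a parser only needs to know the `#text` interface, not which splitting strategy sits behind it.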

Outcome

When ingesting a document into a collection, the user can select either chunking method for any of the supported document types.
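
To make the outcome concrete, the sketch below pairs every parser with every chunking method. Everything in it is hypothetical: the `Document` struct, the `Parsers` module and its classes, and the tiny chunker placeholders (the "recursive" one only splits once on a single separator) are assumptions for illustration, not the project's code. The point is just that the parser picks the input type and the document's profile picks the chunker.

```ruby
# Hypothetical document record carrying its own chunking settings.
Document = Struct.new(:contents, :chunking_method, :chunk_size, keyword_init: true)

module Chunkers
  # Placeholder: fixed-size character windows.
  class BasicCharacterChunker
    def initialize(size)
      @size = size
    end

    def text(text, _text_type)
      text.scan(/.{1,#{@size}}/m)
    end
  end

  # Placeholder: a single split on a per-input-type separator (not recursive).
  class RecursiveTextChunker
    SEPARATORS = { plain_text: "\n", markdown: "\n\n" }.freeze

    def initialize(size)
      @size = size
    end

    def text(text, text_type)
      text.split(SEPARATORS.fetch(text_type)).reject(&:empty?)
    end
  end

  CLASSES = { basic: BasicCharacterChunker, recursive_split: RecursiveTextChunker }.freeze

  def self.chunker_for(document)
    CLASSES.fetch(document.chunking_method).new(document.chunk_size)
  end
end

module Parsers
  # Base parser: subclasses override the input type or the text extraction,
  # never the chunking itself.
  class Text
    def initialize(document)
      @document = document
    end

    def input_type
      :plain_text
    end

    def extracted_text
      @document.contents
    end

    def chunks
      Chunkers.chunker_for(@document).text(extracted_text, input_type)
    end
  end

  class Markdown < Text
    def input_type
      :markdown
    end
  end

  # Pretend PDF extraction: treat form feeds as page breaks.
  class Pdf < Text
    def extracted_text
      @document.contents.gsub("\f", "\n\n")
    end
  end
end

# Every parser works with every chunking method.
[Parsers::Text, Parsers::Markdown, Parsers::Pdf].each do |parser_class|
  %i[basic recursive_split].each do |chunking_method|
    doc = Document.new(contents: "one\n\ntwo", chunking_method: chunking_method, chunk_size: 4)
    puts "#{parser_class} + #{chunking_method}: #{parser_class.new(doc).chunks.inspect}"
  end
end
```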

oxaroky02 commented 1 month ago

Converting to draft ... I don't like how a chunker is being instantiated. Hold on.

oxaroky02 commented 1 month ago

> Converting to draft ... I don't like how a chunker is being instantiated. Hold on.

Done. Simpler now too.

oxaroky02 commented 1 month ago

Update: added chunker tests which revealed a bug in how #chunk is called in the recursive splitter chunker. +1 for writing tests. 😄

oxaroky02 commented 1 month ago

Update: added parser tests and data files. (OK, I'm done now. 😄 )

nickthecook commented 1 month ago

LGTM, but I'll wait a bit in case you want to change chonker back.

nickthecook / archyve