nickthecook / archyve

GNU Affero General Public License v3.0

Refactor to separate chunkers from parsers so any parser can use any chunker #52

Closed by oxaroky02 1 month ago

oxaroky02 commented 1 month ago

Note: It looks like a lot, but the actual change isn't that complicated; many files are affected only because I moved things around.

Background

Prior to this PR, each document parser was tied to one specific chunker: the PDF parser was based on the basic chunker, while the text/md/docx parsers were based on the recursive splitter, and neither could really use a different chunker.

Changes

Outcome

When ingesting a document into a collection, the user can select either chunking method for any of the supported document types.
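To illustrate the idea of separating the two, here is a minimal sketch in Ruby. It is not the code from this PR: every class, method, and constant name below is hypothetical, and the two chunkers are simplified stand-ins rather than the project's basic chunker or recursive splitter. The only point it shows is that a parser receives its chunker as an argument instead of inheriting from one.

```ruby
# Hypothetical sketch: none of these names are taken from the Archyve codebase.

# A chunker only knows how to split a string into chunks.
class FixedSizeChunker
  def initialize(chunk_size: 20)
    @chunk_size = chunk_size
  end

  # Splits text into pieces of at most @chunk_size characters.
  def chunk(text)
    text.scan(/.{1,#{@chunk_size}}/m)
  end
end

# Simplified stand-in for a smarter splitter: one chunk per paragraph.
class ParagraphChunker
  def chunk(text)
    text.split(/\n{2,}/).map(&:strip).reject(&:empty?)
  end
end

# A parser only knows how to extract text from one document type.
# The chunker is injected, so any parser works with any chunker.
class PlainTextParser
  def initialize(chunker)
    @chunker = chunker
  end

  def extract_text(raw)
    raw
  end

  def chunks(raw)
    @chunker.chunk(extract_text(raw))
  end
end

class MarkdownParser < PlainTextParser
  # Strips leading heading markers; everything else is left untouched.
  def extract_text(raw)
    raw.gsub(/^#+[ \t]*/, "")
  end
end

# Maps a user-selected chunking method to a chunker factory (assumed keys).
CHUNKERS = {
  fixed: -> { FixedSizeChunker.new(chunk_size: 20) },
  paragraph: -> { ParagraphChunker.new }
}.freeze

def build_parser(parser_class, chunking_method)
  parser_class.new(CHUNKERS.fetch(chunking_method).call)
end
```

With this shape, `build_parser(MarkdownParser, :fixed)` and `build_parser(PlainTextParser, :paragraph)` are both valid combinations, and adding a new chunker means adding one entry to the lookup table without touching any parser.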

oxaroky02 commented 1 month ago

Converting to draft ... I don't like how a chunker is being instantiated. Hold on.

oxaroky02 commented 1 month ago

Done. Simpler now too.

oxaroky02 commented 1 month ago

Update: added chunker tests which revealed a bug in how #chunk is called in the recursive splitter chunker. +1 for writing tests. 😄

oxaroky02 commented 1 month ago

Update: added parser tests and data files. (OK, I'm done now. 😄)

nickthecook commented 1 month ago

LGTM, but I'll wait a bit in case you want to change chonker back.