Closed oxaroky02 closed 4 months ago
These chunks are readable!
Before I dive in, is it expected that PDF files would still use simple word splitting?
For this PR, yes. The PDF conversion yields text with ... oddness. I haven't really tested that with the new splitter, and I'd like more time for that.
- `baran` gem which implements character splitters based on Langchain's text splitters
- `Text` parser to use recursive text splitting
- `CommonMark` parser derived from `Text` with custom separator rules to account for markdown and embedded HTML tables
- `Docx` parser to subclass `CommonMark` and tweak pandoc invocation to not word-wrap long lines
- `Parsers#parser_for` to select `CommonMark` parser for `.md` files
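
For readers unfamiliar with recursive text splitting, here is a rough sketch of the idea in plain Ruby. It is not the `baran` gem's implementation or API: the method name `recursive_split`, its parameters, and the default separator list are all hypothetical, chosen only for illustration. The splitter tries the coarsest separator first and falls back to finer ones only for pieces that are still too long.

```ruby
# Simplified illustration of recursive character splitting.
# NOTE: hypothetical helper, not the baran gem's code or API.
def recursive_split(text, chunk_size:, separators: ["\n\n", "\n", " ", ""])
  return [text] if text.length <= chunk_size

  separator, *finer = separators
  # Last resort: no usable separator left, so cut at fixed width.
  return text.scan(/.{1,#{chunk_size}}/m) if separator.nil? || separator.empty?

  chunks = []
  current = ""
  text.split(separator).each do |piece|
    candidate = current.empty? ? piece : current + separator + piece
    if candidate.length <= chunk_size
      # Keep merging neighbouring pieces while they fit in one chunk.
      current = candidate
    else
      chunks << current unless current.empty?
      if piece.length > chunk_size
        # The piece alone is too long: retry it with the finer separators.
        chunks.concat(recursive_split(piece, chunk_size: chunk_size, separators: finer))
        current = ""
      else
        current = piece
      end
    end
  end
  chunks << current unless current.empty?
  chunks
end

text = "alpha beta\n\ngamma delta epsilon\n\nzeta"
p recursive_split(text, chunk_size: 12)
```

A markdown-aware variant, along the lines of the `CommonMark` parser described above, would presumably pass a different `separators` list (for example one that prefers heading boundaries) rather than change the algorithm. This sketch also omits chunk overlap.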