llmsherpa (parsing PDFs via the docker image) seems to be good at keeping tables in single chunks; other than that, though, it returns trivially small chunks.
These include:

- Single characters (like a copyright symbol)
- Small runs of characters like "**"
- Single words
- Single sentences
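For reference, this is roughly how I'm producing the chunks. The URL and setup below reflect my local configuration (the nlm-ingestor docker container on port 5010), so treat this as a sketch rather than a canonical snippet:

```python
from llmsherpa.readers import LayoutPDFReader

# Local nlm-ingestor docker container; port and renderFormat are my setup,
# adjust for your own deployment.
llmsherpa_api_url = "http://localhost:5010/api/parseDocument?renderFormat=all"

reader = LayoutPDFReader(llmsherpa_api_url)
doc = reader.read_pdf("example.pdf")

# Many of these come back as single characters, "**", lone words, or
# single sentences; only tables reliably arrive as one coherent chunk.
for chunk in doc.chunks():
    print(repr(chunk.to_context_text()))
```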
Unfortunately, each item from bulleted and numbered lists comes across as a separate chunk, rather than the whole list arriving as a single chunk containing all of its items.
I'd expect related items to land in a single chunk, and unrelated items to also be merged into larger chunks (the sweet spot seems to be about 1,000 tokens). I don't see a way to tell the algorithm what the target chunk size and overlap should be when no heuristics apply that would otherwise determine valid semantic chunk boundaries.
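For now I'm working around it with a post-processing pass that greedily merges consecutive chunks until they approach a target size, carrying a small overlap between merged chunks. This is just a sketch over the chunk texts: the ~1,000-token target, the whitespace-split token estimate, and the `merge_chunks` helper are my own approximations, not part of llmsherpa:

```python
def merge_chunks(texts, target_tokens=1000, overlap_chunks=1):
    """Greedily merge consecutive chunk texts until they reach roughly
    target_tokens, starting each new merged chunk with the last
    overlap_chunks pieces of the previous one as overlap.
    Token counts are approximated by whitespace splitting."""
    merged = []
    current, current_tokens = [], 0
    for text in texts:
        n = len(text.split())  # crude token estimate
        if current and current_tokens + n > target_tokens:
            merged.append("\n".join(current))
            # Carry the tail of this chunk forward as overlap.
            current = current[-overlap_chunks:]
            current_tokens = sum(len(t.split()) for t in current)
        current.append(text)
        current_tokens += n
    if current:
        merged.append("\n".join(current))
    return merged


# Example: merge the trivially small llmsherpa chunks into ~1,000-token ones.
# chunk_texts = [c.to_context_text() for c in doc.chunks()]
# big_chunks = merge_chunks(chunk_texts)
```

It would be much nicer if the parser itself exposed target size and overlap parameters, since a blind merge like this can still glue unrelated sections together.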