Trivially small chunks returned

llmsherpa (on PDFs parsed using the Docker image) seems to be good at keeping tables in single chunks. Other than that, however, it returns trivially small chunks. These include:

Single characters (like a copyright symbol)
Small runs of characters like "**"
Single words
Single sentences

Unfortunately, each item in a bulleted or numbered list also comes across as a separate chunk, rather than the whole list arriving as a single chunk with all of its items included.

I'd expect related items to be kept in single chunks, and small unrelated items to be merged into larger chunks as well (the sweet spot seems to be about 1000 tokens). I don't see a way to tell the algorithm what the average chunk size and overlap should be when no heuristics apply that would otherwise determine valid semantic chunk boundaries.
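To make the expected behaviour concrete, here is a minimal sketch of the kind of merging described above, written as a post-processing step over the chunk texts rather than as anything the library provides. The names `merge_small_chunks`, `target_tokens`, and `overlap_tokens` are hypothetical and are not llmsherpa or nlm-ingestor options. Tokens are approximated by whitespace-separated words, which is an assumption; a real tokenizer would give different counts.

```python
from typing import List


def count_tokens(text: str) -> int:
    """Rough token count: whitespace-separated words (an approximation)."""
    return len(text.split())


def merge_small_chunks(
    chunks: List[str],
    target_tokens: int = 1000,  # hypothetical parameter, not a library option
    overlap_tokens: int = 0,    # hypothetical parameter, not a library option
) -> List[str]:
    """Greedily pack consecutive chunks into larger ones.

    A merged chunk is closed when adding the next input chunk would push it
    past target_tokens. If overlap_tokens > 0, the last overlap_tokens words
    of the closed chunk are repeated at the start of the next one.
    """
    merged: List[str] = []
    current: List[str] = []
    current_tokens = 0

    for chunk in chunks:
        n = count_tokens(chunk)
        if current and current_tokens + n > target_tokens:
            text = "\n".join(current)
            merged.append(text)
            # Carry the tail of the closed chunk forward as overlap.
            tail = text.split()[-overlap_tokens:] if overlap_tokens > 0 else []
            current = [" ".join(tail)] if tail else []
            current_tokens = len(tail)
        current.append(chunk)
        current_tokens += n

    if current:
        merged.append("\n".join(current))
    return merged


if __name__ == "__main__":
    tiny = ["©", "**", "Introduction", "First item", "Second item"]
    # With the default target, all of these tiny chunks end up in one chunk.
    print(merge_small_chunks(tiny))
    # A very small target is used here only to show where the splits fall.
    print(merge_small_chunks(tiny, target_tokens=4, overlap_tokens=1))
```

This ignores semantic boundaries entirely, so it is only meant as the fallback for when no heuristic applies. A single input chunk larger than `target_tokens` is passed through unsplit.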
