revelrylabs / text_chunker_ex

A library for semantically coherent text chunking
MIT License

Strategy: html #22

Open cpursley opened 2 months ago

cpursley commented 2 months ago

Are there any plans for, or interest in, an HTML chunking strategy?

There are some ideas here: https://medium.com/unstructured-io/easy-web-scraping-and-chunking-by-document-elements-for-large-language-models-c45d13aca8dd

cpursley commented 1 month ago

Created a PR: https://github.com/revelrylabs/text_chunker_ex/pull/23

Please let me know how I can improve it.

cpursley commented 1 month ago

Just saw the metadata branch: https://github.com/revelrylabs/text_chunker_ex/tree/14-metadata-to-chunk

I think HTML chunking could work well with that branch if it follows this sort of strategy: https://blog.langchain.dev/a-chunk-by-any-other-name#Q+A-with-Structured-Chunking

stuartjohnpage commented 1 month ago

This is absolutely the kind of thing we want to support; thank you for your contribution!

That article is very interesting. I let that metadata branch go, because it seemed like the data I was adding during the splitting wasn't actually relevant to the splitting itself.

However, adding the name of any given HTML or markdown section to metadata per chunk might be a worthy cause. Hell, if we want to split on functions and modules, having that information in the chunk itself just sounds like more context for the chunk, which sounds great.

In the meantime, HTML splitters that split a document according to its own explicit structure are much appreciated ❤️
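
For illustration only, the idea discussed above (splitting HTML along its own structure and attaching the section name to each chunk as metadata) could look roughly like the sketch below. It is written in Python with the standard-library `html.parser` module purely to show the shape of the approach. It is not the API of text_chunker_ex, and it is not the implementation in the PR; the names `SectionChunker` and `chunk_html` and the `{"text", "metadata"}` chunk shape are assumptions made up for this example.

```python
from html.parser import HTMLParser

HEADINGS = {"h1", "h2", "h3", "h4", "h5", "h6"}


class SectionChunker(HTMLParser):
    """Hypothetical helper: one chunk per heading-delimited section."""

    def __init__(self):
        super().__init__()
        self.chunks = []          # finished chunks
        self._section = None      # heading text of the section being collected
        self._heading_parts = []  # text fragments of the heading currently open
        self._body_parts = []     # text fragments of the current section body
        self._in_heading = False

    def handle_starttag(self, tag, attrs):
        if tag in HEADINGS:
            # A new heading closes the previous section.
            self._flush()
            self._in_heading = True
            self._heading_parts = []

    def handle_endtag(self, tag):
        if tag in HEADINGS and self._in_heading:
            self._in_heading = False
            self._section = " ".join("".join(self._heading_parts).split())

    def handle_data(self, data):
        target = self._heading_parts if self._in_heading else self._body_parts
        target.append(data)

    def _flush(self):
        # Collapse whitespace and emit a chunk only if the section has text.
        text = " ".join("".join(self._body_parts).split())
        if text:
            self.chunks.append({"text": text, "metadata": {"section": self._section}})
        self._body_parts = []

    def close(self):
        super().close()
        self._flush()  # emit the final section


def chunk_html(html):
    """Return a list of chunks, each tagged with the heading it appeared under."""
    parser = SectionChunker()
    parser.feed(html)
    parser.close()
    return parser.chunks
```

Text that appears before the first heading ends up with `section` set to `None`. A real strategy would also need to respect a maximum chunk size and probably track the heading hierarchy, which this sketch leaves out.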