[Feature Request]: Better HTML Chunking

Feature Description

Hi everyone, check out this HTML chunking package :package: pip install html_chunking

Our HTML chunking algorithm works in several stages that split HTML content into chunks and merge them back together while staying within a token limit. It is suited to cases where the token budget is tight and the HTML must be parsed accurately, for example web automation or navigation tasks that take HTML as input.
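
To make the "merge while staying within a token limit" idea concrete, here is a minimal sketch of a greedy merge step. It is not the html_chunking implementation: `merge_under_budget` is a hypothetical helper, and counting whitespace-separated words is only an assumed stand-in for a real tokenizer.

```python
def merge_under_budget(fragments, max_tokens):
    """Greedily merge adjacent HTML fragments so each chunk stays within max_tokens."""

    def count_tokens(text):
        # Assumption: whitespace-separated words approximate tokenizer tokens.
        return len(text.split())

    chunks, current, used = [], [], 0
    for frag in fragments:
        n = count_tokens(frag)
        # Flush the current chunk when adding this fragment would exceed the budget.
        if current and used + n > max_tokens:
            chunks.append("".join(current))
            current, used = [], 0
        # A single fragment larger than the budget still becomes its own chunk.
        current.append(frag)
        used += n
    if current:
        chunks.append("".join(current))
    return chunks


fragments = ["<p>one two</p>", "<p>three four</p>", "<p>five six seven</p>"]
merged = merge_under_budget(fragments, max_tokens=4)
```

With this toy input the first two paragraphs fit in one chunk and the third starts a new one. A real implementation would also need to split fragments that are larger than the budget on their own.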

For those who are interested, here's a demo:

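from html_chunking import get_html_chunks

# your_html_string_here is a placeholder for the HTML document as a string.
# max_tokens is the token limit; attr_cutoff_len is the length to which long attributes are truncated.
merged_chunks = get_html_chunks(your_html_string_here, max_tokens=1000, is_clean_html=True, attr_cutoff_len=25)
merged_chunks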

The output should consist of several HTML chunks. Each chunk contains valid HTML with its structure and attributes preserved (from the root node all the way down to the current node), and any excessively long attributes are truncated to the specified length.
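
To show what truncating long attributes can look like, here is a rough sketch using only the Python standard library. It is an illustration of the idea and not how html_chunking does it internally: `AttrTruncator` and `truncate_attrs` are hypothetical names, and this version drops comments and doctype declarations for brevity.

```python
from html import escape
from html.parser import HTMLParser


class AttrTruncator(HTMLParser):
    """Re-serialize HTML, cutting every attribute value to at most `cutoff` characters."""

    def __init__(self, cutoff):
        super().__init__(convert_charrefs=True)
        self.cutoff = cutoff
        self.parts = []

    def _open_tag(self, tag, attrs, closing=">"):
        rendered = [tag]
        for name, value in attrs:
            if value is None:
                # Boolean attribute such as "disabled" has no value to truncate.
                rendered.append(name)
            else:
                rendered.append(f'{name}="{escape(value[:self.cutoff], quote=True)}"')
        self.parts.append("<" + " ".join(rendered) + closing)

    def handle_starttag(self, tag, attrs):
        self._open_tag(tag, attrs)

    def handle_startendtag(self, tag, attrs):
        self._open_tag(tag, attrs, closing="/>")

    def handle_endtag(self, tag):
        self.parts.append(f"</{tag}>")

    def handle_data(self, data):
        # Text arrives unescaped, so escape it again when writing it back out.
        self.parts.append(escape(data, quote=False))


def truncate_attrs(html, cutoff):
    parser = AttrTruncator(cutoff)
    parser.feed(html)
    parser.close()  # flush any buffered text
    return "".join(parser.parts)


sample = '<div class="card"><a href="https://example.com/a/very/long/path?x=1">Go</a></div>'
result = truncate_attrs(sample, cutoff=10)
```

Here the short `class` value is kept as-is, while the long `href` is cut to its first 10 characters, so the tag structure survives but the chunk spends fewer tokens on attribute noise.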

Check out the html_chunking PyPI page and our GitHub page for more demos.

If you are investigating how to chunk HTML for web automation or other web agent tasks, give html_chunking a try.
LangChain (HTMLHeaderTextSplitter & HTMLSectionSplitter) and LlamaIndex (HTMLNodeParser) split text at the element level and add metadata for each header relevant to the chunk. However, they extract only the text content and exclude the HTML structure, attributes, and other non-text elements, which limits their use for tasks that require the full HTML context.
Check out our GitHub repo below and star it :star2: https://github.com/KLGR123/html_chunking
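
To illustrate why text-only extraction can be a problem for web agents, here is a small stand-in using the standard library. It is not the LangChain or LlamaIndex code; `TextOnly` is a hypothetical class that only mimics the general idea of keeping text nodes and dropping everything else.

```python
from html.parser import HTMLParser


class TextOnly(HTMLParser):
    """Collect only text nodes, discarding tags and attributes."""

    def __init__(self):
        super().__init__()
        self.texts = []

    def handle_data(self, data):
        # Keep non-empty text; tags and attributes never reach this handler.
        if data.strip():
            self.texts.append(data.strip())


html_doc = (
    '<form id="login"><input name="user" type="text">'
    '<button id="submit-btn">Sign in</button></form>'
)
extractor = TextOnly()
extractor.feed(html_doc)
extractor.close()
text_view = " ".join(extractor.texts)
```

The text view keeps the button label but loses the `id` and `name` attributes and the input element itself, which is exactly the information an agent needs to locate and act on those elements.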

Reason

No response

Value of Feature

No response

Hey, @KLGR123! I'm here to help you out with any bugs, questions, or even becoming a contributor. Let's tackle this issue together!

The get_html_chunks function in the html_chunking package offers a robust solution for chunking HTML content while preserving its structure and attributes. This is particularly useful for tasks requiring the full HTML context, such as web automation or navigation.

In contrast, LlamaIndex's HTMLNodeParser focuses on splitting text at the element level and adding metadata for each header relevant to the chunk. However, it primarily extracts text content and excludes the HTML structure, attributes, and other non-text elements, which may limit its use for tasks requiring the full HTML context ^[1]^[2].

For those interested in maintaining the full HTML context, the html_chunking package could be a valuable addition to your toolkit.
