Open KLGR123 opened 2 months ago
Hey, @KLGR123! I'm here to help you out with any bugs, questions, or even becoming a contributor. Let's tackle this issue together!
The get_html_chunks function in the html_chunking package offers a robust solution for chunking HTML content while preserving its structure and attributes. This is particularly useful for tasks requiring the full HTML context, such as web automation or navigation.
In contrast, LlamaIndex's HTMLNodeParser focuses on splitting text at the element level and adding metadata for each header relevant to the chunk. However, it primarily extracts text content and excludes the HTML structure, attributes, and other non-text elements, which may limit its use for tasks requiring the full HTML context [1][2].
For those interested in maintaining the full HTML context, the html_chunking package could be a valuable addition to your toolkit.
Feature Description
Hi everyone, check this super amazing HTML chunking package :package:
pip install html_chunking
Our HTML chunking algorithm works in several stages that split HTML content into chunks and then merge them while staying within a token limit. It is aimed at scenarios where the token limit is strict and the HTML must be parsed accurately, such as web automation or navigation tasks that take HTML content as input.
For those of you who are interested in this, here's a demo
The output should consist of several HTML chunks, where each chunk contains valid HTML code with preserved structure and attributes (from the root node all the way down to the current node), and any excessively long attributes are truncated to the specified length.
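To make the expected output concrete, here is a minimal standard-library sketch of the same idea. It is not the html_chunking implementation: the names `chunk_html`, `LeafCollector`, `render` and `count_tokens`, the regex-based token estimate, and the greedy grouping strategy are all assumptions made for illustration. Each chunk reopens the ancestor chain from the root down to the current node, and attribute values are cut to `attr_cutoff_len` characters.

```python
import html
import re
from html.parser import HTMLParser

# Elements that never have a closing tag; treated as leaves here.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


def count_tokens(text):
    """Crude stand-in for a real tokenizer: count words and punctuation marks."""
    return len(re.findall(r"\w+|[^\w\s]", text))


class LeafCollector(HTMLParser):
    """Record every text node (and void element) with its chain of open ancestors."""

    def __init__(self, attr_cutoff_len):
        super().__init__()
        self.attr_cutoff_len = attr_cutoff_len
        self.stack = []    # open elements as (node_id, tag, opening_tag_string)
        self.leaves = []   # (ancestor_path, already_rendered_content)
        self.next_id = 0

    def handle_starttag(self, tag, attrs):
        parts = [tag]
        for name, value in attrs:
            # Truncate long attribute values, then re-escape them for output.
            value = html.escape((value or "")[: self.attr_cutoff_len], quote=True)
            parts.append(f'{name}="{value}"')
        opening = "<" + " ".join(parts) + ">"
        if tag in VOID_TAGS:
            self.leaves.append((tuple(self.stack), opening))
            return
        self.stack.append((self.next_id, tag, opening))
        self.next_id += 1

    def handle_endtag(self, tag):
        # Close the nearest matching open element (tolerates sloppy markup).
        for i in range(len(self.stack) - 1, -1, -1):
            if self.stack[i][1] == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.leaves.append((tuple(self.stack), html.escape(text, quote=False)))


def render(leaves):
    """Serialize a run of leaves as one well-formed fragment rooted at the top element."""
    out, open_path = [], ()
    for path, content in leaves:
        common = 0
        while (common < min(len(open_path), len(path))
               and open_path[common][0] == path[common][0]):
            common += 1
        # Close elements we have left, then open the ones we have entered.
        out.extend(f"</{tag}>" for _, tag, _ in reversed(open_path[common:]))
        out.extend(opening for _, _, opening in path[common:])
        out.append(content)
        open_path = path
    out.extend(f"</{tag}>" for _, tag, _ in reversed(open_path))
    return "".join(out)


def chunk_html(source, max_tokens=500, attr_cutoff_len=25):
    """Greedily group consecutive leaves so each rendered chunk fits max_tokens."""
    collector = LeafCollector(attr_cutoff_len)
    collector.feed(source)
    collector.close()
    chunks, group = [], []
    for leaf in collector.leaves:
        if group and count_tokens(render(group + [leaf])) > max_tokens:
            chunks.append(render(group))
            group = []
        group.append(leaf)  # a single oversized leaf still becomes its own chunk
    if group:
        chunks.append(render(group))
    return chunks
```

Calling `chunk_html(page_source, max_tokens=60, attr_cutoff_len=10)` returns a list of strings, each starting at the root element. The sketch re-renders the current group for every leaf, which is simple but quadratic, and it drops comments and empty elements, so treat it as an illustration of the output shape rather than a replacement for the package.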
Check out the html_chunking PyPI page and our GitHub page for more demo examples!
For those who are investigating the BEST way of chunking HTML for web automation or any other web agent tasks, you should definitely try html_chunking!!
LangChain (HTMLHeaderTextSplitter & HTMLSectionSplitter) and LlamaIndex (HTMLNodeParser) split text at the element level and add metadata for each header relevant to the chunk. However, they extract only the text content and exclude the HTML structure, attributes, and other non-text elements, limiting their use for tasks requiring the full HTML context.
Check out our GitHub repo below and give it a star :star2: https://github.com/KLGR123/html_chunking
Reason
No response
Value of Feature
No response