run-llama / llama_index

LlamaIndex is a data framework for your LLM applications
https://docs.llamaindex.ai
MIT License
36.95k stars, 5.3k forks

[Feature Request]: Better HTML Chunking #15974

Open KLGR123 opened 2 months ago

KLGR123 commented 2 months ago

Feature Description

Hi everyone, check out this HTML chunking package. Install it with: pip install html_chunking

See the html_chunking PyPI page and our GitHub page for more example demos.

Reason

No response

Value of Feature

No response

dosubot[bot] commented 2 months ago

Hey, @KLGR123! I'm here to help you out with any bugs, questions, or even becoming a contributor. Let's tackle this issue together!

The get_html_chunks function in the html_chunking package offers a robust solution for chunking HTML content while preserving its structure and attributes. This is particularly useful for tasks requiring the full HTML context, such as web automation or navigation.

In contrast, LlamaIndex's HTMLNodeParser focuses on splitting text at the element level and adding metadata for each header relevant to the chunk. However, it primarily extracts text content and excludes the HTML structure, attributes, and other non-text elements, which may limit its use for tasks requiring the full HTML context [1][2].
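To make the difference concrete, here is a rough sketch using only Python's standard-library html.parser. It is not the implementation of either html_chunking or HTMLNodeParser; the sample page, the ElementSplitter class, and the split_elements helper are all made up for illustration. It splits a page at its top-level elements and returns, for each one, both the raw markup (tags and attributes kept) and the text-only view (structure dropped).

```python
from html.parser import HTMLParser

# Hypothetical sample page, not taken from either library's documentation.
SAMPLE_HTML = (
    '<div id="nav"><a href="/login" class="btn">Log in</a></div>'
    '<p class="intro">Welcome to the site.</p>'
)


class ElementSplitter(HTMLParser):
    """Split a document at its top-level elements.

    Assumes well-formed markup with no void elements such as <br>,
    since those have no closing tag and would unbalance the depth counter.
    """

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.depth = 0
        self.markup_parts = []
        self.text_parts = []
        self.chunks = []  # list of (markup, text) tuples

    def handle_starttag(self, tag, attrs):
        self.depth += 1
        # get_starttag_text() returns the tag exactly as written,
        # so attributes such as id, class, and href are preserved.
        self.markup_parts.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        self.markup_parts.append(f"</{tag}>")
        self.depth -= 1
        if self.depth == 0:
            # A top-level element just closed: emit one chunk.
            markup = "".join(self.markup_parts)
            text = " ".join(self.text_parts)
            self.chunks.append((markup, text))
            self.markup_parts, self.text_parts = [], []

    def handle_data(self, data):
        if self.depth > 0:
            self.markup_parts.append(data)
            if data.strip():
                self.text_parts.append(data.strip())


def split_elements(html):
    """Return a list of (markup, text) pairs, one per top-level element."""
    splitter = ElementSplitter()
    splitter.feed(html)
    splitter.close()
    return splitter.chunks


for markup, text in split_elements(SAMPLE_HTML):
    print("with structure:", markup)
    print("text only:     ", text)
```

In the text-only view, the link target /login and the btn class are gone, which is the kind of information a web automation or navigation task would need. The markup view keeps them, at the cost of longer chunks.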

For those interested in maintaining the full HTML context, the html_chunking package could be a valuable addition to your toolkit.
