Open drale2k opened 1 year ago
@drale2k It depends on how you'd like to chunk the document. For example, does putting an <h1>Article Title</h1> into a separate chunk on its own make sense?
I think HTML documents should be chunked just like any other document: extract all text from <body> and chunk based on chunk size, rather than using HTML tags as delimiters.
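To make the "chunk based on chunk size" idea concrete, here is a minimal sketch in plain Ruby. It assumes the text has already been extracted from the page; `chunk_text` and its `chunk_size:` parameter are hypothetical names for illustration, not part of any existing library API.

```ruby
# Hypothetical helper: split plain text into chunks of at most `chunk_size`
# characters, breaking on whitespace so words are not cut in half.
# A single word longer than `chunk_size` ends up in its own oversized chunk.
def chunk_text(text, chunk_size: 1000)
  chunks = []
  current = +""

  text.split.each do |word|
    # Start a new chunk when adding the next word would exceed the limit.
    if !current.empty? && current.length + 1 + word.length > chunk_size
      chunks << current
      current = +""
    end
    current << " " unless current.empty?
    current << word
  end

  chunks << current unless current.empty?
  chunks
end

chunks = chunk_text("aaa bbb ccc ddd", chunk_size: 7)
```

Because the splitter only looks at whitespace, headings, paragraphs and list items are treated the same way, which is the behavior this comment argues for.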
The current implementation will only get you the text of headings and <p> tags. Everything outside those tags will be ignored.
The HTML processor is currently limited to extracting text from h1, h2, h3, h4, h5, h6 and p tags. Given an HTML page, this will miss a lot of content in div, article, list and table tags. Would it not be better to just extract all text by passing the contents of <body> to Nokogiri, like Nokogiri::HTML(data).text?