patterns-ai-core / langchainrb

Build LLM-powered applications in Ruby
https://rubydoc.info/gems/langchainrb
MIT License

Don't limit HTML processor to certain tags #319

Open drale2k opened 1 year ago

drale2k commented 1 year ago

The HTML processor currently extracts text only from the h1, h2, h3, h4, h5, h6 and p tags. Given an HTML page, this misses a lot of content, such as the text inside div, article, list and table tags.

Would it not be better to just extract all of the text by passing the contents of <body> to Nokogiri, e.g. Nokogiri::HTML(data).text?

andreibondarev commented 1 year ago

@drale2k It does depend on how you'd like to chunk the document. For example, does putting an <h1>Article Title</h1> into a separate chunk on its own make sense?

drale2k commented 1 year ago

I think HTML documents should be chunked just like any other document: extract all text from <body> and chunk it based on chunk size, rather than using HTML tags as delimiters.
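As a rough illustration of size-based chunking, here is a small pure-Ruby sketch. chunk_text is a hypothetical helper written for this comment, not the chunker the gem actually uses, and it assumes chunk size is measured in characters and that breaking on whitespace is acceptable.

```ruby
# Hypothetical helper: pack whitespace-separated words into chunks of at most
# chunk_size characters. A single word longer than chunk_size is kept whole
# and becomes its own (oversized) chunk.
def chunk_text(text, chunk_size:)
  chunks = []
  current = +""

  text.split.each do |word|
    # Start a new chunk if adding this word (plus a space) would overflow.
    if !current.empty? && current.length + 1 + word.length > chunk_size
      chunks << current
      current = +""
    end
    current << " " unless current.empty?
    current << word
  end

  chunks << current unless current.empty?
  chunks
end

body_text = "Article Title Intro paragraph. Text inside a div."
chunks = chunk_text(body_text, chunk_size: 20)
# => ["Article Title Intro", "paragraph. Text", "inside a div."]
```

Here the heading text simply flows into the first chunk together with whatever follows it, instead of ending up in a chunk of its own.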

The current implementation only gets you the text inside heading and <p> tags. Everything outside those tags is ignored.