Integrate jina.ai Reader for search and website content extraction

gururise commented 4 months ago

The jina.ai Reader API supports web search and also returns the content of a webpage in an LLM-friendly format: https://jina.ai/reader

With this single tool, we would no longer need playwright for extracting data from websites, or serp.ai for search.
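
A minimal sketch of what calling the Reader could look like, assuming the URL-prefix endpoints described on the Reader page (`https://r.jina.ai/` in front of a page URL, `https://s.jina.ai/` in front of a search query). The helper names are hypothetical and not part of chat-ui, and the search endpoint may require an API key:

```typescript
// Hypothetical helpers; the endpoint prefixes are assumed from the jina.ai Reader page.
const READER_PREFIX = "https://r.jina.ai/";
const SEARCH_PREFIX = "https://s.jina.ai/";

// Build the Reader URL that returns an LLM-friendly version of a page.
function readerUrl(pageUrl: string): string {
  return READER_PREFIX + pageUrl;
}

// Build the Reader URL that runs a web search for a query.
function searchUrl(query: string): string {
  return SEARCH_PREFIX + encodeURIComponent(query);
}

// Fetch the extracted text of a page.
// Needs network access; `fetch` is a global in Node 18+.
async function readPage(pageUrl: string): Promise<string> {
  const res = await fetch(readerUrl(pageUrl));
  if (!res.ok) throw new Error(`Reader request failed: ${res.status}`);
  return res.text();
}
```

The two URL builders are pure functions, so they can be unit-tested without touching the network.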

nsarrazin commented 4 months ago

I think it would make sense to provide an abstraction layer around web extraction specifically. Playwright has been working great, but it requires extra installation steps, which has caused friction for some users.

We could support:

- Basic parsing of the returned HTML, like we used to do before
- Advanced parsing with playwright
- External parsers like jina.ai or other similar tools

This would mirror the way we already support multiple search results providers.
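
The layering described above could be sketched roughly as follows. Everything here is hypothetical (the `PageParser` interface, the parser names, and the `getParser` lookup are not existing chat-ui code), the jina.ai parser assumes the `https://r.jina.ai/` URL prefix, and a playwright-backed parser would implement the same interface but is omitted to keep the example dependency-free:

```typescript
// Hypothetical interface; none of these names come from chat-ui.
interface PageParser {
  name: string;
  // Returns the extracted text of the page at `url`.
  parse(url: string): Promise<string>;
}

// Crude stand-in for real HTML parsing: drop script/style blocks,
// strip the remaining tags, and collapse whitespace.
function stripHtml(html: string): string {
  return html
    .replace(/<(script|style)[\s\S]*?<\/\1>/gi, " ")
    .replace(/<[^>]+>/g, " ")
    .replace(/\s+/g, " ")
    .trim();
}

// Basic parser: fetch the HTML ourselves and strip it locally.
const basicParser: PageParser = {
  name: "basic",
  async parse(url) {
    const res = await fetch(url);
    return stripHtml(await res.text());
  },
};

// External parser: delegate extraction to the jina.ai Reader (assumed endpoint prefix).
const jinaParser: PageParser = {
  name: "jina",
  async parse(url) {
    const res = await fetch("https://r.jina.ai/" + url);
    return res.text();
  },
};

const parsers: Record<string, PageParser> = {
  basic: basicParser,
  jina: jinaParser,
};

// Pick a parser by configured name, falling back to the basic one.
function getParser(name: string | undefined): PageParser {
  return parsers[name ?? "basic"] ?? basicParser;
}
```

Falling back to the basic parser means a missing or misspelled setting still yields a working, install-free default.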

Will try to come back to this later unless someone feels comfortable tackling it, just let me know in that case :rocket:

krakenftw commented 3 months ago

Can I work on this?

huggingface / chat-ui

Integrate jina.ai Reader for search and website content extraction #1348