jina-ai / reader

Convert any URL to an LLM-friendly input with a simple prefix https://r.jina.ai/
https://jina.ai/reader
Apache License 2.0
6.7k stars 532 forks

[Feature Request] Multiple target selectors #143

Open MaSchVam opened 2 weeks ago

MaSchVam commented 2 weeks ago

Some websites present a unique challenge due to their dynamic nature. Sophisticated/advanced ones often have varying classes and elements across different pages with similar URL schemes, and those utilizing A/B testing can further complicate the process. This makes it difficult to consistently access elements using a single target selector.

Proposal: Add support for an array of target selectors in Reader. Users could supply multiple selectors, which Reader would try sequentially, stopping and returning the content as soon as one of them matches. If none of the selectors match, Reader could return a 422 AssertionFailureError, maintaining its current behavior.

Not sure how this would play nice with the X-Wait-For selector, but it would prevent scenarios where you currently have to fire off a handful of Reader requests until you hit the selector that happened to be there.
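For reference, the current workaround (one Reader request per selector until one matches) can be sketched roughly as below. This is only an illustration: `first_matching_selector` and the `fetch` callable are hypothetical names, not part of Reader, and treating any non-200 status (such as the 422 mentioned above) as "selector not found" is an assumption.

```python
from typing import Callable, Iterable, Optional, Tuple

# Hypothetical fetcher: takes (target_url, selector) and returns
# (http_status, body). How the request is actually sent is left to the caller.
Fetch = Callable[[str, str], Tuple[int, str]]


def first_matching_selector(
    target_url: str,
    selectors: Iterable[str],
    fetch: Fetch,
) -> Optional[Tuple[str, str]]:
    """Try each selector in order and return (selector, body) for the first hit.

    Assumption: a 200 status means the selector matched; anything else
    (for example a 422) is treated as a miss and the next selector is tried.
    """
    for selector in selectors:
        status, body = fetch(target_url, selector)
        if status == 200:
            return selector, body
    # No selector matched; the caller decides how to report this.
    return None
```

Each miss costs a full round trip, which is the overhead the proposal is trying to avoid.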

FreddyAngelo commented 2 weeks ago

+1 on this 🙌🏻

nomagick commented 2 days ago

The current implementation should already support multiple target selectors.

This can be done by passing multiple X-Target-Selector headers. Reader will then automatically wait for ALL of the selectors, as with X-Wait-For.

This behaves slightly differently from passing a single selector that contains commas. With multiple headers, each selector is waited for individually; a single selector containing commas, even if it has multiple matches, is waited for as one clause.
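A minimal client-side sketch of the two request shapes described above, using only the Python standard library. The helper names are hypothetical, and the sketch assumes the target URL is appended directly to the https://r.jina.ai/ prefix. `http.client` is used because it can emit the same header name more than once, which a plain dict of headers cannot express.

```python
import http.client
from typing import List, Tuple

Header = Tuple[str, str]


def separate_selector_headers(selectors: List[str]) -> List[Header]:
    """One X-Target-Selector header per selector (each waited for individually)."""
    return [("X-Target-Selector", s) for s in selectors]


def combined_selector_header(selectors: List[str]) -> List[Header]:
    """A single X-Target-Selector header holding one comma-separated selector."""
    return [("X-Target-Selector", ", ".join(selectors))]


def fetch_with_headers(target_url: str, headers: List[Header]) -> Tuple[int, str]:
    """Send a GET to Reader, repeating header names as given.

    Assumption: the request path is simply "/" followed by the target URL.
    """
    conn = http.client.HTTPSConnection("r.jina.ai", timeout=60)
    try:
        conn.putrequest("GET", "/" + target_url)
        for name, value in headers:
            # putheader may be called repeatedly with the same name;
            # each call writes its own header line.
            conn.putheader(name, value)
        conn.endheaders()
        response = conn.getresponse()
        return response.status, response.read().decode("utf-8", errors="replace")
    finally:
        conn.close()
```

For example, `fetch_with_headers("https://example.com", separate_selector_headers(["main", "#content"]))` sends two X-Target-Selector headers, while `combined_selector_header(["main", "#content"])` produces a single header with the value `main, #content`.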

Please also note that match-all selectors, e.g., *:not(...), are ignored to prevent performance problems.