jina-ai / reader

Convert any URL to an LLM-friendly input with a simple prefix https://r.jina.ai/
https://jina.ai/reader
Apache License 2.0

Only non-relevant page components returned #20

Open RamXX opened 6 months ago

RamXX commented 6 months ago

Fantastic project. Thank you!

Here is a page that (one would think) is straightforward to parse: https://access.redhat.com/security/cve/CVE-2023-45853 . However, none of the relevant information in the page makes it to the parsed version, only the corporate links and "scaffolding".

I figured I'd report it in case this can highlight some areas of improvement. Thanks again!

hanxiao commented 6 months ago

Thanks for reporting, will dig in.

hanxiao commented 6 months ago

Found the problem: this site doesn't even work in Chrome's "view source" mode (view-source:https://access.redhat.com/security/cve/CVE-2023-45853), because it requires JS to be running.

Using stream mode solves the problem:

curl -H "Accept: text/event-stream" -H 'x-no-cache: true' https://r.jina.ai/https://access.redhat.com/security/cve/CVE-2023-45853

pay attention to the last chunk in the event stream, it should give you:

image
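
For anyone scripting this rather than using curl, here is a minimal Python sketch of the same request. It sends the two headers from the curl command above and keeps only the last chunk of the event stream. The function names `last_event_data` and `fetch_last_chunk` are hypothetical, the parser assumes standard server-sent-events framing (`data:` lines, events separated by a blank line), and the format of the payload itself is not specified in this thread, so it is returned as raw text.

```python
import urllib.request
from typing import Iterable, Optional

# Prefix taken from the project description; everything else is illustrative.
READER_PREFIX = "https://r.jina.ai/"


def last_event_data(lines: Iterable[str]) -> Optional[str]:
    """Return the data payload of the final event in a text/event-stream.

    Assumes standard SSE framing: "data:" lines, events ended by a blank line.
    """
    last = None
    buffer = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line == "":
            # A blank line closes the current event.
            if buffer:
                last = "\n".join(buffer)
                buffer = []
        elif line.startswith("data:"):
            value = line[len("data:"):]
            # SSE strips one optional leading space after the colon.
            if value.startswith(" "):
                value = value[1:]
            buffer.append(value)
        # Other fields (event:, id:, comment lines) are ignored here.
    if buffer:
        # Stream ended without a trailing blank line.
        last = "\n".join(buffer)
    return last


def fetch_last_chunk(url: str, timeout: float = 60.0) -> Optional[str]:
    """Request `url` through the reader in stream mode; return the last chunk."""
    request = urllib.request.Request(
        READER_PREFIX + url,
        headers={"Accept": "text/event-stream", "x-no-cache": "true"},
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        # The response object iterates line by line, yielding bytes.
        decoded = (raw.decode("utf-8", errors="replace") for raw in response)
        return last_event_data(decoded)


# Example (performs a network request):
# print(fetch_last_chunk("https://access.redhat.com/security/cve/CVE-2023-45853"))
```

Keeping the parser separate from the network call makes it easy to check against a few sample lines before pointing it at a real page.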

Joelokon commented 6 months ago

Thanks 🙏

RamXX commented 6 months ago

Thanks a lot! Whenever I can't parse a site, I'll make a note to try this mechanism. I wonder if we should keep this issue open to make sure it gets into the documentation; otherwise we can just close it. Thanks again!