jina-ai / reader

Convert any URL to an LLM-friendly input with a simple prefix https://r.jina.ai/
https://jina.ai/reader
Apache License 2.0

Reader doesn't extract any content from this page even though it's quite simple? #105

Open oscar-o-oneill opened 3 months ago

oscar-o-oneill commented 3 months ago

Hi, I love reader! It's so useful. I am playing around with it, and I noticed it isn't able to extract any content from this URL.

https://www.canada.ca/en/women-gender-equality/gender-based-violence/gender-based-violence-glossary.html

On navigating to the reader page for it, I just get this response:

Title: 

URL Source: https://www.canada.ca/en/women-gender-equality/gender-based-violence/gender-based-violence-glossary.html

Markdown Content:

What's going on? It's a fairly simple page.
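
For anyone who wants to catch this case in a script, here is a minimal sketch that builds the Reader URL with the `https://r.jina.ai/` prefix and checks whether a response body looks like the empty one above. The function names are hypothetical, and the parsing assumes the plain-text layout shown in this report (`Title:`, `URL Source:`, `Markdown Content:`); it is not an official client.

```python
READER_PREFIX = "https://r.jina.ai/"


def reader_url(target: str) -> str:
    """Prefix a target URL with the Reader endpoint."""
    return READER_PREFIX + target


def parse_reader_text(body: str) -> dict:
    """Split a Reader text response into title, source URL and content.

    Assumes the layout shown in this issue; other response formats
    are not handled.
    """
    fields = {"title": "", "url": "", "content": ""}
    head, sep, content = body.partition("Markdown Content:")
    for line in head.splitlines():
        if line.startswith("Title:"):
            fields["title"] = line[len("Title:"):].strip()
        elif line.startswith("URL Source:"):
            fields["url"] = line[len("URL Source:"):].strip()
    # Only trust the content part if the marker was actually present.
    fields["content"] = content.strip() if sep else ""
    return fields


def is_empty_extraction(body: str) -> bool:
    """True when both the title and the markdown content are blank."""
    parsed = parse_reader_text(body)
    return not parsed["title"] and not parsed["content"]
```

Fetching `reader_url(...)` with any HTTP client and passing the body to `is_empty_extraction` would flag pages like this one.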

hanxiao commented 3 months ago

okay this is weird, i get the same empty result; however if i use pageshot mode it does return the full webpage

could u look at it? @nomagick
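
A rough sketch of the comparison described here, for anyone who wants to reproduce it: one request in the default mode and one in pageshot mode. It assumes the output mode is selected with an `X-Respond-With` request header, which is not confirmed anywhere in this thread, and the helper names are hypothetical.

```python
import urllib.request

READER_PREFIX = "https://r.jina.ai/"


def build_reader_request(target, respond_with=None):
    """Build a Reader request, optionally asking for another output mode.

    The header name "X-Respond-With" is an assumption; check the
    Reader documentation before relying on it.
    """
    headers = {}
    if respond_with:
        headers["X-Respond-With"] = respond_with
    return urllib.request.Request(READER_PREFIX + target, headers=headers)


def fetch(request, timeout_s=60):
    """Send the request and return the raw response bytes."""
    with urllib.request.urlopen(request, timeout=timeout_s) as resp:
        return resp.read()


if __name__ == "__main__":
    page = ("https://www.canada.ca/en/women-gender-equality/"
            "gender-based-violence/gender-based-violence-glossary.html")
    default_body = fetch(build_reader_request(page))
    pageshot_body = fetch(build_reader_request(page, respond_with="pageshot"))
    print(len(default_body), len(pageshot_body))
```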


oscar-o-oneill commented 3 months ago

Thanks, @hanxiao. Just wanted to bring this to your attention! I will keep following the thread and help out if I can.

mapleeit commented 3 months ago

Hi @oscar-o-oneill, did you have the same issue on other pages?

It seems something on this specific webpage makes the browser treat the page as not fully loaded until it hits the timeout, which is 30s by default in this case. I'm still trying to identify what on the page causes this.

It would be helpful if you have more bad cases, so that I can find the common pattern.

oscar-o-oneill commented 3 months ago

Hi @mapleeit, no, I have not found this issue on many other pages. Reader usually works really well!

I will definitely report any issues I may find with other web pages in the future.

Thank you for making Jina AI Reader.

nomagick commented 3 months ago

It looks like some kind of bot-prevention mechanism from the "edgesuite". It seems to be replacing the DOM contents in a fraction and making Reader capture its warning messages.