jina-ai / reader

Convert any URL to an LLM-friendly input with a simple prefix https://r.jina.ai/
https://jina.ai/reader
Apache License 2.0

Inconsistent results without specifying timeouts #109

Closed mobob closed 2 months ago

mobob commented 2 months ago

This might be specific to this site (http://www DOT lafiestalatina DOT ca/) and to text mode, but I get wildly inconsistent results when I don't specify a timeout. When I do, and it's a large one, the results are pretty reliable.

For example:

curl -H "Authorization: Bearer <xxx>" -H "X-Return-Format: text" -H "X-No-Cache: true" https://r.jina.ai/http://www.lafiestalatina.ca/ > scratch/latina-notimeout5.txt

curl -H "Authorization: Bearer <xxx>" -H "X-Return-Format: text" -H "X-No-Cache: true" -H "X-Timeout: 30" https://r.jina.ai/http://www.lafiestalatina.ca/ > scratch/latina-30sto3.txt
(base) ➜  ls -l scratch/lat* | awk '{print $5, $9}'
5762 scratch/latina-30sto1.txt
5762 scratch/latina-30sto2.txt
5762 scratch/latina-30sto3.txt
40 scratch/latina-notimeout1.txt
40 scratch/latina-notimeout2.txt
40 scratch/latina-notimeout3.txt
40 scratch/latina-notimeout4.txt
40 scratch/latina-notimeout5.txt
40 scratch/latina1.txt
40 scratch/latina2.txt
5762 scratch/latina3.txt
4782 scratch/latina4.txt
40 scratch/latina5.txt
5762 scratch/latina6.txt
243 scratch/latina7.txt

The files with no suffix (latina1.txt through latina7.txt) were also fetched without a timeout.

I scanned the code and nothing jumped out. Suffice it to say, I'm specifying a timeout going forward, but let me know if I'm misusing the API or if there is something wrong with what I'm doing! I couldn't find any reference to a "default timeout, after which we return the data collected so far" behavior.

nomagick commented 2 months ago

This happens because it's hard to determine when a page has really finished loading on modern websites.

When no timeout is explicitly specified, Reader tries to return ASAP. As soon as the page loads and appears to contain something useful, Reader returns right away. In many cases this captures the main content correctly while also minimizing delay. However, depending on how the website is implemented, this strategy might not always succeed.

For example, a website might first load some content, such as the title and description, before it continues to load the full details. In that scenario, Reader might return only the first batch and miss the details.
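
A client can guard against this kind of partial capture by checking the size of a saved response. The following is only a rough sketch: `is_probably_partial` is a hypothetical helper, not part of Reader, and the 1000-byte default is an arbitrary assumption that needs tuning per site.

```shell
# Hypothetical helper: succeed (exit 0) when a saved Reader response looks
# suspiciously small. The 1000-byte default is an assumption, not a Reader rule.
is_probably_partial() {
  file="$1"
  min_bytes="${2:-1000}"
  # Arithmetic expansion strips the padding some wc implementations emit.
  size=$(( $(wc -c < "$file") ))
  [ "$size" -lt "$min_bytes" ]
}
```

With the file sizes listed earlier in this thread, the 40-byte and 243-byte responses would be flagged at this threshold, while the 4782-byte and 5762-byte ones would pass. A flagged response could then be fetched again with an explicit X-Timeout header.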

When the user explicitly specifies a timeout, the strategy is a little different. Reader waits for "networkidle0" (in Puppeteer terms, no network connections for at least 500 ms) instead of eagerly trying to return.
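
From the client side, the difference between the two strategies comes down to whether the X-Timeout header is sent. Below is a minimal sketch of a wrapper around the curl commands from the original report. `reader_fetch` and the `READER_TOKEN` environment variable are hypothetical names chosen for illustration; the headers are the ones already used in this thread.

```shell
# Hypothetical wrapper: fetch a URL through Reader in text mode, bypassing the cache.
# Usage: reader_fetch URL [TIMEOUT_SECONDS]
# READER_TOKEN is an assumed environment variable holding the API key.
reader_fetch() {
  url="$1"
  timeout="${2:-}"   # empty means no X-Timeout header, so Reader returns ASAP
  set -- -sS \
    -H "Authorization: Bearer ${READER_TOKEN:-}" \
    -H "X-Return-Format: text" \
    -H "X-No-Cache: true"
  # Send X-Timeout only when the caller asked for one.
  if [ -n "$timeout" ]; then
    set -- "$@" -H "X-Timeout: $timeout"
  fi
  curl "$@" "https://r.jina.ai/$url"
}
```

Calling `reader_fetch http://www.lafiestalatina.ca/ 30 > scratch/latina-30s.txt` would correspond to the second command in the report, and omitting the `30` would correspond to the first.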