jina-ai / reader

Convert any URL to an LLM-friendly input with a simple prefix https://r.jina.ai/
https://jina.ai/reader
Apache License 2.0
7.02k stars 554 forks

Incorrect loading of iframes + incorrect parse / converting tables #150

Open ntrippar opened 3 weeks ago

ntrippar commented 3 weeks ago

When parsing a page such as a Medium post, the embedded gist code does not show up. I believe this is because the iframes are not loaded immediately: the browser has to scroll through the page before it renders them. One solution could be to detect every iframe on the page and scroll to each one so that it loads.


curl 'https://r.jina.ai/https://edoconti.medium.com/offline-policy-evaluation-run-fewer-better-a-b-tests-60ce8f93fa15' \
    -H "Authorization: Bearer TOKEN" \
    -H "X-No-Cache: true" \
    -H "X-Timeout: 60" \
    -H "X-With-Iframe: true"
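
The scroll-to-each-iframe idea could be sketched roughly as below. This is only an illustration of the suggestion, not how Reader works internally; the function name scrollToEachIframe is hypothetical, and it assumes the script runs inside the rendered page, where the DOM is available.

```javascript
// Hypothetical helper: scroll every iframe on the page into view so that
// lazy-loaded embeds get a chance to start loading.
// `doc` is expected to be the page's `document` (or a compatible object).
function scrollToEachIframe(doc) {
  const frames = Array.from(doc.querySelectorAll("iframe"));
  for (const frame of frames) {
    // Element.scrollIntoView() is a standard DOM method.
    frame.scrollIntoView();
  }
  // Return how many iframes were visited, which is handy for logging.
  return frames.length;
}

// In a browser this would be called as: scrollToEachIframe(document);
```

Iframes that are only added to the DOM after this runs would be missed, so a real implementation would probably need to repeat the pass or wait between scrolls.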

The parser itself also has a problem with the gist code. For example, for the URL above, the iframe of one of the gists is https://edoconti.medium.com/media/f10f007fac2ec7a4c0662ac12428a7fe

curl 'https://r.jina.ai/https://edoconti.medium.com/media/f10f007fac2ec7a4c0662ac12428a7fe' \
    -H "Authorization: Bearer TOKEN" \
    -H "X-No-Cache: true" \
    -H "X-Timeout: 60" \
    -H "X-With-Iframe: true"

It is parsed incorrectly, and the output contains raw HTML tags:


Title: sample-push-notification-policy.py – Medium

URL Source: https://edoconti.medium.com/media/f10f007fac2ec7a4c0662ac12428a7fe

Markdown Content:
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters. [Learn more about bidirectional Unicode characters](https://github.co/hiddenchars)

[Show hidden characters](https://edoconti.medium.com/media/%7B%7BrevealButtonHref%7D%7D)

<table data-hpc="" data-tab-size="8" data-paste-markdown-skip="" data-tagsearch-lang="Python" data-tagsearch-path="sample-push-notification-policy.py"><tbody><tr><td id="file-sample-push-notification-policy-py-L1" data-line-number="1"></td><td id="file-sample-push-notification-policy-py-LC1"><span>def</span> <span>get_push_send_probabilities</span>(<span>context</span>):</td></tr><tr><td id="file-sample-push-notification-policy-py-L2" data-line-number="2"></td><td id="file-sample-push-notification-policy-py-LC2"><span>epsilon</span> <span>=</span> <span>0.10</span></td></tr><tr><td id="file-sample-push-notification-policy-py-L3" data-line-number="3"></td><td id="file-sample-push-notification-policy-py-LC3"></td></tr><tr><td id="file-sample-push-notification-policy-py-L4" data-line-number="4"></td><td id="file-sample-push-notification-policy-py-LC4"><span>if</span> <span>context</span>[<span>"days_since_app_open"</span>] <span>&gt;</span> <span>1</span>:</td></tr><tr><td id="file-sample-push-notification-policy-py-L5" data-line-number="5"></td><td id="file-sample-push-notification-policy-py-LC5"><span>return</span> {<span>"send"</span>: <span>1</span> <span>-</span> <span>epsilon</span>, <span>"dont_send"</span>: <span>epsilon</span>}</td></tr><tr><td id="file-sample-push-notification-policy-py-L6" data-line-number="6"></td><td id="file-sample-push-notification-policy-py-LC6"></td></tr><tr><td id="file-sample-push-notification-policy-py-L7" data-line-number="7"></td><td id="file-sample-push-notification-policy-py-LC7"><span>return</span> {<span>"send"</span>: <span>epsilon</span>, <span>"dont_send"</span>: <span>1</span> <span>-</span> <span>epsilon</span>}</td></tr></tbody></table>

nomagick commented 1 week ago

Resources on this page are lazy-loaded. In addition, the gist iframes use a table for layout, so their content cannot be converted into a code block or a typical markdown table.

We have introduced a script injection mechanism to our API. Inside the page, we also provide these utility functions and event:

- waitForSelector(selector: string): Promise<HTMLElement>
  waits for the selector to appear in the DOM
- simulateScroll(): void
  simulates scrolling to the bottom of the page to trigger lazy-loaded elements
- "mutationIdle" event on document
  fires when the DOM has had no mutations for 200 ms
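
As a rough illustration of how these helpers might be combined in an injected script, here is a minimal sketch. The function name installLazyLoadTrigger and its arguments are hypothetical, and it assumes that waitForSelector is exposed on window in the same way as simulateScroll.

```javascript
// Hypothetical wrapper around the helpers listed above.
// `doc` and `win` are expected to be the page's `document` and `window`.
function installLazyLoadTrigger(doc, win, selector) {
  // Whenever the DOM settles, scroll again so that further
  // lazy-loaded content gets triggered.
  doc.addEventListener("mutationIdle", () => win.simulateScroll());

  // Resolves once an element matching `selector` appears in the DOM
  // (assumes waitForSelector is available on `win`).
  return win.waitForSelector(selector);
}

// Inside the page this would be invoked as:
// installLazyLoadTrigger(document, window, "iframe");
```

Taking document and window as arguments is only there so the sketch can be tried outside a browser with stand-in objects.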

For the gist formatting, we introduced an x-with-iframe: quoted parameter that injects iframe contents as blockquote sections.

So, in the end, you can use this request for your URL:

curl 'https://r.jina.ai/https://edoconti.medium.com/offline-policy-evaluation-run-fewer-better-a-b-tests-60ce8f93fa15' \
  -H 'x-with-iframe: quoted' \
  -H 'x-timeout: 60' \
  --data-urlencode 'injectPageScript=document.addEventListener("mutationIdle", window.simulateScroll);'

The new parameters are yet to be documented but are already usable.
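
For anyone calling the API from JavaScript rather than curl, the request above could be built roughly like this. It is a sketch only: buildReaderRequest is a hypothetical helper, and it assumes the endpoint accepts the same form-encoded POST body that curl's --data-urlencode produces.

```javascript
// Hypothetical helper that mirrors the curl command above.
// Returns the URL and the options object to pass to fetch().
function buildReaderRequest(targetUrl, pageScript) {
  return {
    url: "https://r.jina.ai/" + targetUrl,
    options: {
      // curl switches to POST when --data-urlencode is used.
      method: "POST",
      headers: {
        "x-with-iframe": "quoted",
        "x-timeout": "60",
        "content-type": "application/x-www-form-urlencoded",
      },
      // URLSearchParams takes care of the URL encoding of the script.
      body: new URLSearchParams({ injectPageScript: pageScript }).toString(),
    },
  };
}

// Usage (inside an async function, in an environment that provides fetch):
// const { url, options } = buildReaderRequest(
//   "https://edoconti.medium.com/offline-policy-evaluation-run-fewer-better-a-b-tests-60ce8f93fa15",
//   'document.addEventListener("mutationIdle", window.simulateScroll);'
// );
// const response = await fetch(url, options);
// console.log(await response.text());
```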