jina-ai / reader

Convert any URL to an LLM-friendly input with a simple prefix https://r.jina.ai/
https://jina.ai/reader
Apache License 2.0

Feature request: x-scroll header #151

Open mb21 opened 1 week ago

mb21 commented 1 week ago

We already have the x-timeout header, which works for a lot of JavaScript-heavy websites. But some websites lazy-load certain content only when you scroll down a bit.

Therefore, I propose an x-scroll header, which would basically execute the following JavaScript after the page has finished loading:

window.scrollTo({
  top: document.body.scrollHeight,
  behavior: "smooth",
})

(I'm pretty sure that 'smooth' scrolling triggers any IntersectionObservers between the top and the bottom of the page.)

As soon as that's done and the event loop is empty, execute it again, repeating until either scrolling down no longer increases the page's height or x-timeout is reached.
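To make the proposed loop concrete, here is a rough sketch of how it could look. It is only an illustration of the idea, not how Reader implements anything: the `page` adapter, the `scrollUntilStable` and `browserPage` names, and the fixed 500 ms wait (standing in for "the event loop is empty") are all assumptions made for this example.

```javascript
// Sketch of the proposed loop: scroll to the bottom, wait for the page to
// settle, and repeat until the height stops growing or the timeout is hit.
// `page` is a hypothetical adapter so the loop can also run outside a browser.
async function scrollUntilStable(page, timeoutMs, now = Date.now) {
  const deadline = now() + timeoutMs;
  let previousHeight = -1;
  let rounds = 0;

  while (now() < deadline) {
    const height = page.getHeight();
    if (height === previousHeight) break; // the last scroll loaded nothing new
    previousHeight = height;

    page.scrollToBottom();
    await page.waitForIdle(); // give lazy-loaded content a chance to arrive
    rounds += 1;
  }
  return { rounds, height: page.getHeight() };
}

// Browser adapter. Assumption: a fixed delay approximates "event loop is empty".
const browserPage = () => ({
  getHeight: () => document.body.scrollHeight,
  scrollToBottom: () =>
    window.scrollTo({ top: document.body.scrollHeight, behavior: "smooth" }),
  waitForIdle: () => new Promise((resolve) => setTimeout(resolve, 500)),
});
```

In a page context this would be called as `scrollUntilStable(browserPage(), 10000)`, where the second argument plays the role of x-timeout in milliseconds.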

nomagick commented 3 days ago

We have introduced a script injection mechanism to our API. Inside the page, we also provide these utility functions and this event:

- waitForSelector(selector: string): Promise<HTMLElement> 
  waits for the selector to appear in the DOM
- simulateScroll(): void 
  simulates scrolling to the bottom of the page to trigger lazyload elements
- "mutationIdle" event on document
  fires when there have been no DOM mutations for 200 ms

See https://github.com/jina-ai/reader/issues/150 for an example.
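As a rough sketch of how these utilities might be combined in an injected script: the helper below scrolls each time the DOM goes idle and stops after a fixed number of scrolls so it cannot loop forever. The `installScrollOnIdle` name and the `maxScrolls` cap are hypothetical, and it assumes `simulateScroll` is exposed on `window`.

```javascript
// Sketch: scroll whenever the page reports "mutationIdle", up to a cap.
// Assumption: `win.simulateScroll` is the utility described above.
function installScrollOnIdle(doc, win, maxScrolls = 10) {
  let scrolls = 0;

  const onIdle = () => {
    if (scrolls >= maxScrolls) {
      doc.removeEventListener("mutationIdle", onIdle); // stop after the cap
      return;
    }
    scrolls += 1;
    win.simulateScroll();
  };

  doc.addEventListener("mutationIdle", onIdle);
  return () => scrolls; // lets the caller inspect how many scrolls ran
}

// Inside an injected page script one would call:
//   installScrollOnIdle(document, window);
```

The cap is a design choice for pages with infinite feeds, where every scroll triggers more mutations and the idle event would otherwise keep firing until the timeout.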

mb21 commented 2 days ago

Thanks! It seems that curl ... --data-urlencode 'injectPageScript=document.addEventListener("mutationIdle", window.simulateScroll);' should indeed work for this. I'll give it a try. Feel free to close this issue then.