jina-ai / reader

Convert any URL to an LLM-friendly input with a simple prefix https://r.jina.ai/
https://jina.ai/reader
Apache License 2.0
6.75k stars 538 forks

When a page contains iframe tags, the extracted content is the content of the iframes. Is there a way to extract the expected content that is not inside the iframe tags? #59

Open fu1996 opened 5 months ago

fu1996 commented 5 months ago

url: https://r.jina.ai/https://new.qq.com/rain/a/20230723A067YG00

nomagick commented 5 months ago

This isn't about iframes; it's about return timing. Our default return timing didn't work on this page.

To properly crawl this kind of webpage, you need to know about its structure. For this particular case, leverage our new x-target-selector header:

curl https://r.jina.ai/https://new.qq.com/rain/a/20230723A067YG00 -H 'x-target-selector: .content-article'
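
For anyone calling the reader from a script rather than from curl, the same request can be sketched in Python using only the standard library. This is a minimal sketch, not an official client: the helper names `build_reader_request` and `fetch_reader_text` and the 30-second timeout are assumptions made for illustration. Only the `https://r.jina.ai/` prefix and the `x-target-selector` header come from this thread.

```python
import urllib.request
from typing import Optional

# Prefix from the project description: prepend it to the page URL.
READER_PREFIX = "https://r.jina.ai/"


def build_reader_request(
    page_url: str, target_selector: Optional[str] = None
) -> urllib.request.Request:
    """Build (but do not send) a reader request for page_url.

    Hypothetical helper: if target_selector is given, it is passed in the
    x-target-selector header, as in the curl command above.
    """
    headers = {}
    if target_selector:
        headers["x-target-selector"] = target_selector
    return urllib.request.Request(READER_PREFIX + page_url, headers=headers)


def fetch_reader_text(
    page_url: str, target_selector: Optional[str] = None, timeout: float = 30.0
) -> str:
    """Send the request and return the response body as text.

    Hypothetical helper; the timeout value is an arbitrary assumption.
    """
    request = build_reader_request(page_url, target_selector)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")
```

Calling `fetch_reader_text("https://new.qq.com/rain/a/20230723A067YG00", ".content-article")` should mirror the curl command above. The selector itself still has to be chosen by inspecting the structure of the target page.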