jina-ai / reader

Convert any URL to an LLM-friendly input with a simple prefix https://r.jina.ai/
https://jina.ai/reader
Apache License 2.0

ResearchGate PDF links return empty content most of the time #106

Closed phunterlau closed 2 months ago

phunterlau commented 2 months ago

It would be very helpful if ResearchGate PDFs could be accessed through the Reader API. Thank you. Example link:

Research gate: https://www.researchgate.net/profile/Luca-Mertens-2/publication/382994225_Model-Based_Reinforcement_Learning_Approaches_in_the_Low-Data-Regime/links/66b6284e51aa0775f2779ac0/Model-Based-Reinforcement-Learning-Approaches-in-the-Low-Data-Regime.pdf

phunterlau commented 2 months ago

The same happens with free IEEE PDFs, for example https://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=9275593 Thanks.

mapleeit commented 2 months ago

Hi @phunterlau

The ResearchGate URL works for me. What result do you get on your side?

For https://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=9275593, you can use POST mode. Screenshot shown below.
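As a rough sketch, POST mode could look like this from the command line. The helper names `reader_post_body` and `reader_post` are hypothetical; the endpoint and the `url` field follow the POST example given later in this thread. The sketch assumes the URL contains no double quotes or backslashes, since it does no JSON escaping.

```shell
# Hypothetical helper: build the JSON body for a Reader POST request.
# Assumes the URL needs no JSON escaping (no double quotes or backslashes).
reader_post_body() {
  # $1 = URL of the page or PDF to read
  printf '{"url": "%s"}' "$1"
}

# Hypothetical helper: send the body to the Reader endpoint and print the response.
reader_post() {
  # $1 = URL of the page or PDF to read
  curl --silent --location 'https://r.jina.ai' \
    --header 'Content-Type: application/json' \
    --data "$(reader_post_body "$1")"
}
```

For example: `reader_post 'https://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=9275593'`.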

[Screenshot: 2024-08-19 at 16 24 40]

mapleeit commented 2 months ago

One suggestion for this page: https://www.researchgate.net/publication/382994225_Model-Based_Reinforcement_Learning_Approaches_in_the_Low-Data-Regime

If you only want the PDF content, you can enable the Target Selector option and set its value to #pdf-html-reader to exclude ads and other irrelevant content. This selector is specific to this website; other sites need a different value.
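The same option can probably be set outside the web UI. The sketch below assumes the Target Selector option maps to an `X-Target-Selector` request header, by analogy with the `X-Referer` and `X-No-Cache` headers used later in this thread; please check the Reader documentation before relying on it. The helper names are hypothetical.

```shell
# Hypothetical helper: format the header line for a Target Selector value.
# Assumption: the UI option corresponds to the X-Target-Selector header.
target_selector_header() {
  # $1 = CSS selector, e.g. '#pdf-html-reader'
  printf 'X-Target-Selector: %s' "$1"
}

# Hypothetical helper: fetch a page through Reader, keeping only the selected element.
reader_get_selected() {
  # $1 = page URL, $2 = CSS selector
  curl --silent --location "https://r.jina.ai/$1" \
    --header "$(target_selector_header "$2")"
}
```

For example: `reader_get_selected 'https://www.researchgate.net/publication/382994225_Model-Based_Reinforcement_Learning_Approaches_in_the_Low-Data-Regime' '#pdf-html-reader'`.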

[Screenshot: 2024-08-19 at 17 28 18]

phunterlau commented 2 months ago

Big thanks. The researchgate link shows either empty content or "verify you are a human" content.

mapleeit commented 2 months ago

The researchgate link shows either empty content or "verify you are a human" content.

If the website has an anti-crawler mechanism, then unfortunately there is nothing we can do. :(
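A caller can at least detect this case before passing the output on. The sketch below is not part of Reader: the hypothetical `looks_blocked` function simply treats an empty body, or a body containing the phrase reported above, as a blocked fetch. Other sites may word their challenge pages differently.

```shell
# Hypothetical check: succeed (exit 0) when a response looks blocked.
# "Blocked" here means an empty body or the challenge phrase reported in this thread.
looks_blocked() {
  # $1 = response body text returned by Reader
  if [ -z "$1" ]; then
    return 0
  fi
  # Case-insensitive match; the exit status of grep becomes the function's status.
  printf '%s' "$1" | grep -qi 'verify you are a human'
}
```

For example: `body=$(curl --silent "https://r.jina.ai/$url"); looks_blocked "$body" && echo 'blocked or empty'`.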

phunterlau commented 2 months ago

Thanks. Does Reader support a parameter for passing a referer? Some sites check the Referer header, as in the following example. If we remove -e, the site returns an error page.

curl -L -e "https://scholar.google.com/" \
"https://www.academia.edu/download/51627580/A_Generalized_Reinforcement-Learning_Mod20170203-31871-44ae37.pdf?hl=en&sa=T&oi=ggp&ct=res&cd=13&d=7556059458401941712&ei=IcXCZpzRGPCz6rQP5oSLmQM" \
-H "User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36" \
-o "A_Generalized_Reinforcement-Learning_Model.pdf"

mapleeit commented 2 months ago

It's supported now.

You can either pass the referer in the POST request body:

curl --location 'https://r.jina.ai' \
--header 'Content-Type: application/json' \
--data '{
    "url": "https://www.academia.edu/download/51627580/A_Generalized_Reinforcement-Learning_Mod20170203-31871-44ae37.pdf?hl=en&sa=T&oi=ggp&ct=res&cd=13&d=7556059458401941712&ei=IcXCZpzRGPCz6rQP5oSLmQM",
    "referer": "https://scholar.google.com/",
    "noCache": true
}'

or pass it in a GET request header:

curl --location 'https://r.jina.ai/https://www.academia.edu/download/51627580/A_Generalized_Reinforcement-Learning_Mod20170203-31871-44ae37.pdf?hl=en&sa=T&oi=ggp&ct=res&cd=13&d=7556059458401941712&ei=IcXCZpzRGPCz6rQP5oSLmQM' \
--header 'X-Referer: https://scholar.google.com' \
--header 'X-No-Cache: true'