scholarly-python-package / scholarly

Retrieve author and publication information from Google Scholar in a friendly, Pythonic way without having to worry about CAPTCHAs!
https://scholarly.readthedocs.io/
The Unlicense

Fill author cannot handle 302 redirects #469

Closed thelondonsimon closed 1 year ago

thelondonsimon commented 1 year ago

Describe the bug

When calling fill() on an author record whose scholar_id triggers a 302 redirect, scholarly gets stuck in a loop requesting the original URL.

To Reproduce

from scholarly import scholarly
import logging

logging.basicConfig(level=logging.INFO)

# Outdated ID; in a browser it redirects (302) to 'PEJ42J0AAAAJ'
scholar_id = 'oMaIg8sAAAAJ'
author = scholarly.search_author_id(scholar_id, filled=True)

Results in logging such as:

INFO:scholarly:Getting https://scholar.google.com/citations?hl=en&user=oMaIg8sAAAAJ&pagesize=100
INFO:scholarly:Getting https://scholar.google.com/citations?hl=en&user=oMaIg8sAAAAJ&cstart=100&pagesize=100
INFO:scholarly:Getting https://scholar.google.com/citations?hl=en&user=oMaIg8sAAAAJ&cstart=200&pagesize=100
INFO:scholarly:Getting https://scholar.google.com/citations?hl=en&user=oMaIg8sAAAAJ&cstart=300&pagesize=100
INFO:scholarly:Getting https://scholar.google.com/citations?hl=en&user=oMaIg8sAAAAJ&cstart=400&pagesize=100
...

Expected behavior

The 302 redirect should be followed, and the results should be the same as when searching for scholar_id = 'PEJ42J0AAAAJ'.


arunkannawadi commented 1 year ago

With no proxy and with FreeProxies, I could not fetch this (see #465), but when I tried with ScraperAPI, I got a 404 (Not Found) response for scholar_id = 'oMaIg8sAAAAJ' and success with scholar_id = 'PEJ42J0AAAAJ'. It is true that the code does not handle 302 redirects, but I did not encounter this behavior. Is this a consistent issue?

thelondonsimon commented 1 year ago

Yes, I found it was a recurring issue. I had a script iterating through a list of scholar_ids, and whenever it encountered one that effectively had a 302 redirect (when viewed in a browser), it would get stuck in the kind of loop shown in the logging in my original post.

arunkannawadi commented 1 year ago

Redirection should now be handled in the recent version (>= 1.7.8).

However, I'd still recommend not passing an outdated scholar_id, because handling redirection is of limited use. Google Scholar redirects only the author's main page. scholarly constructs specific URLs from the given scholar_id to fill in all the relevant information, and those URLs get a 404 response instead of a 302.

For example, the URL for fetching publication information with the outdated ID would be https://scholar.google.com/citations?view_op=view_citation&hl=en&user=oMaIg8sAAAAJ&citation_for_view=oMaIg8sAAAAJ:M3ejUd6NZC8C, which returns 404, whereas https://scholar.google.com/citations?user=oMaIg8sAAAAJ&hl=en gets a 302.
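To illustrate the idea of picking up the current ID from a redirect, here is a minimal sketch using only the standard library. It is not scholarly's actual implementation: the helper names scholar_id_from_url and resolve_redirect are hypothetical, and it assumes the redirect's Location header carries the new ID in the user query parameter, as the author-page URLs above do.

```python
from urllib.parse import urlsplit, parse_qs


def scholar_id_from_url(url):
    """Return the value of the 'user' query parameter, or None if absent."""
    query = parse_qs(urlsplit(url).query)
    values = query.get("user")
    return values[0] if values else None


def resolve_redirect(status_code, location, requested_id):
    """Return the ID to use for follow-up requests.

    Hypothetical helper: if the author page answered with a 302 whose
    Location points at a different 'user' value, prefer that value;
    otherwise keep the ID that was originally requested.
    """
    if status_code == 302 and location:
        new_id = scholar_id_from_url(location)
        if new_id and new_id != requested_id:
            return new_id
    return requested_id
```

With the resolved ID in hand, URLs for individual publications could then be built from the current ID rather than the outdated one, which is what avoids the 404 responses described above.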

arunkannawadi commented 1 year ago

OK! scholarly v1.7.11 is smart enough to update the scholar_id and allow all methods that would normally be allowed. I just learnt that scholar_id values that are close to each other point to the same user. For example, https://scholar.google.com/citations?user=PEJ42J0AAAAR and https://scholar.google.com/citations?user=PEJ42J0AAABJ both point to the same profile. Google Scholar is just weird.

Anyhow, the redirection appears to work with no proxies and with FreeProxies, but not with ScraperAPI, despite turning on the relevant API parameters. However, this shouldn't be an issue, since scholarly uses FreeProxies to fetch this information even if you have set up ScraperAPI (unless you use ScraperAPI as the secondary proxy as well, which is in general a bad idea).