scholarly-python-package / scholarly

Retrieve author and publication information from Google Scholar in a friendly, Pythonic way without having to worry about CAPTCHAs!
https://scholarly.readthedocs.io/
The Unlicense

MaxTriesExceededException: Cannot fetch from Google Scholar with free proxies #500

Closed · kirk86 closed this issue 1 year ago

kirk86 commented 1 year ago

Describe the bug: scholarly cannot fetch from Google Scholar with free proxies.

To Reproduce: a minimal code snippet is provided under Additional context below; the bug is due to proxy issues and may not be exactly reproducible.

Expected behavior: I was under the impression that using proxies would resolve issues with rate limits?


Traceback (most recent call last):
  File "fetch_citation.py", line 36, in <module>
    search_query = scholarly.search_pubs(title)
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/miniconda3/envs/lib/python3.11/site-packages/scholarly/_scholarly.py", line 160, in search_pubs
    return self.__nav.search_publications(url)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/miniconda3/envs/lib/python3.11/site-packages/scholarly/_navigator.py", line 296, in search_publications
    return _SearchScholarIterator(self, url)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/miniconda3/envs/lib/python3.11/site-packages/scholarly/publication_parser.py", line 53, in __init__
    self._load_url(url)
  File "/miniconda3/envs/lib/python3.11/site-packages/scholarly/publication_parser.py", line 59, in _load_url
    self._soup = self._nav._get_soup(url)
                 ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/miniconda3/envs/lib/python3.11/site-packages/scholarly/_navigator.py", line 239, in _get_soup
    html = self._get_page('https://scholar.google.com{0}'.format(url))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/miniconda3/envs/lib/python3.11/site-packages/scholarly/_navigator.py", line 190, in _get_page
    raise MaxTriesExceededException("Cannot Fetch from Google Scholar.")
scholarly._proxy_generator.MaxTriesExceededException: Cannot Fetch from Google Scholar.



Additional context: I'm not sure, but I suppose there's an issue with the proxies.

from scholarly import scholarly, ProxyGenerator

pg = ProxyGenerator()
pg.FreeProxies()
scholarly.use_proxy(pg)

title = "Perception of physical stability and center of mass of 3D objects"  # example query; substitute your own paper title
search_query = scholarly.search_pubs(title)
scholarly.pprint(next(search_query))
arunkannawadi commented 1 year ago

Using FreeProxies is not guaranteed to solve this issue. It mainly protects your own IP address from being exposed and flagged by Google Scholar. It is likely to start working again after some time, or you can switch to a more robust option such as ScraperAPI (they have a free plan).
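A minimal sketch of switching to ScraperAPI would look roughly like this (the key string is a placeholder for your own ScraperAPI key, and the error handling is just illustrative):

from scholarly import scholarly, ProxyGenerator

pg = ProxyGenerator()
success = pg.ScraperAPI("YOUR_SCRAPERAPI_KEY")  # placeholder: key from your ScraperAPI account
if not success:
    raise RuntimeError("ScraperAPI proxy setup failed")
scholarly.use_proxy(pg)

search_query = scholarly.search_pubs("some paper title")  # placeholder query
scholarly.pprint(next(search_query))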

kirk86 commented 1 year ago

Thanks, just a quick clarifying question: suppose you have multiple queries to send to Google Scholar (e.g. in a batch loop); do we need to initialize a new ProxyGenerator for each query, or does it suffice to initialize it just once?

Another way to ask the same question: does the ProxyGenerator change the proxy IP on every call to scholarly.search_pubs(), or does it use the same proxy IP every time scholarly.search_pubs() is invoked?

It is likely to start working again after some time, or you can switch to a more robust option such as ScraperAPI

Can this be used on a batch of queries without rate-limit interruptions/exceptions from Google Scholar?

abubelinha commented 1 year ago

I have the same questions as @kirk86. I am also wondering about the recommended way of handling that MaxTriesExceededException. @arunkannawadi, as you closed the issue, I understand you suggest that this be done in each user's script? In that case it would be great to see a minimal example in the docs, something like the sketch below.
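Just to make the request concrete, this is roughly the kind of example I mean (the retry helper and its policy are my own sketch, not something from scholarly; the exception's import path is taken from the traceback above):

from scholarly import scholarly, ProxyGenerator
from scholarly._proxy_generator import MaxTriesExceededException  # module path as seen in the traceback

def search_pubs_with_retries(query, max_attempts=3):
    # Hypothetical helper, not part of scholarly: re-create the proxy setup
    # and retry whenever scholarly gives up.
    for attempt in range(max_attempts):
        pg = ProxyGenerator()
        pg.FreeProxies()              # pick up a fresh set of free proxies
        scholarly.use_proxy(pg)
        try:
            return scholarly.search_pubs(query)
        except MaxTriesExceededException:
            print(f"Attempt {attempt + 1} failed, retrying with new proxies ...")
    raise MaxTriesExceededException(f"Gave up after {max_attempts} attempts")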

But instead, wouldn't it be possible for scholarly itself to handle it and change the proxy when the exception is raised? (Although this is related to what @kirk86 already asked about how the ProxyGenerator works.)

Thanks a lot in advance for your answers.

kirk86 commented 1 year ago

I've checked ScraperAPI and it works great, no interruptions/exceptions whatsoever. The question here is: what are they doing that works so well that it cannot be replicated using FreeProxies() or some other approach?

abubelinha commented 1 year ago

Yes, ScraperAPI works, but I was more interested in a different question: using FreeProxies() alone fails a lot ... but sometimes it works.

I discovered this by rerunning my scholarly/FreeProxies script many times. There is a (small) chance that one of those runs gets a good proxy, and then it starts scraping. If not, you'll usually get the MaxTriesExceededException (although I recall getting some other messages which were not unhandled exceptions).

So the point is: if you are not in a rush, decide not to use ScraperAPI, and prefer your script to keep trying FreeProxies until it works ... shouldn't scholarly itself detect when that MaxTriesExceededException happens, and then rotate proxies until it gets a different one that works?

I would suggest keeping this issue open until that happens. Otherwise, I don't see the point of having a FreeProxies() option available at all.

BTW, I think this issue is just a duplicate of https://github.com/scholarly-python-package/scholarly/issues/465. @arunkannawadi, which one should we keep using? Going by its title, I think this one is more relevant to my "let scholarly keep trying FreeProxies()" idea.

Thanks

PS - FWIW, I think updating fake_useragent was not relevant for this issue. The few times I got a successful FreeProxies() scraping run, I still had fake-useragent 0.1.11. That was before I read #465 and upgraded fake_useragent as suggested there (now I have fake-useragent 1.1.3 and still keep getting MaxTriesExceededException).

kirk86 commented 1 year ago

Using FreeProxies() alone fails a lot ... but sometimes it works. There is a (small) chance that one of those runs gets a good proxy, and then it starts scraping.

If you are not in a rush, decide not to use ScraperAPI, and prefer your script to keep trying FreeProxies until it works ... shouldn't scholarly itself detect when that MaxTriesExceededException happens, and then rotate proxies until it gets a different one that works?

Basically yes. There are plenty of free proxies, and some are more reliable than others; I think the MaxTriesExceededException happens regardless of whether the proxy is good or bad (I might be wrong, though). As stated, proxies serve the purpose of not having your IP banned by Google Scholar.

So, when using free proxies with scholarly, our scripts pick an available proxy and send requests to Google Scholar; once the maximum number of requests is reached, Google Scholar blocks that proxy's IP. So, on the next run of our script, scholarly would have to be smart and use a flag to indicate which free proxies are currently banned and which are not, then pick from those that are not banned (roughly the kind of bookkeeping sketched below). But this creates another issue: you would need an infinite number of proxies to keep rotating; with a finite pool, most of them will be banned quickly given the large number of users that rely on scholarly.
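Purely to illustrate the kind of bookkeeping I mean (this is not scholarly's actual code; the proxy addresses and the helper are made up):

import requests

free_proxies = ["http://1.2.3.4:8080", "http://5.6.7.8:3128"]  # made-up addresses
banned = set()  # proxies we believe Google Scholar has already blocked

def fetch_with_rotation(url):
    # Try each proxy that is not yet marked as banned; mark it banned on failure.
    for proxy in free_proxies:
        if proxy in banned:
            continue
        try:
            resp = requests.get(url, proxies={"http": proxy, "https": proxy}, timeout=10)
            if resp.status_code == 200:
                return resp.text
            banned.add(proxy)      # e.g. 403/429 responses -> treat the proxy as blocked
        except requests.RequestException:
            banned.add(proxy)      # unreachable proxy, skip it from now on
    raise RuntimeError("No working free proxies left")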

Hence the question: does ScraperAPI implement the same logic? I highly doubt it; I think they are doing something different. The question is what that is and why it works so robustly, because it allows continuous queries to Google Scholar without being banned at all.

abubelinha commented 1 year ago

Yes, you are right.
I am a total noob, but I'd bet the answer is:

This is impossible to emulate using FreeProxies, and there's nothing scholarly can do about it, because you use those IPs directly from your machine, without passing through any centralised provider that controls the flooding risk towards a given server (GS or whatever).

I do think that handling the exceptions coming from the internally used FreeProxy library in order to rotate/retry (until you are lucky enough to get a proxy that GS has not yet banned) would be possible. Or at least introducing some options for how to handle that.

Maybe @jundymek (FreeProxy) can comment and/or suggest workarounds to the scholarly developers?

arunkannawadi commented 1 year ago

Suppose you have multiple queries to send to Google Scholar (e.g. in a batch loop); do we need to initialize a new ProxyGenerator for each query, or does it suffice to initialize it just once?

It is sufficient to initialize the ProxyGenerator once. If it's set to use FreeProxies, then once the current proxy stops working, it'll look for the next working proxy automatically, without you having to re-initialize anything.
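So a batch loop only needs something like this (the titles here are placeholders):

from scholarly import scholarly, ProxyGenerator

pg = ProxyGenerator()
pg.FreeProxies()            # set up the proxy once ...
scholarly.use_proxy(pg)

titles = ["First paper title", "Second paper title"]   # placeholder queries
for title in titles:
    result = next(scholarly.search_pubs(title))        # ... and reuse it for every query
    scholarly.pprint(result)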

Can [ScraperAPI] be used on a batch of queries without rate-limit interruptions/exceptions from Google Scholar?

Yes, this is a lot better than using FreeProxies. But you'll need to get on their subscription plan to do any large batches of queries, since you'll run out of the free credits they offer pretty quickly.

I am also wondering about the recommended way of handling that MaxTriesExceededException. @arunkannawadi, as you closed the issue, I understand you suggest that this be done in each user's script? In that case it would be great to see a minimal example in the docs. But instead, wouldn't it be possible for scholarly itself to handle it and change the proxy when the exception is raised?

Good point, I hadn't realized there's no mention of it in the documentation. MaxTriesExceededException is raised by scholarly itself, so there's no point in making scholarly handle it. It is a sign that scholarly is giving up on querying the GS pages. If you don't want to give up so quickly and want scholarly to keep trying, then set longer intervals for timeout and wait_time, for example by calling pg.FreeProxies(timeout=10, wait_time=1200). This would wait up to 20 minutes (1200 seconds) for new free proxies to show up before giving up.
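In context, the setup would look like this (same parameters as in the call above):

from scholarly import scholarly, ProxyGenerator

pg = ProxyGenerator()
# Per the explanation above, wait_time controls how long scholarly keeps
# looking for new free proxies before giving up (here, 20 minutes).
pg.FreeProxies(timeout=10, wait_time=1200)
scholarly.use_proxy(pg)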

on the next run of our script, scholarly would have to be smart and use a flag to indicate which free proxies are currently banned and which are not, then pick from those that are not banned

Yes, scholarly keeps track of that, as long as the Python session is not terminated.

ScraperAPI probably uses private proxies, and they probably have quite a big number of them.

That's my understanding as well. They have dedicated machines and networks to do this, which is why they are a paid service.

abubelinha commented 1 year ago

If you don't want to give up so quickly and want scholarly to keep trying, then set longer intervals for timeout and wait_time, for example by calling pg.FreeProxies(timeout=10, wait_time=1200). This would wait up to 20 minutes (1200 seconds) for new free proxies to show up before giving up.

Thanks @arunkannawadi, that is what I'd like to do. Personally, I am not in a rush to get results, so I don't mind having a .py script running in the background the whole day while I do other stuff. Is it possible to set wait_time to "no limit", in order to let the script work forever until it reaches the end of its programmed queries? In my use case, I have already downloaded (using ScraperAPI) a file with ~1K "unfilled" publications. Now I just want to loop over them to fill in their detailed information (using FreeProxies, since I am not in a hurry), roughly as in the sketch below.
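Something along these lines is what I have in mind (the pickle file name is just an example, and I'm assuming the saved items are the unfilled publication dicts returned by search_pubs):

import pickle
from scholarly import scholarly, ProxyGenerator

pg = ProxyGenerator()
pg.FreeProxies(timeout=10, wait_time=1200)   # be patient with free proxies
scholarly.use_proxy(pg)

# "unfilled_pubs.pkl" is a placeholder for the file I downloaded earlier via ScraperAPI
with open("unfilled_pubs.pkl", "rb") as f:
    unfilled_pubs = pickle.load(f)

filled = [scholarly.fill(pub) for pub in unfilled_pubs]   # fetch the detailed record for each publication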

Also, what does timeout=10 mean in your example? Is it the maximum time scholarly will wait for the Google server's answer before giving up on that particular query? If so, what happens next? Would scholarly handle the situation and repeat the same query until it gets an answer from Google, or would it raise an exception and end the script?