scholarly-python-package / scholarly

Retrieve author and publication information from Google Scholar in a friendly, Pythonic way without having to worry about CAPTCHAs!
https://scholarly.readthedocs.io/
The Unlicense

search_pubs - StopIteration #522

Closed MatteoRiva95 closed 7 months ago

MatteoRiva95 commented 8 months ago

Describe the bug

After I run the code, I receive this error:


StopIteration                             Traceback (most recent call last)
<ipython-input> in <cell line: 8>()
      6
      7 search_query = scholarly.search_pubs("Advances in the diagnosis and treatment of small bowel lesions with Crohn's disease using double-balloon endoscopy")
----> 8 scholarly.pprint(next(search_query))

/usr/local/lib/python3.10/dist-packages/scholarly/publication_parser.py in __next__(self)
     91                 return self.__next__()
     92             else:
---> 93                 raise StopIteration
     94
     95     # Pickle protocol

StopIteration:

To Reproduce

from scholarly import scholarly, ProxyGenerator

pg = ProxyGenerator()
success = pg.FreeProxies()
scholarly.use_proxy(pg)

search_query = scholarly.search_pubs("Advances in the diagnosis and treatment of small bowel lesions with Crohn's disease using double-balloon endoscopy")
scholarly.pprint(next(search_query))

Expected behavior

I would like to give scholarly a paper title and get back the URL of its PDF.
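For reference, a minimal sketch of that workflow, assuming the result dictionary exposes the eprint_url and pub_url fields that scholarly publications typically carry (check the keys on your own results):

from scholarly import scholarly

search_query = scholarly.search_pubs(
    "Advances in the diagnosis and treatment of small bowel lesions "
    "with Crohn's disease using double-balloon endoscopy")
pub = next(search_query)

# 'eprint_url' usually points at a freely available PDF when one exists;
# 'pub_url' is the publication page and serves as a fallback
print(pub.get('eprint_url') or pub.get('pub_url'))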


Thank you in advance!

Owaiskhan9654 commented 7 months ago

Any update? I am still getting this issue. @ipeirotis @papr @marcoscarpetta @guicho271828

ipeirotis commented 7 months ago

You need a paid proxy for searching publications.
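For example, a minimal sketch using one of the paid providers that ProxyGenerator supports (ScraperAPI here; the API key is a placeholder you would supply yourself):

from scholarly import scholarly, ProxyGenerator

pg = ProxyGenerator()
# "YOUR_SCRAPERAPI_KEY" is a placeholder; sign up with the provider to get one
success = pg.ScraperAPI("YOUR_SCRAPERAPI_KEY")
assert success, "proxy setup failed"
scholarly.use_proxy(pg)

search_query = scholarly.search_pubs("double-balloon endoscopy Crohn's disease")
scholarly.pprint(next(search_query))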

jmoraispk commented 3 months ago

@Owaiskhan9654, that was not the issue that the OP @MatteoRiva95 reported.

The issue was the StopIteration exception, which I don't think has anything to do with proxies.

First, consider this code:

from scholarly import scholarly

search_phrase = "massive MIMO"
search_query = scholarly.search_pubs(search_phrase)
# same phrase, but asking Google Scholar to start at result 970
search_query2 = scholarly.search_pubs(search_phrase, start_index=970)

You will get:

search_query.total_results   --> 179000
search_query2.total_results  --> 0      (it is 0 even with start_index=10)

This is issue 1.

Issue 2: when you iterate over the results with next(), search_query2 raises that exact exception (StopIteration) after 10 results or so.
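Until the root cause is fixed, iteration can at least be made not to blow up. A minimal sketch using the two-argument form of next(), which returns a sentinel instead of raising StopIteration:

from scholarly import scholarly

search_query = scholarly.search_pubs("massive MIMO")

results = []
while len(results) < 50:
    pub = next(search_query, None)  # None instead of StopIteration
    if pub is None:
        break  # iterator exhausted (or Google stopped serving pages)
    results.append(pub)

print(f"Collected {len(results)} results")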

What's going on? Any idea @ipeirotis?

ipeirotis commented 3 months ago

This seems to be related to Google's anti-crawling mechanism. The URL that we create seems to be flagged as unusual by Google, and Google returns an "error" page instead of results. I do not have a clear path to fixing this.
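For illustration only: the pagination that scholarly relies on boils down to Google Scholar's standard start query parameter, and a deep offset like the one below is the kind of request that tends to trigger the "unusual traffic" page. This is plain Scholar URL syntax, not a scholarly API:

from urllib.parse import urlencode

# Standard Google Scholar query parameters; a large 'start' offset is
# what deep pagination (e.g. start_index=970) translates into.
params = {"q": "massive MIMO", "start": 970}
url = "https://scholar.google.com/scholar?" + urlencode(params)
print(url)
# https://scholar.google.com/scholar?q=massive+MIMO&start=970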