scholarly-python-package / scholarly

Retrieve author and publication information from Google Scholar in a friendly, Pythonic way without having to worry about CAPTCHAs!
https://scholarly.readthedocs.io/
The Unlicense

Raising StopIteration Errors for some queries even when the http requests are successful (using ScraperAPI). #508

Open EthanC111 opened 11 months ago

EthanC111 commented 11 months ago

Describe the bug
Both of the queries below raise a StopIteration error even when the HTTP requests are successful.

To Reproduce

import logging
import sys

# Configure logging
logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s - %(levelname)s - %(message)s",
    handlers=[
        logging.StreamHandler(sys.stdout),
    ],
)

from scholarly import scholarly
from scholarly import ProxyGenerator

scraper_api_key = "YOUR_SCRAPER_API_KEY"
# query = "A Comparison of Transformer and Recurrent Neural Networks on Multilingual Neural Machine Translation"
query = "Reducing the Dimensionality of Data with Neural Networks."

pg = ProxyGenerator()
success = pg.ScraperAPI(scraper_api_key)
assert success, "ScraperAPI proxy setup failed"
scholarly.use_proxy(pg)
results = scholarly.search_pubs(query)
paper_info = next(results)
print(paper_info)
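A note on the failure mode: `search_pubs` returns an iterator, and `next()` on an exhausted iterator raises StopIteration. Until the underlying parser bug is fixed, callers can at least avoid the uncaught exception by passing a default to `next()`. A minimal stdlib illustration of that pattern (using a plain empty iterator, not scholarly itself):

```python
# next() raises StopIteration on an exhausted iterator;
# passing a default value returns it instead of raising.
empty = iter([])
result = next(empty, None)
print(result)  # None

# With a non-empty iterator, the first item is returned as usual.
first = next(iter([1, 2, 3]), None)
print(first)  # 1
```

This only masks the symptom (you get `None` instead of a crash); the real fix is in the parser, as discussed below in the thread.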

Expected behavior
The paper information should be printed.

Screenshots

[screenshot: StopIteration traceback]


ronny3 commented 10 months ago

I believe this happens when the result page uses the new Google Scholar UI that rolled out around this June. It occurs mostly when there is a single result. You can try this in publication_parser.py, adding the following to line 61:

+ self._soup.find_all('div', class_='gs_r gs_or gs_scl gs_fmar')

kostrykin commented 10 months ago

> I believe this is when the result is the new google scholar UI that came in this June or so. It happens when it's a single result most of the time. You can try this in publication_parser.py. Add this to line 61. + self._soup.find_all('div', class_='gs_r gs_or gs_scl gs_fmar')

Thanks for pointing this out. Just to be clear, the full line 61 should be changed from

self._rows = self._soup.find_all('div', class_='gs_r gs_or gs_scl') + self._soup.find_all('div', class_='gsc_mpat_ttl')

to

self._rows = self._soup.find_all('div', class_='gs_r gs_or gs_scl gs_fmar') + self._soup.find_all('div', class_='gsc_mpat_ttl')

Then it works.

gdudek commented 2 months ago

I was seeing intermittent failures again in April 2024. I needed to update that line (around line 61) to be:

self._rows = self._soup.find_all('div', class_='gs_r gs_or gs_scl gs_fmar') + self._soup.find_all('div', class_='gsc_mpat_ttl') + self._soup.find_all('div', class_='gs_r gs_or gs_scl')
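For what it's worth, the root cause of these repeated breakages may be that BeautifulSoup's `class_` argument, when given a string containing spaces, matches the `class` attribute as an exact string, so `'gs_r gs_or gs_scl'` does not match a div whose class is `'gs_r gs_or gs_scl gs_fmar'`. A token-subset match (or a CSS selector like `soup.select('div.gs_r.gs_or.gs_scl')`) would tolerate Google appending extra classes. A minimal stdlib-only sketch of the subset idea (the `RowCollector` class and sample HTML are hypothetical illustrations, not scholarly code):

```python
from html.parser import HTMLParser

# The three class tokens every result row carries, old UI or new.
REQUIRED = {"gs_r", "gs_or", "gs_scl"}

class RowCollector(HTMLParser):
    """Collect divs whose class token set CONTAINS the required tokens."""
    def __init__(self):
        super().__init__()
        self.rows = []

    def handle_starttag(self, tag, attrs):
        if tag != "div":
            return
        classes = set((dict(attrs).get("class") or "").split())
        # Subset test: extra classes such as gs_fmar do not break the match.
        if REQUIRED <= classes:
            self.rows.append(classes)

html = """
<div class="gs_r gs_or gs_scl">old UI row</div>
<div class="gs_r gs_or gs_scl gs_fmar">new UI single-result row</div>
"""
p = RowCollector()
p.feed(html)
print(len(p.rows))  # 2 -- both the old-UI and new-UI rows match
```

An exact-string match on `'gs_r gs_or gs_scl'` would find only the first div here, which is consistent with the symptom in this thread: the new-UI (often single-result) pages yield zero rows, so the iterator is empty and `next()` raises StopIteration.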