scholarly-python-package / scholarly

Retrieve author and publication information from Google Scholar in a friendly, Pythonic way without having to worry about CAPTCHAs!
https://scholarly.readthedocs.io/
The Unlicense

AttributeError while fetching page #492

Open theogiraudet opened 1 year ago

theogiraudet commented 1 year ago

Describe the bug When I run a query without a proxy, captcha resolution fails with an error and the fetch is blocked.

Here is the error:

13:31:36 - Getting https://scholar.google.com/scholar?hl=en&q=Perception%20of%20physical%20stability%20and%20center%20of%20mass%20of%203D%20objects&as_vis=0&as_sdt=0,33
13:31:39 - Got a captcha request.
13:31:44 - Exception AttributeError while fetching page: ("'str' object has no attribute 'domain'",)
13:31:44 - Retrying with a new session.
13:31:54 - Got a captcha request.
13:32:01 - Exception AttributeError while fetching page: ("'str' object has no attribute 'domain'",)
13:32:01 - Retrying with a new session.
13:32:06 - Got a captcha request.
13:32:13 - Exception AttributeError while fetching page: ("'str' object has no attribute 'domain'",)
13:32:13 - Retrying with a new session.
13:32:18 - Got a captcha request.
13:32:25 - Exception AttributeError while fetching page: ("'str' object has no attribute 'domain'",)
13:32:25 - Retrying with a new session.
13:32:42 - Got a captcha request.
13:32:48 - Exception AttributeError while fetching page: ("'str' object has no attribute 'domain'",)
13:32:48 - Retrying with a new session.
Traceback (most recent call last):
  File "C:\Users\**\Google Scholar Scrapper\src\main.py", line 8, in <module>
    search_query = scholarly.search_pubs('Perception of physical stability and center of mass of 3D objects')
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\**\Google Scholar Scrapper\venv\Lib\site-packages\scholarly\_scholarly.py", line 160, in search_pubs
    return self.__nav.search_publications(url)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\**\Google Scholar Scrapper\venv\Lib\site-packages\scholarly\_navigator.py", line 296, in search_publications
    return _SearchScholarIterator(self, url)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\**\Google Scholar Scrapper\venv\Lib\site-packages\scholarly\publication_parser.py", line 53, in __init__
    self._load_url(url)
  File "C:\Users\**\Google Scholar Scrapper\venv\Lib\site-packages\scholarly\publication_parser.py", line 59, in _load_url
    self._soup = self._nav._get_soup(url)
                 ^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\**\Google Scholar Scrapper\venv\Lib\site-packages\scholarly\_navigator.py", line 239, in _get_soup
    html = self._get_page('https://scholar.google.com{0}'.format(url))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\**\Google Scholar Scrapper\venv\Lib\site-packages\scholarly\_navigator.py", line 190, in _get_page
    raise MaxTriesExceededException("Cannot Fetch from Google Scholar.")
scholarly._proxy_generator.MaxTriesExceededException: Cannot Fetch from Google Scholar.

Process finished with exit code 1

The issue seems to come from _proxy_generator.py#_handle_captcha2, line 403, where the cookie variable doesn't have the expected value. The error does not occur when proxies are enabled.

To Reproduce

import logging
from scholarly import scholarly

logging.basicConfig(format='%(asctime)s - %(message)s', level=logging.INFO, datefmt='%H:%M:%S')

search_query = scholarly.search_pubs('Perception of physical stability and center of mass of 3D objects')
scholarly.pprint(next(search_query))

Expected behavior The first result of the query is printed.

simon-20 commented 1 year ago

I am seeing this error too.

In _proxy_generator.py, the self._session object is an httpx.Client, and the cookies property on this client is a special Cookies store provided by httpx.

According to the httpx docs, there are no attributes for accessing the parts of a cookie directly:

In [1]: from httpx import Cookies
In [2]: cookies = Cookies()
In [3]: cookies.set("chocolate cookie", "tasty", domain="example.org")
In [4]: type(cookies['chocolate cookie'])
Out[4]: str
In [5]: cookies['chocolate cookie']
Out[5]: 'tasty'
In [6]: cookies['chocolate cookie'].domain
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
Cell In[6], line 1
----> 1 cookies['chocolate cookie'].domain
AttributeError: 'str' object has no attribute 'domain'
In [7]: cookies['chocolate cookie'].value
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
Cell In[7], line 1
----> 1 cookies['chocolate cookie'].value
AttributeError: 'str' object has no attribute 'value'

What's strange is that if that's right, then the code on lines 405-413 could never have worked, which seems unlikely?

arunkannawadi commented 1 year ago

I am unable to reproduce this error, but the attributes used on the httpx cookies certainly seem incorrect. They used to work with requests, but httpx doesn't seem to behave the same way.

simon-20 commented 1 year ago

Hi,

I see the error whenever a captcha is encountered. My program uses scholarly to try to get some abstracts from Google. After about 10-15 requests, even with random pauses in between, I hit a captcha, this code block is entered, and an exception is thrown. The exception is caught, so scholarly carries on trying to do what it was doing, though unsuccessfully.

haeggee commented 7 months ago

Reactivating this issue because I encounter the same problem, hoping someone with more expertise than me might be able to solve it. I investigated a little, and it seems that the cookies variable -- at least in my case -- is just a string (e.g. 'NID' or 'GSP'). Hence the error: Exception AttributeError while fetching page: ("'str' object has no attribute 'domain'",). Any help is very much appreciated! Thanks