practical-data-science / ecommercetools

EcommerceTools is a Python data science toolkit for ecommerce, marketing science, and technical SEO analysis and modelling and was created by Matt Clarke.
MIT License

Response from _get_results(query) contains NoneType, which leads to a parsing failure #35

Open stRudolph opened 1 year ago

stRudolph commented 1 year ago

Hi Matt,

While trying to scrape Google, I followed your blog post on scraping Google in 3 lines and got the following error:

AttributeError                            Traceback (most recent call last)
Cell In[2], line 1
----> 1 results = seo.get_serps("stupid")
      2 print(results)
File c:\Users\stephan.rudolph\Coding\testenv\Lib\site-packages\ecommercetools\seo\google_search.py:144, in get_serps(query, output)
    133 """Return the first 10 Google search results for a given query.
    134 
    135 Args:
   (...)
    140     results (dict): Results of query.
    141 """
    143 response = _get_results(query)
--> 144 results = _parse_search_results(response)
    146 if results:
    147     if output == "dataframe":

File c:\Users\stephan.rudolph\Coding\testenv\Lib\site-packages\ecommercetools\seo\google_search.py:124, in _parse_search_results(response)
    118 output = []
    120 for result in results:
    121     item = {
    122         'title': result.find(css_identifier_title, first=True).text,
    123         'link': result.find(css_identifier_link, first=True).attrs['href'],
--> 124         'text': result.find(css_identifier_text, first=True).text
...
    125     }
    127     output.append(item)
    129 return output

AttributeError: 'NoneType' object has no attribute 'text'

Then I tried your other blog post on scraping with Python, which does not rely on the ecommercetools package, and followed it to the T. Here is the interesting part:

results = google_search("stupid")
results

yields normal output. Rerunning this (Jupyter cell) with the keyword

results = google_search("allergy")
results

yields

AttributeError                            Traceback (most recent call last)
Cell In[9], line 1
----> 1 results = google_search("allergy")
      2 results

Cell In[8], line 3, in google_search(query)
      1 def google_search(query):
      2     response = get_results(query)
----> 3     return parse_results(response)

Cell In[7], line 17, in parse_results(response)
     10 output = []
     12 for result in results:
     14     item = {
     15         'title': result.find(css_identifier_title, first=True).text,
     16         'link': result.find(css_identifier_link, first=True).attrs['href'],
---> 17         'text': result.find(css_identifier_text, first=True).text
     18     }
     20     output.append(item)
     22 return output

AttributeError: 'NoneType' object has no attribute 'text'

So sometimes result.find(css_identifier_text, first=True) apparently returns None instead of an element. I have no idea under which circumstances this None arises, but the behavior is as follows: seo.get_serps() from ecommercetools consistently throws the error, while the "hand-written" equivalent is keyword-sensitive, e.g. "allergy" throws the error, "keyword sensitive" does not.
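
For what it's worth, both tracebacks fail on the same pattern: .text is read directly from whatever result.find(..., first=True) returns, and that call returns None when the selector matches nothing inside a given result. A possible workaround, only a sketch and not the library's actual code, is to check for None before reading .text or .attrs. The helper names (text_or_none, href_or_none, parse_result), the selectors, and the FakeElement/FakeResult stand-ins below are all hypothetical and exist only so the snippet runs without hitting Google.

```python
def text_or_none(element):
    """Return element.text, or None if the selector matched nothing."""
    return element.text if element is not None else None


def href_or_none(element):
    """Return the element's href attribute, or None if it is missing."""
    return element.attrs.get("href") if element is not None else None


def parse_result(result, css_title, css_link, css_text):
    """Build one result dict without assuming every selector matches."""
    return {
        "title": text_or_none(result.find(css_title, first=True)),
        "link": href_or_none(result.find(css_link, first=True)),
        "text": text_or_none(result.find(css_text, first=True)),
    }


# Hypothetical stand-ins that mimic only the parts of the element API
# used above (.find(selector, first=True), .text, .attrs).
class FakeElement:
    def __init__(self, text="", attrs=None):
        self.text = text
        self.attrs = attrs or {}


class FakeResult:
    def __init__(self, matches):
        self._matches = matches

    def find(self, selector, first=False):
        # Like the real lookup with first=True: None when nothing matches.
        return self._matches.get(selector)


# A result block that has a title and a link but no text snippet.
result = FakeResult({
    "h3": FakeElement("Example title"),
    "a": FakeElement(attrs={"href": "https://example.com"}),
})

item = parse_result(result, "h3", "a", ".snippet")
print(item)
# {'title': 'Example title', 'link': 'https://example.com', 'text': None}
```

With a guard like this, a result without a snippet would come back with 'text': None instead of aborting the whole parse. It would not explain why the selector fails to match for some keywords, only stop the AttributeError.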