rohithpr / py-web-search

A Python module to fetch and parse results from different search engines.
MIT License

Google Search Limitation #16

Open dhondta opened 8 years ago

dhondta commented 8 years ago

Problem: When a lot of searches are performed consecutively, Google detects the bot nature of the Python script and starts answering with alternative pages protected by a captcha.

Solution: Once Google starts sending these alternative pages, fall back to the Splinter library and perform browser automation with human-like behavior, spacing requests with a random wait timer. This is far slower, but it keeps the script working.

Note: If you are interested, let me know, as I have already implemented this solution. Note that it bypasses Google's bot control, so it will most likely only work for a limited time.

rohithpr commented 8 years ago

I've already set up a service that solves this: it waits for 3-10 seconds and then makes the request to Google to prevent blocking.

I'm considering introducing the wait directly in this library for those who don't want to use search-api. At the moment the sleep parameter takes a bool value and waits for one second before making the request. It would be great to accept a number instead and wait that many seconds, so that random behavior can be simulated. Of course, this would have to be backwards compatible.

Would you be interested in implementing this?

dhondta commented 8 years ago

In my use case, I want to avoid using search APIs, but I also want to perform (relatively) quick and numerous searches. That's why I ran into the problem of Google detecting my bot script and responding with a captcha page. So, with your solution, the problem is that a fixed sleep before every request slows down all searches, even when Google is not blocking anything.

I propose the following solution: we adapt your library with the following features:

  1. Use a state variable in your Google class (since everything in Python is an object, this is very easy with a class variable)
  2. Parametrize the bounds of the interval passed to random.uniform
  3. Add a check at the beginning of *.search(...) to handle the state
  4. Add exception handling to *.scrape_search_result(soup) so the script does not crash when the response is not formatted as expected

This gives:

Edit: I had mixed up the code blocks in if 0 < Google.tc < Google.tcs: ... else: ...; the version below is now correct.

import random
import time

import requests
import splinter
from bs4 import BeautifulSoup
[...]

class Google:
    # state variable
    tc = 3 # Try Count
    # constants
    tcs = 3 # Try Count Start
    rlb = .5 # Random Lower Bound
    rub = 1.5 # Random Upper Bound

    @staticmethod
    def search(query, num=10, start=0, recent=None):   # here, we remove "sleep=True,"
        soup = None
        if 0 < Google.tc < Google.tcs:                 # after the first failure, wait with exponential backoff
            wait(2**(Google.tcs - Google.tc) * random.uniform(Google.rlb, Google.rub))
        elif Google.tc <= 0:                           # when the try count reaches zero, fall back to browser automation
            soup = Google.search_alternative(query, num, start, recent)

        # REMOVE "if sleep: wait(1)"

        url = ...
        soup = soup or BeautifulSoup(requests.get(url).text, "lxml") # reuse the soup built by the alternative method if set
        results = Google.scrape_search_result(soup) # this call used to crash when Google responded with captchas;
                                                    # with exception handling in that method (see below), "results"
                                                    # now simply becomes None instead
        related_queries = [] if not results else Google.scrape_related(soup)  # not tested with alternative method

        raw_total_results = None if not results else soup.find('div', attrs={'class': 'sd'}).string
        [...]

        temp = ...
        if not results:
            Google.tc -= 1
        else:
            Google.tc = Google.tcs
        return temp

    @staticmethod
    def search_alternative(query, num=10, start=0, recent=None):
        browser = splinter.Browser()
        browser.visit('https://www.google.com')
        time.sleep(random.uniform(0.500, 1.500))
        browser.fill('q', query)
        button = browser.find_by_css('.lsb').first
        time.sleep(random.uniform(0.250, 1.000))
        button.click()
        time.sleep(.5)
        soup = BeautifulSoup(browser.html, "lxml")
        soup.find("div", {"class" : "rc"}).name = "li"
        soup.find("li", {"class" : "rc"})['class'] = "g"
        browser.quit()
        del browser
        return soup

    @staticmethod
    def scrape_search_result(soup):
        [...]
        try:
            [...]
        except [?] as e:
            [...]   # handle exception
            return  # then return None
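
For instance, the handler could be made concrete as below. This is only a sketch: the selector is illustrative rather than the library's actual one, and AttributeError is the natural candidate because a failed .find() returns None, so the next attribute access raises it.

    @staticmethod
    def scrape_search_result(soup):
        try:
            # illustrative selector only, not the library's actual one
            return [h3.a.text for h3 in soup.find_all('h3', attrs={'class': 'r'})]
        except AttributeError:
            # on a captcha page the expected markup is missing, so an
            # attribute lookup on a None element raises AttributeError
            return  # i.e. "results" becomes None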

rohithpr commented 8 years ago

This is what I suggest within pws:

if isinstance(sleep, bool) and sleep: # backward compatibility, people are passing boolean values
    wait(1)
elif isinstance(sleep, int):
    wait(sleep)

Calling the function: if you're making just one query: result = Google.search(query='foo', sleep=0) # 1 would be the default.

If you're making many queries: result = Google.search(query='foo', sleep=10) # or however many seconds you want to wait.

There is a major issue with your method: I made a whole bunch of requests and got myself blocked; now, regardless of how I reach Google (Firefox, requests, or Selenium), I get the same captcha response. I'm not sure if you've solved that; please let me know if you have.

These are my recommendations:

  1. Rather than using search_alternative as a fallback, the scraper should just catch the error and report it to the user, e.g. as {error: 'foo bar'} or something similar (see the sketch after this list).
  2. search_alternative should be a method that users call only if they want to. (I couldn't figure out whether a webdriver must be installed separately or whether installing splinter is enough. I don't want to impose any external dependencies; please let me know if you know more about this.)
  3. Rather than maintaining state within the library, we let the user specify the sleep duration when calling the search method, as stated above.
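
A rough illustration of the first point (the wrapper name, dict shape, and message are hypothetical, not an existing pws API):

def checked_search(soup):            # hypothetical wrapper name
    results = Google.scrape_search_result(soup)
    if results is None:
        # the expected markup is missing, most likely a captcha page
        return {'error': 'blocked by Google, received a captcha page'}
    return results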

dhondta commented 8 years ago

Sorry, the reason you received the captcha response is that I had mixed up two lines of code. This is now corrected. The effect is that failures trigger a wait timer with exponential backoff, and browser automation is only attempted if normal requests really do not work.

So, if you want to retest...

dhondta commented 8 years ago

Anyway, this is of course your project, so I won't waste your time trying to convince you that there could be a better way to proceed. It is only a question of use case.

In my own use case, I need to trigger asynchronous tasks running Google/Bing/[whatever search engine] searches, being sure that each one will eventually send back a result, without having to care about adapting the wait time myself. So, for my application, this should be handled transparently by the library, e.g. as in the sketch below.
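
(A hypothetical illustration, not the current API: the library would retry with backoff internally, so the caller never manages wait times.)

# hypothetical behavior: the library waits and retries with exponential
# backoff internally until it obtains a real result
result = Google.search(query='foo')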

So, I suppose your solution with the sleep argument fits your needs, but not mine.

Note: why don't you simply write wait(int(sleep)) (casting sleep to an integer) instead of the block of 4 lines you mentioned?
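
(This works because bool is a subclass of int in Python:)

int(True)   # 1  -> wait(1), the legacy sleep=True behavior
int(False)  # 0  -> wait(0), i.e. effectively no wait
int(10)     # 10 -> wait(10)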

Regarding your enumerated list of remarks:

  1. That's indeed a solution, but it completely removes the fallback feature.
  2. If the function remains a fallback, one can simply add a fallback argument, set to False by default, to avoid using browser automation.
  3. Maintaining state is precisely what enables the automated mechanism.

If the points behind these remarks are requirements for your project, then I understand that my proposal is not a solution for you.

About the dependencies: pip install selenium provides the webdriver support for Firefox. So, unfortunately, as your remark suggested, splinter effectively requires an extra dependency to work (and, of course, Firefox itself must be installed).

dhondta commented 8 years ago

Given your requirements, your solution is surely what you need. Anyway, I have already scripted the fallback function in my own application, and with your solution I can simply handle the exponential backoff myself through your sleep argument. So, no problem; my ideas were just suggestions.

Do you need collaboration on any further implementation?

rohithpr commented 8 years ago

I'm just trying to keep things simple and address a general issue.

In many cases, students who are new to scraping will get blocked for a long time if they make rapid requests without backing off (especially from a static IP). Newcomers will make thousands of requests every second; trust me, I've been there. So I'd like the library to make it clear to them that there are serious issues here. By providing this option, we are pretty much handing over a weapon and hoping they don't shoot themselves. It's not a technical issue so much as a matter of making sure people don't get into trouble.

So I'd really like it if you implemented this as an alternative to *.search (and added a huge warning in the docs about its usage!).

Maybe a shell function that is responsible for maintaining the state and related bookkeeping: it would determine whether to call search or search_alt, so the existing search itself doesn't end up doing a whole bunch of extra work. This shell function would also be responsible for making sure that the backoff mechanism is in place regardless of whether we're making a search query or a news query. Something like the sketch below.
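
A minimal sketch of such a shell function, reusing the names from the snippets above (smart_search is a placeholder, and I'm assuming search_alt returns parsed results just like search):

    @staticmethod
    def smart_search(query, **kwargs):
        # shell function: owns the backoff state so that search() and
        # search_news() themselves stay simple
        if Google.tc <= 0:
            # too many consecutive failures: advanced, opt-in fallback
            result = Google.search_alt(query, **kwargs)
        else:
            if Google.tc < Google.tcs:
                # back off exponentially after the first failure
                wait(2 ** (Google.tcs - Google.tc) * random.uniform(Google.rlb, Google.rub))
            result = Google.search(query, **kwargs)
        # failure decrements the try count, success resets it
        Google.tc = Google.tc - 1 if result is None else Google.tcs
        return result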

Sound good to you?

rohithpr commented 8 years ago

> About the dependencies: pip install selenium provides the webdriver support for Firefox. So, unfortunately, as your remark suggested, splinter effectively requires an extra dependency to work (and, of course, Firefox itself must be installed).

This is something that I'm hoping to avoid. The shell function mentioned above would be useful here too: we can import these packages only when required, and let users perform regular searches even if they don't have Firefox installed. So this would be something of an advanced option; the import could be deferred as in the sketch below.
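
A sketch of the deferred import (the error message is mine; the method body is dhondta's snippet from above, elided):

    @staticmethod
    def search_alternative(query, num=10, start=0, recent=None):
        # deferred import: regular search keeps working without splinter/Firefox
        try:
            import splinter
        except ImportError:
            raise ImportError('search_alternative needs browser automation: '
                              'pip install splinter selenium (and install Firefox)')
        browser = splinter.Browser()
        ...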