philipperemy / amazon-reviews-scraper

Yet another multi language scraper for Amazon targeting reviews.
Apache License 2.0
118 stars, 42 forks

[GUIDE] Bypass amazon detection each 5000 tries with change user agent method #11

Open pnthai88 opened 5 years ago

pnthai88 commented 5 years ago

Dear guys,

Thanks for sharing your code, philipperemy. It's helpful for my data science hobby at the moment. Here is how to bypass Amazon's bot detection:

### In core_utils.py: import fake_useragent

import logging
from time import sleep

import requests
from bs4 import BeautifulSoup
from fake_useragent import UserAgent

# AMAZON_BASE_URL is used exactly as in the existing core_utils.py.


def get_soup_retry(url):
    ua = UserAgent()
    if AMAZON_BASE_URL not in url:
        url = AMAZON_BASE_URL + url
    nap_time_sec = 1
    logging.debug('Script is going to sleep for {} (Amazon throttling). ZZZzzzZZZzz.'.format(nap_time_sec))
    sleep(nap_time_sec)

    logging.debug('-> to Amazon : {}'.format(url))
    while True:
        # Pick a new random User-Agent for every attempt. The headers must be
        # rebuilt each time, otherwise the new identity is never actually sent.
        user_agent = ua.random
        out = requests.get(url, headers={'User-Agent': user_agent})
        assert out.status_code == 200
        soup = BeautifulSoup(out.content, 'lxml')
        if 'captcha' not in str(soup):
            print('Bot bypassed')
            return soup
        print('Bot has been detected as {} ... retrying with a new identity'.format(user_agent))


def get_soup(url):
    return get_soup_retry(url)

It simply gets through after enough tries :) Good luck!
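Note that the loop above retries forever if Amazon keeps serving the captcha page. A minimal sketch of a bounded variant is shown here, with a cap on the number of tries and a growing pause between them. The names `fetch_with_retries`, `fetch`, `USER_AGENTS`, `max_tries` and `base_delay_sec` are hypothetical and not part of the repository; `fetch` stands in for whatever actually downloads the page.

```python
import random
import time

# Hypothetical static pool; any list of real browser User-Agent strings works.
USER_AGENTS = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 '
    '(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36',
    'Mozilla/5.0 (X11; Linux x86_64; rv:121.0) Gecko/20100101 Firefox/121.0',
]


def fetch_with_retries(fetch, url, max_tries=5, base_delay_sec=1.0):
    """Call fetch(url, headers) with a fresh User-Agent until no captcha appears.

    `fetch` is an assumed callable that returns the page HTML as a string,
    e.g. lambda url, headers: requests.get(url, headers=headers).text
    """
    for attempt in range(max_tries):
        headers = {'User-Agent': random.choice(USER_AGENTS)}
        html = fetch(url, headers)
        if 'captcha' not in html.lower():
            return html
        if attempt < max_tries - 1:
            # Wait longer after each detection: 1x, 2x, 4x, ... the base delay.
            time.sleep(base_delay_sec * 2 ** attempt)
    raise RuntimeError('Still blocked after {} tries: {}'.format(max_tries, url))
```

Raising after `max_tries` lets the caller decide what to do next (wait longer, switch network, stop) instead of hammering the site indefinitely.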

philipperemy commented 5 years ago

Excellent! Happy it could work out well for you. I'm using ExpressVPN when it happens but it requires a subscription. Nice trick!

stefantrinh1 commented 4 years ago

I have tried this, but I keep getting these errors:

ssl.SSLCertVerificationError: [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: unable to get local issuer certificate (_ssl.c:1108)

and

urllib.error.URLError: <urlopen error [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: unable to get local issuer certificate (_ssl.c:1108)>

and

raise FakeUserAgentError('Maximum amount of retries reached')

fake_useragent.errors.FakeUserAgentError: Maximum amount of retries reached

philipperemy commented 4 years ago

@stefantrinh1 hmm, that does not sound good. Check your internet connection and that everything is working properly. Run on Python 3 and re-install your dependencies.
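The traceback suggests that fake_useragent failed while downloading its browser data over HTTPS (the `URLError` comes from urllib), which older releases of the library appear to do on start-up. One possible workaround, sketched here as an assumption rather than a confirmed fix, is to fall back to a small hand-maintained list when the library cannot be used. `FALLBACK_USER_AGENTS` and `pick_user_agent` are hypothetical names, not part of the repository or of fake_useragent.

```python
import random

# Hypothetical fallback pool, used only when fake_useragent is unavailable.
FALLBACK_USER_AGENTS = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 '
    '(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 '
    '(KHTML, like Gecko) Version/17.1 Safari/605.1.15',
]


def pick_user_agent():
    """Return a random User-Agent string, preferring fake_useragent."""
    try:
        from fake_useragent import UserAgent
        return UserAgent().random
    except Exception:
        # Covers a missing package as well as FakeUserAgentError and SSL failures.
        return random.choice(FALLBACK_USER_AGENTS)
```

With this in place, `get_soup_retry` could call `pick_user_agent()` wherever it currently reads `ua.random`.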