hellock / icrawler

A multi-thread crawler framework with many builtin image crawlers provided.
http://icrawler.readthedocs.io/en/latest/
MIT License

Error when using GoogleImageCrawler #40

Closed elliotchencv closed 6 years ago

elliotchencv commented 6 years ago

Hi icrawler developers, I've tried BingImageCrawler and GoogleImageCrawler. BingImageCrawler worked; however, the following sample code for GoogleImageCrawler didn't. Can anyone help?

from icrawler.builtin import GoogleImageCrawler
google_crawler = GoogleImageCrawler(parser_threads=2, downloader_threads=4,
                                    storage={'root_dir': 'dog'})
google_crawler.crawl(keyword='dog', max_num=1000,
                     date_min=None, date_max=None,
                     min_size=(10,10), max_size=None)

All I received was

2018-01-10 19:22:48,838 - INFO - icrawler.crawler - start crawling...
2018-01-10 19:22:48,839 - INFO - icrawler.crawler - starting 1 feeder threads...
2018-01-10 19:22:48,844 - INFO - feeder - thread feeder-001 exit
2018-01-10 19:22:48,844 - INFO - icrawler.crawler - starting 2 parser threads...
2018-01-10 19:22:48,849 - INFO - icrawler.crawler - starting 4 downloader threads...
2018-01-10 19:22:48,882 - ERROR - parser - Exception caught when fetching page https://www.google.com/search?q=dog&ijn=0&start=0&tbs=cdr%3A1%2Ccd_min%3A%2Ccd_max%3A%2Csur%3A%2Citp%3A%2Cic%3A%2Cisc%3A&tbm=isch&lr=, error: HTTPSConnectionPool(host='www.google.com', port=443): Max retries exceeded with url: /search?q=dog&ijn=0&start=0&tbs=cdr%3A1%2Ccd_min%3A%2Ccd_max%3A%2Csur%3A%2Citp%3A%2Cic%3A%2Cisc%3A&tbm=isch&lr= (Caused by SSLError(SSLError("bad handshake: Error([('SSL routines', 'ssl3_get_server_certificate', 'certificate verify failed')],)",),)), remaining retry times: 2
2018-01-10 19:22:48,913 - ERROR - parser - Exception caught when fetching page https://www.google.com/search?q=dog&ijn=0&start=0&tbs=cdr%3A1%2Ccd_min%3A%2Ccd_max%3A%2Csur%3A%2Citp%3A%2Cic%3A%2Cisc%3A&tbm=isch&lr=, error: HTTPSConnectionPool(host='www.google.com', port=443): Max retries exceeded with url: /search?q=dog&ijn=0&start=0&tbs=cdr%3A1%2Ccd_min%3A%2Ccd_max%3A%2Csur%3A%2Citp%3A%2Cic%3A%2Cisc%3A&tbm=isch&lr= (Caused by SSLError(SSLError("bad handshake: Error([('SSL routines', 'ssl3_get_server_certificate', 'certificate verify failed')],)",),)), remaining retry times: 1
2018-01-10 19:22:48,940 - ERROR - parser - Exception caught when fetching page https://www.google.com/search?q=dog&ijn=0&start=0&tbs=cdr%3A1%2Ccd_min%3A%2Ccd_max%3A%2Csur%3A%2Citp%3A%2Cic%3A%2Cisc%3A&tbm=isch&lr=, error: HTTPSConnectionPool(host='www.google.com', port=443): Max retries exceeded with url: /search?q=dog&ijn=0&start=0&tbs=cdr%3A1%2Ccd_min%3A%2Ccd_max%3A%2Csur%3A%2Citp%3A%2Cic%3A%2Cisc%3A&tbm=isch&lr= (Caused by SSLError(SSLError("bad handshake: Error([('SSL routines', 'ssl3_get_server_certificate', 'certificate verify failed')],)",),)), remaining retry times: 0
2018-01-10 19:22:50,850 - INFO - parser - no more page urls for thread parser-002 to parse
2018-01-10 19:22:50,850 - INFO - parser - thread parser-002 exit
2018-01-10 19:22:50,943 - INFO - parser - no more page urls for thread parser-001 to parse
2018-01-10 19:22:50,943 - INFO - parser - thread parser-001 exit
2018-01-10 19:22:53,851 - INFO - downloader - no more download task for thread downloader-001
2018-01-10 19:22:53,852 - INFO - downloader - no more download task for thread downloader-002
2018-01-10 19:22:53,852 - INFO - downloader - no more download task for thread downloader-004
2018-01-10 19:22:53,852 - INFO - downloader - no more download task for thread downloader-003
2018-01-10 19:23:00,710 - INFO - downloader - thread downloader-001 exit
2018-01-10 19:23:00,714 - INFO - downloader - thread downloader-002 exit
2018-01-10 19:23:00,715 - INFO - downloader - thread downloader-004 exit
2018-01-10 19:23:00,715 - INFO - downloader - thread downloader-003 exit
2018-01-10 19:23:00,859 - INFO - icrawler.crawler - Crawling task done!

hellock commented 6 years ago

The certificate validation problem is caused by requests, not icrawler itself. See Stack Overflow for more info.
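For context, requests verifies HTTPS certificates against a CA bundle (normally the one shipped with certifi), and a "certificate verify failed" handshake error often means that bundle is stale or the connection is being intercepted. Below is a rough sketch, not from this thread, of how you might check that outside icrawler; it assumes certifi is installed, and the URL and timeout are just placeholders:

import certifi
import requests

url = 'https://www.google.com/search?q=dog&tbm=isch'

try:
    # Default behaviour: verify against the CA bundle requests picked up at install time
    requests.get(url, timeout=5)
    print('default CA bundle works')
except requests.exceptions.SSLError as err:
    print('default CA bundle failed:', err)

# Explicitly verify against certifi's bundle; if this succeeds, upgrading
# certifi (pip install -U certifi) will usually fix the crawler as well.
resp = requests.get(url, timeout=5, verify=certifi.where())
print('certifi bundle works:', resp.status_code)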

Here is a workaround.

from icrawler.builtin import GoogleImageCrawler
google_crawler = GoogleImageCrawler(parser_threads=2, downloader_threads=4,
                                    storage={'root_dir': 'dog'})
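# Tell the crawler's underlying requests session to skip SSL certificate verification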
google_crawler.session.verify = False
google_crawler.crawl(keyword='dog', max_num=1000,
                     date_min=None, date_max=None,
                     min_size=(10,10), max_size=None)
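Note that verify = False turns off certificate checking for every request the crawler's session makes, so it is best treated as a stop-gap (updating certifi or the local CA bundle is the safer fix). With verification off, requests/urllib3 will also print an InsecureRequestWarning for each request; a small optional addition, not part of the workaround above, to silence it:

import requests
from requests.packages.urllib3.exceptions import InsecureRequestWarning

# Suppress the InsecureRequestWarning emitted when verify is disabled
requests.packages.urllib3.disable_warnings(InsecureRequestWarning)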
hellock commented 6 years ago

#37

elliotchencv commented 6 years ago

Hi @hellock, your solution works. Thank you so much!