c4tz closed this issue 7 years ago
Hi @BlkChockr, glad to hear that you like it. Inheriting from the GoogleImageCrawler class should be ok.
from icrawler import ImageDownloader, Crawler
from icrawler.builtin.google import GoogleImageCrawler, GoogleFeeder, GoogleParser


class AdvancedDownloader(ImageDownloader):

    def download(self, task, default_ext, timeout=5, max_retry=3, **kwargs):
        print(task['file_url'])
        ImageDownloader.download(task, default_ext, timeout, max_retry, **kwargs)


class MyCrawler(GoogleImageCrawler):

    def __init__(self, *args, **kwargs):
        Crawler.__init__(self,
                         feeder_cls=GoogleFeeder,
                         parser_cls=GoogleParser,
                         downloader_cls=AdvancedDownloader,
                         *args,
                         **kwargs)
Note that inside MyCrawler.__init__ you should call the base class Crawler.__init__ instead of GoogleImageCrawler.__init__, because GoogleImageCrawler does not accept the downloader_cls argument.
icrawler was updated to v0.3 yesterday with new features and a re-implementation of some modules, so there are some API changes. For example, you now need to call it like this:
google_crawler = GoogleImageCrawler(storage={'root_dir': '/home/user/Downloads/test'},
                                    feeder_threads=2, parser_threads=2, downloader_threads=8)
google_crawler.crawl(keyword='Duck', offset=0, max_num=100,
                     date_min=None, date_max=None,
                     min_size=(200, 200), max_size=None)
Thank you for the fast response! I just tested what you posted and I got the following Exception:
Exception in thread downloader-001:
Traceback (most recent call last):
  File "/usr/lib/python3.5/threading.py", line 914, in _bootstrap_inner
    self.run()
  File "/usr/lib/python3.5/threading.py", line 862, in run
    self._target(*self._args, **self._kwargs)
  File "/usr/lib/python3.5/site-packages/icrawler/downloader.py", line 230, in worker_exec
    max_num, default_ext, queue_timeout, req_timeout, **kwargs)
  File "/usr/lib/python3.5/site-packages/icrawler/downloader.py", line 194, in worker_exec
    self.download(task, default_ext, req_timeout, **kwargs)
  File "/home/user/projects/ExtendedDowloader.py", line 10, in download
    ImageDownloader.download(task, default_ext, timeout, max_retry, **kwargs)
  File "/usr/lib/python3.5/site-packages/icrawler/downloader.py", line 106, in download
    file_url = task['file_url']
TypeError: string indices must be integers
Just to be sure, here are the files:
from icrawler import ImageDownloader, Crawler
from icrawler.builtin.google import GoogleImageCrawler, GoogleFeeder, GoogleParser


class AdvancedDownloader(ImageDownloader):

    def download(self, task, default_ext, timeout=5, max_retry=3, **kwargs):
        print(task['file_url'])
        ImageDownloader.download(task, default_ext, timeout, max_retry, **kwargs)


class AdvancedImageCrawler(GoogleImageCrawler):

    def __init__(self, *args, **kwargs):
        Crawler.__init__(self,
                         feeder_cls=GoogleFeeder,
                         parser_cls=GoogleParser,
                         downloader_cls=AdvancedDownloader,
                         *args,
                         **kwargs)
from icrawler.builtin import GoogleImageCrawler
from ExtendedDowloader import AdvancedImageCrawler

google_crawler = AdvancedImageCrawler(storage={'root_dir': '/home/user/Downloads/test'},
                                      feeder_threads=2, parser_threads=2, downloader_threads=8)
google_crawler.crawl(keyword='Duck', offset=0, max_num=100,
                     date_min=None, date_max=None,
                     min_size=(200, 200), max_size=None)
Hi, the first argument self is missing in the ImageDownloader.download(...) call. Changing the method AdvancedDownloader.download() to
def download(self, task, default_ext, timeout=5, max_retry=3, **kwargs):
    print(task['file_url'])
    ImageDownloader.download(self, task, default_ext, timeout, max_retry, **kwargs)
or
def download(self, task, default_ext, timeout=5, max_retry=3, **kwargs):
    print(task['file_url'])
    super(AdvancedDownloader, self).download(task, default_ext, timeout, max_retry, **kwargs)
should work.
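For context on why the missing self shows up as a TypeError about string indices rather than a missing-argument error: calling the method through the class shifts every argument by one, so the task dict is bound to self and the string default_ext is bound to task, and task['file_url'] then indexes a string. A minimal standalone reproduction, not part of icrawler and with illustrative names:

class Demo(object):
    def show(self, task):
        # Expects task to be a dict like {'file_url': ...}
        return task['file_url']

# Calling through the class without passing an instance shifts the arguments:
# the dict becomes `self` and the string 'jpg' becomes `task`.
Demo.show({'file_url': 'http://example.com/a.jpg'}, 'jpg')
# -> TypeError: string indices must be integers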
Ah, I must be blind! Thank you again, it seems to work now, so feel free to close this issue :)
Hi there,
first of all: keep up the good work! I really like icrawler. :)
I'm currently trying to use the GoogleImageCrawler, but want to substitute the downloader class. I tried overriding it like this:
and calling it like this:
Result:
I also tried overriding the whole GoogleImageCrawler class, but the same error came up. So, how do I do it?
Background: I want to use multiple Crawlers (Bing, Baidu, Google) and want to check if I already downloaded the exact same URL (and maybe also check md5) to avoid duplicates.
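A rough sketch of that deduplication idea, based only on the download() signature shown above: keep a thread-safe registry of URLs (and optionally MD5 hashes) shared by the downloaders of all crawlers, and skip any task whose URL was already handled. SeenRegistry, the module-level SEEN instance, and DedupDownloader are illustrative names, not part of icrawler; the MD5 helper is included but not wired in, since hashing the image bytes would require hooking into wherever the downloaded content is available.

import hashlib
import threading

from icrawler import ImageDownloader


class SeenRegistry(object):
    """Thread-safe record of URLs (and optional content hashes) already handled."""

    def __init__(self):
        self._lock = threading.Lock()
        self._urls = set()
        self._md5s = set()

    def url_seen(self, url):
        # Returns True if the URL was seen before; records it otherwise.
        with self._lock:
            if url in self._urls:
                return True
            self._urls.add(url)
            return False

    def md5_seen(self, content):
        # Same idea for raw image bytes; call this wherever the bytes are available.
        digest = hashlib.md5(content).hexdigest()
        with self._lock:
            if digest in self._md5s:
                return True
            self._md5s.add(digest)
            return False


# One registry shared by the downloaders of all crawlers (Google, Bing, Baidu).
SEEN = SeenRegistry()


class DedupDownloader(ImageDownloader):

    def download(self, task, default_ext, timeout=5, max_retry=3, **kwargs):
        # Skip URLs that any crawler has already queued or downloaded.
        if SEEN.url_seen(task['file_url']):
            return
        ImageDownloader.download(self, task, default_ext, timeout, max_retry, **kwargs)

Passing DedupDownloader as downloader_cls in each crawler's __init__ (as in AdvancedImageCrawler above) would then filter duplicate URLs across the Google, Bing and Baidu runs within the same process.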