hellock / icrawler

A multi-threaded crawler framework with many built-in image crawlers provided.
http://icrawler.readthedocs.io/en/latest/
MIT License

Trying to use other downloader_cls #12

Closed. c4tz closed this issue 7 years ago.

c4tz commented 7 years ago

Hi there,

First of all: keep up the good work! I really like icrawler. :)

I'm currently trying to use the GoogleImageCrawler, but want to substitute the downloader class. I tried overriding it like this:

from icrawler import ImageDownloader

class AdvancedDownloader(ImageDownloader):
    def download(self, task, default_ext, timeout=5, max_retry=3, **kwargs):
        print(task['file_url'])
        ImageDownloader.download(task, default_ext, timeout=5, max_retry=3, **kwargs)

calling it like this:

from icrawler.builtin import GoogleImageCrawler
from ExtendedDowloader import AdvancedDownloader

google_crawler = GoogleImageCrawler('/home/user/Downloads/test', downloader_cls=AdvancedDownloader)
google_crawler.crawl(keyword='Duck', offset=0, max_num=100,
    date_min=None, date_max=None, feeder_thr_num=2,
    parser_thr_num=2, downloader_thr_num=8,
    min_size=(200,200), max_size=None)

Result:

Traceback (most recent call last):
  File "/home/user/projects/test.py", line 4, in <module>
    google_crawler = GoogleImageCrawler('/home/user/Downloads/test', downloader_cls=AdvancedDownloader)
  File "/usr/lib/python3.5/site-packages/icrawler/builtin/google.py", line 45, in __init__
    **kwargs)
TypeError: __init__() got multiple values for keyword argument 'downloader_cls'

I also tried overriding the whole GoogleImageCrawler class, but the same error came up.

So, how do I do it?

Background: I want to use multiple crawlers (Bing, Baidu, Google) and check whether I have already downloaded the exact same URL (and maybe also compare md5 checksums) to avoid duplicates.
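
Something along these lines is what I have in mind (the class and attribute names are just illustrative; the md5 comparison would also need to hash the downloaded bytes, which this sketch leaves out):

import threading

from icrawler import ImageDownloader

class DedupDownloader(ImageDownloader):
    """Skip file URLs that any crawler sharing this class has already seen."""

    _seen_urls = set()        # class-level, shared by Bing/Baidu/Google crawlers
    _lock = threading.Lock()  # download() runs in several downloader threads

    def download(self, task, default_ext, timeout=5, max_retry=3, **kwargs):
        url = task['file_url']
        with self._lock:
            if url in self._seen_urls:
                return  # exact same URL already downloaded, skip it
            self._seen_urls.add(url)
        ImageDownloader.download(self, task, default_ext, timeout,
                                 max_retry, **kwargs)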

hellock commented 7 years ago

Hi @BlkChockr, glad to hear that you like it. Inheriting from the GoogleImageCrawler class should be OK.

from icrawler import ImageDownloader, Crawler
from icrawler.builtin.google import GoogleImageCrawler, GoogleFeeder, GoogleParser

class AdvancedDownloader(ImageDownloader):
    def download(self, task, default_ext, timeout=5, max_retry=3, **kwargs):
        print(task['file_url'])
        ImageDownloader.download(task, default_ext, timeout, max_retry, **kwargs)

class MyCrawler(GoogleImageCrawler):
    def __init__(self, *args, **kwargs):
        Crawler.__init__(self,
            feeder_cls=GoogleFeeder,
            parser_cls=GoogleParser,
            downloader_cls=AdvancedDownloader,
            *args,
            **kwargs)

Note that when initializing MyCrawler, you should call the base class Crawler's __init__ instead of GoogleImageCrawler's, because GoogleImageCrawler.__init__ does not accept a downloader_cls argument.
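
For context, the original TypeError occurred because GoogleImageCrawler.__init__ already supplies downloader_cls when calling the base class, roughly like this (a simplified sketch, not the actual icrawler source):

class GoogleImageCrawler(Crawler):
    def __init__(self, *args, **kwargs):
        # downloader_cls is already fixed here, so passing it again as a
        # keyword argument raises "got multiple values for keyword
        # argument 'downloader_cls'".
        super(GoogleImageCrawler, self).__init__(
            feeder_cls=GoogleFeeder,
            parser_cls=GoogleParser,
            downloader_cls=ImageDownloader,
            *args,
            **kwargs)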

hellock commented 7 years ago

icrawler was updated to v0.3 yesterday with new features and re-implementations of some modules, so there are some API changes. For example, you now need to call it like this:

google_crawler = GoogleImageCrawler(storage={'root_dir': '/home/user/Downloads/test'},
                                    feeder_threads=2, parser_threads=2, downloader_threads=8)
google_crawler.crawl(keyword='Duck', offset=0, max_num=100,
                     date_min=None, date_max=None,
                     min_size=(200,200), max_size=None)

c4tz commented 7 years ago

Thank you for the fast response! I just tested what you posted and got the following exception:

Exception in thread downloader-001:
Traceback (most recent call last):
  File "/usr/lib/python3.5/threading.py", line 914, in _bootstrap_inner
    self.run()
  File "/usr/lib/python3.5/threading.py", line 862, in run
    self._target(*self._args, **self._kwargs)
  File "/usr/lib/python3.5/site-packages/icrawler/downloader.py", line 230, in worker_exec
    max_num, default_ext, queue_timeout, req_timeout, **kwargs)
  File "/usr/lib/python3.5/site-packages/icrawler/downloader.py", line 194, in worker_exec
    self.download(task, default_ext, req_timeout, **kwargs)
  File "/home/user/projects/ExtendedDowloader.py", line 10, in download
    ImageDownloader.download(task, default_ext, timeout, max_retry, **kwargs)
  File "/usr/lib/python3.5/site-packages/icrawler/downloader.py", line 106, in download
    file_url = task['file_url']
TypeError: string indices must be integers

Just to be sure, here are the files:

# ExtendedDowloader.py
from icrawler import ImageDownloader, Crawler
from icrawler.builtin.google import GoogleImageCrawler, GoogleFeeder, GoogleParser

class AdvancedDownloader(ImageDownloader):

    def download(self, task, default_ext, timeout=5, max_retry=3, **kwargs):
        print(task['file_url'])
        ImageDownloader.download(task, default_ext, timeout, max_retry, **kwargs)

class AdvancedImageCrawler(GoogleImageCrawler):
    def __init__(self, *args, **kwargs):
        Crawler.__init__(self,
            feeder_cls=GoogleFeeder,
            parser_cls=GoogleParser,
            downloader_cls=AdvancedDownloader,
            *args,
            **kwargs)

# test.py
from icrawler.builtin import GoogleImageCrawler
from ExtendedDowloader import AdvancedImageCrawler

google_crawler = AdvancedImageCrawler(storage={'root_dir': '/home/user/Downloads/test'},
                                    feeder_threads=2, parser_threads=2, downloader_threads=8)
google_crawler.crawl(keyword='Duck', offset=0, max_num=100,
                     date_min=None, date_max=None,
                     min_size=(200,200), max_size=None)

hellock commented 7 years ago

Hi, the first argument self is missing. Because the method is called on the class rather than on an instance, task gets bound to self and the string default_ext becomes task, so task['file_url'] indexes a string, which is why you see "string indices must be integers". Changing the method AdvancedDownloader.download() to

def download(self, task, default_ext, timeout=5, max_retry=3, **kwargs):
    print(task['file_url'])
    ImageDownloader.download(self, task, default_ext, timeout, max_retry, **kwargs)

or

def download(self, task, default_ext, timeout=5, max_retry=3, **kwargs):
    print(task['file_url'])
    super(AdvancedDownloader, self).download(task, default_ext, timeout, max_retry, **kwargs)

should work.
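
Since the tracebacks show Python 3.5, the zero-argument form of super() also works:

def download(self, task, default_ext, timeout=5, max_retry=3, **kwargs):
    print(task['file_url'])
    super().download(task, default_ext, timeout, max_retry, **kwargs)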

c4tz commented 7 years ago

Ah, I must be blind! Thank you again, it seems to work now, so feel free to close this issue :)