hellock / icrawler

A multi-threaded crawler framework with many built-in image crawlers.
http://icrawler.readthedocs.io/en/latest/
MIT License

How to save url in json file / not download image file #56

Closed · C-YooJin closed this issue 5 years ago

C-YooJin commented 5 years ago

Hello! Thank you very much for making icrawler. I'm finding the library very useful. :) I have a question: I don't want to download the images themselves, I just want to get their URLs. I wrote my code based on #34, which was posted earlier. This is my code.

import base64
from collections import OrderedDict

from icrawler import ImageDownloader
from icrawler.builtin import GoogleImageCrawler
from six.moves.urllib.parse import urlparse

class MyImageDownloader(ImageDownloader):

    def get_filename(self, task, default_ext):
        # Collect the image URL together with the filename generated for it.
        url_real = OrderedDict()
        url_path = urlparse(task['file_url'])[2]
        url_real['url'] = task['file_url']
        # Keep the URL's extension only if it is a known image type.
        if '.' in url_path:
            extension = url_path.split('.')[-1]
            if extension.lower() not in [
                    'jpg', 'jpeg', 'png', 'bmp', 'tiff', 'gif', 'ppm', 'pgm'
            ]:
                extension = default_ext
        else:
            extension = default_ext
        # Base64-encode the URL path so the URL can be recovered from the filename.
        filename = base64.b64encode(url_path.encode()).decode()
        url_real['file_name'] = '{}.{}'.format(filename, extension)
        print(url_real)
        return '{}.{}'.format(filename, extension)

def get_json(keyword, save, num):
    # The default ImageDownloader.download() still fetches every image,
    # so files end up under root_dir as usual.
    google_crawler = GoogleImageCrawler(
        downloader_cls=MyImageDownloader,
        downloader_threads=4,
        storage={'root_dir': save})
    google_crawler.crawl(keyword=keyword, max_num=num)

get_json('sugar glider', '/Users/user/Downloads/url_test', 1000)
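
Since get_filename() stores the base64-encoded URL path as the filename, the original path can be recovered from any saved filename later, e.g. with a small helper like this (untested; path_from_filename is just a name I made up):

import base64

def path_from_filename(filename):
    # Strip the extension, then reverse the base64 encoding from get_filename().
    encoded = filename.rsplit('.', 1)[0]
    return base64.b64decode(encoded.encode()).decode()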

Running this code still saves the images to the directory, but I don't need the images. Is there a good way to avoid that? In short, I want to save only the url and filename in a JSON file!

mattall commented 5 years ago

Hi C-YooJin, I want to do the same thing as you. Did you find a solution? Thanks! - Matt
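
One way to get this behaviour is to override the downloader's download() hook so that nothing is fetched and each URL/filename pair is appended as one JSON object per line instead. This is a minimal, untested sketch, not an official icrawler recipe: the UrlCollector name and the urls.json path are made up, and the download() signature is an assumption based on ImageDownloader.

import json
import threading

from icrawler import ImageDownloader
from icrawler.builtin import GoogleImageCrawler

class UrlCollector(ImageDownloader):
    # GoogleImageCrawler runs several downloader threads, so guard the output file.
    lock = threading.Lock()

    def download(self, task, default_ext, timeout=5, max_retry=3, **kwargs):
        # Skip the HTTP fetch entirely; just record the URL and the
        # filename that get_filename() would have assigned to it.
        filename = self.get_filename(task, default_ext)
        entry = {'url': task['file_url'], 'file_name': filename}
        with self.lock:
            with open('urls.json', 'a') as f:
                f.write(json.dumps(entry) + '\n')
        task['success'] = True
        task['filename'] = filename

google_crawler = GoogleImageCrawler(
    downloader_cls=UrlCollector,
    downloader_threads=4,
    storage={'root_dir': 'url_test'})  # still required, but stays empty
google_crawler.crawl(keyword='sugar glider', max_num=100)

Each line of urls.json is then an independent JSON object, so the results can be read back with json.loads() line by line.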