Hello! Thank you very much for making icrawler. I'm finding the library very useful. :)
I have a question. I don't want to download the images themselves; I just want to get their URLs. I wrote the code below with reference to #34, which was posted earlier. This is my code.
import base64
from collections import OrderedDict

from icrawler import ImageDownloader
from icrawler.builtin import GoogleImageCrawler
from six.moves.urllib.parse import urlparse


class MyImageDownloader(ImageDownloader):

    def get_filename(self, task, default_ext):
        url_real = OrderedDict()
        url_path = urlparse(task['file_url'])[2]
        # print(task['file_url'])
        url_real['url'] = task['file_url']
        # print(url_real)
        if '.' in url_path:
            extension = url_path.split('.')[-1]
            if extension.lower() not in [
                    'jpg', 'jpeg', 'png', 'bmp', 'tiff', 'gif', 'ppm', 'pgm'
            ]:
                extension = default_ext
        else:
            extension = default_ext
        filename = base64.b64encode(url_path.encode()).decode()
        url_real['file_name'] = '{}.{}'.format(filename, extension)
        print(url_real)
        return '{}.{}'.format(filename, extension)


def get_json(keyword, save, num):
    google_crawler = GoogleImageCrawler(
        downloader_cls=MyImageDownloader,
        downloader_threads=4,
        storage={'root_dir': save})
    google_crawler.crawl(keyword=keyword, max_num=num)


get_json('sugar glider', '/Users/user/Downloads/url_test', 1000)
Running this code still saves the images to the directory, but I don't need the images. Is there a good way to avoid this? In short, I want to save the URL and filename of each image in a JSON file!
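To make the goal concrete, here is a rough sketch of the output I have in mind. It uses only the standard library and does not touch icrawler at all. The names make_record and save_records are hypothetical helpers I made up for illustration, and the naming rule is copied from get_filename above. What I don't know is where in the downloader something like this should be called so that the image itself is never fetched.

```python
import base64
import json
from urllib.parse import urlparse

# Same extension whitelist as in get_filename above.
IMAGE_EXTS = ['jpg', 'jpeg', 'png', 'bmp', 'tiff', 'gif', 'ppm', 'pgm']


def make_record(file_url, default_ext='jpg'):
    """Hypothetical helper: build one {'url', 'file_name'} record.

    No network access happens here; only the URL string is inspected.
    """
    url_path = urlparse(file_url).path
    if '.' in url_path:
        extension = url_path.split('.')[-1]
        if extension.lower() not in IMAGE_EXTS:
            extension = default_ext
    else:
        extension = default_ext
    filename = base64.b64encode(url_path.encode()).decode()
    return {'url': file_url,
            'file_name': '{}.{}'.format(filename, extension)}


def save_records(file_urls, json_path):
    """Hypothetical helper: write one record per URL to a JSON file."""
    records = [make_record(url) for url in file_urls]
    with open(json_path, 'w') as f:
        json.dump(records, f, indent=2)
    return records
```

For example, save_records(['http://example.com/a/b.png'], 'urls.json') would write a list with a single record whose 'url' is the input URL and whose 'file_name' ends in '.png'.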