hellock / icrawler

A multi-thread crawler framework with many builtin image crawlers provided.
http://icrawler.readthedocs.io/en/latest/
MIT License

Are date_min and date_max for Google Image Search working? #32

Closed DeepVoltaire closed 7 years ago

DeepVoltaire commented 7 years ago

Hi, thanks for the great package!

I am trying to get the minimum and maximum date working for Google Image Search. My minimal example

from datetime import date

from icrawler.builtin import GoogleImageCrawler


def scrape():
    date_min = date(1985, 1, 1)
    date_max = date(1990, 1, 1)
    # labels.txt holds one search keyword per line.
    with open("labels.txt", mode="r") as f:
        for line in f:
            label = line.strip()
            if not label:
                continue  # skip blank lines
            google_crawler = GoogleImageCrawler(
                parser_threads=4, downloader_threads=20,
                storage={'root_dir': 'test/{}'.format(label)})
            google_crawler.crawl(keyword=label, max_num=1000,
                                 date_min=date_min, date_max=date_max)


if __name__ == "__main__":
    scrape()

is not working as expected. It should download no images, but is downloading images from all time periods. Am I doing something wrong?

Thanks for your help!

hellock commented 7 years ago

This bug was introduced when adding support for filtering image usage rights, and it is fixed now. You can try the latest version.

DeepVoltaire commented 7 years ago

Thanks, it works now! I think there is still a small bug: I get 104 results when searching from date(2011, 1, 1) to date(2011, 2, 1), but 457 from date(2011, 1, 1) to date(2011, 1, 2). It should be the other way around, I think, because Google seems to read the dates in American notation (%m/%d/%Y), so the day and month get swapped. So in google.py, instead of:

cd_min = date_min.strftime('%d/%m/%Y') if date_min else ''
cd_max = date_max.strftime('%d/%m/%Y') if date_max else ''

it should be:

cd_min = date_min.strftime('%m/%d/%Y') if date_min else ''
cd_max = date_max.strftime('%m/%d/%Y') if date_max else ''

I created a pull request.
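To illustrate why the two searches came back swapped, here is a minimal sketch using only the standard library. The helper name format_range is hypothetical and not part of icrawler, and the assumption that Google reads these strings month-first is the conclusion of this thread, not something checked here.

```python
from datetime import date


def format_range(date_min, date_max, fmt):
    """Format a date range the same way as the snippet from google.py.

    Hypothetical helper for illustration only; not part of icrawler.
    """
    cd_min = date_min.strftime(fmt) if date_min else ''
    cd_max = date_max.strftime(fmt) if date_max else ''
    return cd_min, cd_max


# A one-day range (Jan 1 to Jan 2, 2011), formatted day-first.
day_first = format_range(date(2011, 1, 1), date(2011, 1, 2), '%d/%m/%Y')

# A one-month range (Jan 1 to Feb 1, 2011), formatted month-first.
month_first = format_range(date(2011, 1, 1), date(2011, 2, 1), '%m/%d/%Y')

print(day_first)    # ('01/01/2011', '02/01/2011')
print(month_first)  # ('01/01/2011', '02/01/2011')
```

Both calls produce the same pair of strings, so a one-day range formatted day-first looks exactly like a one-month range to a reader that expects month-first. That would explain why the shorter range returned more results than the longer one.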

hellock commented 7 years ago

Thanks a lot for figuring this out!

hellock commented 7 years ago

Date format fixed in #33