hellock / icrawler

A multi-threaded crawler framework with many built-in image crawlers.
http://icrawler.readthedocs.io/en/latest/
MIT License

Google Crawler is down #65

Closed: kanjieater closed this issue 4 years ago

kanjieater commented 4 years ago

I've been using icrawler's google crawler for a while, but it seems like it's down now. Can we get the library updated?

beihangwenbin commented 4 years ago

Google recently changed the way they present image data, so the links were no longer being scraped. The new parser code is:

from bs4 import BeautifulSoup

from icrawler import Parser


class GoogleParser(Parser):

    def parse(self, response):
        soup = BeautifulSoup(response.content.decode('utf-8', 'ignore'), 'lxml')
        # Thumbnail <img> tags currently carry this autogenerated class
        image_divs = soup.find_all('img', class_='rg_i Q4LuWd tx8vtf')
        print(len(image_divs))  # debug: number of thumbnails found
        for meta in image_divs:
            # The image URL is stored in the data-iurl attribute
            url = meta.get('data-iurl')
            if url:
                yield dict(file_url=url)

However, it only works for the first 20 images. I don't know how to handle pagination to get more.
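
For anyone who wants to try this, here is a minimal sketch of wiring the replacement parser into the crawler (parser_cls is icrawler's standard override hook; the keyword and output directory are placeholders):

from icrawler.builtin import GoogleImageCrawler

# Assumes the GoogleParser class defined above is in scope
google_crawler = GoogleImageCrawler(
    parser_cls=GoogleParser,
    storage={'root_dir': 'images'})
google_crawler.crawl(keyword='cat', max_num=20)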

kanjieater commented 4 years ago

Thanks for that. It's an OK band-aid for my current situation. The language setting from #28 is now broken as well.

class_='rg_i Q4LuWd tx8vtf' appears to be an autogenerated class, which makes this a highly fragile selection method that could break again at any time.
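
If the autogenerated class is the worry, one less fragile alternative (a sketch, untested against the current markup) is to select by the data-iurl attribute itself rather than by class names:

# Match any <img> that carries a data-iurl attribute,
# regardless of its autogenerated class names
image_divs = soup.find_all('img', attrs={'data-iurl': True})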

MeekoI commented 4 years ago

I encountered the same problem.

mueller91 commented 4 years ago

same problem here

Skyquek commented 4 years ago

same problem here

hellock commented 4 years ago

Thanks for all your reports. I found that Google image search has changed its API, so the parser needs to be adapted. However, I have been quite busy recently and may not have enough time to handle this.

marioguima commented 4 years ago

@hellock, congratulations on your excellent work on this Google crawler.

I wish I could help. If you guide me through the code, I believe I can try.

I noticed that the min_size parameter doesn't work anymore:

google_crawler.crawl(keyword=key, filters=filters, min_size=(1200, 600), max_num=maxImages)

It works without it, but downloads only the image thumbnails:

google_crawler.crawl(keyword=key, filters=filters, max_num=maxImages)

Do you have any idea where to start? I would like to help.
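
A likely explanation, assuming the patched parser above is in use (not verified): it yields thumbnail URLs, and the thumbnails are smaller than (1200, 600), so the min_size check in the downloader rejects every file. Requesting larger source images through the documented size filter is a partial workaround, though full-resolution downloads still depend on fixing the parser:

from icrawler.builtin import GoogleImageCrawler

google_crawler = GoogleImageCrawler(storage={'root_dir': 'images'})
# 'size' is a documented Google filter option; it constrains the
# search query itself instead of post-filtering downloaded files
filters = dict(size='large')
google_crawler.crawl(keyword='cat', filters=filters, max_num=50)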

pasqLisena commented 4 years ago

Pull request for this issue!

Luke256 commented 4 years ago

I'm using icrawler, but I have a problem: the pictures I chose weren't downloaded to my machine. Could you tell me how to solve this?

huangyangyu commented 4 years ago

The following code works for me. Hope it helps you.

import re

from bs4 import BeautifulSoup

from icrawler import Parser


class GoogleParser(Parser):

    def parse(self, response):
        soup = BeautifulSoup(
            response.content.decode('utf-8', 'ignore'), 'lxml')
        # The image data now lives in inline <script> blocks instead of
        # <img> attributes
        image_divs = soup.find_all(name='script')
        for div in image_divs:
            # Use str(div) rather than div.text, which no longer returns
            # <script> contents in newer beautifulsoup4 versions
            txt = str(div)
            # Keep only the AF_initDataCallback block holding the results
            if 'AF_initDataCallback' not in txt:
                continue
            if 'ds:0' in txt or 'ds:1' not in txt:
                continue
            # Instead of parsing the JSON payload, regex out anything
            # that looks like an image URL
            uris = re.findall(r'http.*?\.(?:jpg|png|bmp)', txt)
            return [{'file_url': uri} for uri in uris]
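
One caveat worth noting (an assumption about the payload, not verified here): the URLs extracted this way come from JavaScript string literals and can contain escape sequences such as \u003d, so a decoding step may be needed before downloading:

# Hypothetical post-processing step: decode JS escape sequences
# (e.g. \u003d for '=') left in the regex-extracted URLs
uris = [u.encode('utf-8').decode('unicode_escape') for u in uris]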
knappmk commented 4 years ago

The changes committed with 47c1f4f for GoogleParser do not work with the newest beautifulsoup4 (4.9), since .text on a script tag now always returns an empty string. I would suggest changing that line to something like txt = "" if div.string is None else div.string.
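
For reference, a minimal sketch of that suggestion in the context of the loop shown above:

for div in image_divs:
    # bs4 >= 4.9 no longer exposes <script> contents via .text;
    # .string still does (it is None for empty or multi-child tags)
    txt = "" if div.string is None else div.string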