Closed: kanjieater closed this issue 4 years ago
Google recently changed the way they present the image data, so the links were no longer being scraped. The new parser code is:

    from bs4 import BeautifulSoup

    from icrawler import Parser


    class GoogleParser(Parser):

        def parse(self, response):
            soup = BeautifulSoup(
                response.content.decode('utf-8', 'ignore'), 'lxml')
            # Results are now <img> tags with this (autogenerated) class;
            # the (thumbnail) image URL sits in the data-iurl attribute.
            image_divs = soup.find_all('img', class_='rg_i Q4LuWd tx8vtf')
            print(len(image_divs))  # debug: how many images were found
            for meta in image_divs:
                url = meta.get('data-iurl')
                if url:
                    yield dict(file_url=url)
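For anyone wondering how to apply this: a minimal usage sketch, assuming GoogleImageCrawler still accepts the parser_cls override (it does in the versions I've seen); the keyword and directory are placeholder values, and GoogleParser is the class defined above:

    from icrawler.builtin import GoogleImageCrawler

    # Swap the patched parser in via parser_cls; 'images' and 'cat' are
    # placeholders.
    google_crawler = GoogleImageCrawler(
        parser_cls=GoogleParser,
        storage={'root_dir': 'images'})
    google_crawler.crawl(keyword='cat', max_num=20)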
However, it only works for 20 images; I don't know how to handle pagination.
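For reference, pagination in icrawler is the feeder's job rather than the parser's. A rough sketch of a feeder that queues one URL per result page, assuming Google's isch endpoint still honors the ijn/start parameters (unverified here):

    from icrawler import Feeder


    class PagedGoogleFeeder(Feeder):
        """Hypothetical feeder: queue one search URL per result page."""

        def feed(self, keyword, max_num):
            # ijn is the page index and start the result offset; each page
            # historically returned about 100 results.
            for page in range((max_num + 99) // 100):
                url = ('https://www.google.com/search?q={}&tbm=isch'
                       '&ijn={}&start={}').format(keyword, page, page * 100)
                self.out_queue.put(url)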
Thanks for that. It's an OK band-aid for my current situation, but the language setting from #28 is now broken as well. Also, class_='rg_i Q4LuWd tx8vtf' appears to be an autogenerated class, which makes this a highly fragile selector that could break again at any time.
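One way to dodge the autogenerated class names (a sketch, untested against the live page): select on the attribute the parser actually reads instead of the class, which BeautifulSoup supports via attrs with a True value:

    # Match any <img> that carries a data-iurl attribute, regardless of
    # whatever class names Google generates this week.
    image_divs = soup.find_all('img', attrs={'data-iurl': True})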
I encountered the same problem.
Same problem here.
Same problem here.
Thanks for all your reports. I've found that Google image search has changed its API, so the parser needs to be adapted. However, I am quite busy at the moment and may not have enough time to handle this.
@hellock, congratulations on your excellent work with this Google crawler. I wish I could help. If you guide me through the code, I believe I can try.
I noticed that the min_size parameter doesn't work anymore:

    google_crawler.crawl(keyword=key, filters=filters, min_size=(1200, 600), max_num=maxImages)

It works without it:

    google_crawler.crawl(keyword=key, filters=filters, max_num=maxImages)

but then it downloads only the image thumbnails. Do you have any idea where to start? I would like to help.
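A plausible explanation (my assumption, not confirmed above): the patched parser yields thumbnail URLs from data-iurl, and Google's thumbnails are only a couple of hundred pixels wide, so min_size=(1200, 600) rejects every file. A quick way to check, with a hypothetical thumbnail URL pulled from a parsed page:

    from io import BytesIO

    import requests
    from PIL import Image

    # Placeholder URL -- substitute one of the parsed data-iurl values.
    url = 'https://encrypted-tbn0.gstatic.com/images?q=tbn:...'
    img = Image.open(BytesIO(requests.get(url).content))
    print(img.size)  # typically well under (1200, 600)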
Pull request for this issue!
I'm using icrawler, but I have a problem: the pictures I chose weren't downloaded to my machine. Could you tell me how to solve this?
The following code works for me. Hope it helps you:

    import re

    from bs4 import BeautifulSoup

    from icrawler import Parser


    class GoogleParser(Parser):

        def parse(self, response):
            soup = BeautifulSoup(
                response.content.decode('utf-8', 'ignore'), 'lxml')
            # The image URLs now live in inline <script> blocks rather than
            # in <img> attributes.
            image_divs = soup.find_all(name='script')
            for div in image_divs:
                txt = str(div)
                if 'AF_initDataCallback' not in txt:
                    continue
                # Only the 'ds:1' callback carries the search results.
                if 'ds:0' in txt or 'ds:1' not in txt:
                    continue
                # A stricter alternative is to strip the AF_initDataCallback
                # wrapper with re.sub, json.loads the payload, and read the
                # URLs out of meta[31][0][12][2]; the plain regex below is
                # simpler and works just as well for now.
                uris = re.findall(r'http.*?\.(?:jpg|png|bmp)', txt)
                return [{'file_url': uri} for uri in uris]
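If you want to sanity-check the extraction outside of icrawler's pipeline first, a throwaway snippet like this does the same thing (hypothetical query and headers; results depend on Google's current markup):

    import re

    import requests
    from bs4 import BeautifulSoup

    resp = requests.get(
        'https://www.google.com/search?q=cat&tbm=isch',
        headers={'User-Agent': 'Mozilla/5.0'})
    soup = BeautifulSoup(resp.content.decode('utf-8', 'ignore'), 'lxml')
    for script in soup.find_all(name='script'):
        txt = str(script)
        if 'AF_initDataCallback' not in txt or 'ds:1' not in txt:
            continue
        # Print the first few extracted URLs as a quick smoke test.
        print(re.findall(r'http.*?\.(?:jpg|png|bmp)', txt)[:5])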
I've been using icrawler's Google crawler for a while, but it seems to be broken now. Can we get the library updated?