talleyhoe / google-image-scraper

Simple google images scraper without chromium
GNU General Public License v3.0

Fix infinite loop when get_image_urls returns empty dict {} #2

Closed bealbrown closed 6 months ago

bealbrown commented 6 months ago

Hello,

Ran into this when I was using this great project!

When you run a search that returns 0 results, the script gets stuck in the get_manifest() while loop.

This fix checks whether get_image_urls returned an empty dict and, if so, breaks out of the loop and keeps whatever results we have so far, which is often 0.

Otherwise we start spamming the Google endpoint with as many requests as python can manage to produce per second haha.
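A minimal sketch of the guard described above, not the project's actual code. The names `collect_image_urls`, `fetch_page`, and `wanted` are hypothetical stand-ins for the get_manifest() loop, the get_image_urls call, and the -c count, and the sketch assumes each call returns a dict of results:

```python
from itertools import islice
from typing import Callable, Dict


def collect_image_urls(
    fetch_page: Callable[[int], Dict[str, str]],  # hypothetical stand-in for get_image_urls
    wanted: int,
) -> Dict[str, str]:
    """Collect up to `wanted` entries, stopping early when a page comes back empty."""
    manifest: Dict[str, str] = {}
    page = 0
    while len(manifest) < wanted:
        batch = fetch_page(page)
        if not batch:
            # Empty dict: the query has no (more) results, so stop
            # instead of requesting the same thing forever.
            break
        manifest.update(batch)
        page += 1
    # Trim any overshoot from the last page so at most `wanted` entries remain.
    return dict(islice(manifest.items(), wanted))


# A query with zero results now ends after a single request.
print(collect_image_urls(lambda page: {}, wanted=5))  # → {}
```

Without the `if not batch: break` check, the `while` condition never becomes false for a query with no results, which is the loop seen in the first log below.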

I did some testing on this, but I'd say it's probably worth approaching it from your own first principles.

Here was the test I did:

With a query that returns 0 results, and some debug logging to show the get_image_urls function calls:

root@lvm1:~/google-image-scraper# python3 src/main.py "\"kj2h35kj5h5h25hj4235on235ov4n3v6op45i7567m567n25fffffffffffffffffjkh235gh\"" -c 5 -d ./images
"kj2h35kj5h5h25hj4235on235ov4n3v6op45i7567m567n25fffffffffffffffffjkh235gh"
getting image_urls
<Response [200]>
getting image_urls
<Response [200]>
getting image_urls
<Response [200]>
getting image_urls
<Response [200]>
getting image_urls
<Response [200]>
getting image_urls
<Response [200]>
getting image_urls
[...] ad infinitum until Google bans the IP

With the present fix enabled:

root@lvm1:~/google-image-scraper# python3 src/main.py "\"kj2h35kj5h5h25hj4235on235ov4n3v6op45i7567m567n25fffffffffffffffffjkh235gh\"" -c 5 -d ./images
"kj2h35kj5h5h25hj4235on235ov4n3v6op45i7567m567n25fffffffffffffffffjkh235gh"
getting image_urls
<Response [200]>
Found 0 of 5 image sources
0it [00:00, ?it/s]

talleyhoe commented 6 months ago

This all looks great! Thanks for the write-up and testing, glad you're enjoying the project :) Merging