tasos-py / Search-Engines-Scraper

Search google, bing, yahoo, and other search engines with python
MIT License

Add sleeping #3

Closed zgruza closed 3 years ago

zgruza commented 4 years ago

Add sleeping to avoid lockout (especially for Google). BTW, amazing tool! ^^

tasos-py commented 4 years ago

Thanks @zgruza, I'm very glad you like my code!

The program already waits for a random interval between HTTP requests. The delay is set in a "private" _delay attribute (source: https://github.com/tasos-py/Search-Engines-Scraper/blob/ee6bc462121269dfc2dbb1e520f6796cd20a4cee/search_engines/core/engine.py#L14) and it is 1-4 sec for all Search objects except Google, which waits 2-6 sec. If you're getting too many bans you could increase the delay, but I can't guarantee that this will fool Google's bot detection algorithms. Also note that Google may issue a ban for other reasons - frequent use of advanced search operators (e.g. inurl:, intext:) or use of keywords related to malicious activity (e.g. public exploits).

I think the best way to avoid bans is to request fewer than 20 pages of search results from each search engine, and to use more than one search engine if you want more results.
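
For example, here's a minimal sketch of increasing the delay. It assumes _delay holds a (min, max) seconds tuple as in the linked engine.py, and that the search()/links() calls match the current README; both may differ between versions.

```python
from search_engines import Google

# Sketch only: _delay is the "private" attribute mentioned above; treating it
# as a (min, max) seconds tuple is an assumption based on the linked engine.py.
engine = Google()
engine._delay = (5, 10)  # wait 5-10 seconds between requests instead of 2-6

# Keeping the requested page count low also reduces the ban risk.
results = engine.search('python web scraping', pages=2)
for link in results.links():
    print(link)
```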

zgruza commented 4 years ago

I just figured out how to get around the ban. After the last ban I couldn't get rid of it, even after changing the IP address. Then I changed the User-Agent and it works! :)

When I get banned, is there any waiting time, or is it forever?

This tool is a really great resource for research, really awesome! 🤗 Looking forward to adding new sources. I'm working on the Candle search engine now, so a pull request is coming soon. 😋

tasos-py commented 4 years ago

Great idea! It seems Google's bans apply to a specific IP+User-Agent combination. For example, if a Google object gets banned, the ban also applies to the Firefox browser (all Search objects use Firefox as the User-Agent), but not to Chrome or IE. I'm a little surprised that the ban persists after changing the IP, though. Maybe I should update the default User-Agent string; Firefox/51.0 is not that common nowadays and I wonder if that could trigger Google's bot detection system.
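
A rough sketch of overriding the default User-Agent is below; set_headers() is an assumption here, and the actual hook may be a different method or attribute depending on the version, so check engine.py before relying on it.

```python
from search_engines import Google

# Assumption: set_headers() is the hook for overriding the default Firefox
# User-Agent; the actual method/attribute may differ, so check engine.py first.
engine = Google()
engine.set_headers({
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:115.0) '
                  'Gecko/20100101 Firefox/115.0'
})

results = engine.search('test query', pages=1)
print(results.links())
```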

In my experience, Google's bans don't last that long - about 5-10 minutes - but the problem is that they also affect the Firefox browser, and in that case they expire only if you solve the captcha or clear the site's cookies. I haven't tested whether bans get progressively longer and more frequent, so using a proxy is probably a good idea.
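
And a sketch of routing requests through a proxy; the proxy and timeout keyword arguments are assumptions about the constructor, so confirm the parameter names in the engine class before using this.

```python
from search_engines import Google

# Assumption: the constructor accepts proxy/timeout keyword arguments, with the
# proxy given as scheme://host:port; confirm the names in the engine class.
engine = Google(proxy='socks5://127.0.0.1:9050', timeout=15)

results = engine.search('test query', pages=1)
print(results.links())
```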

I'm very glad to hear that you're planning to expand my project! I think the code's structure is quite simple, but if you come across any problems I'll be glad to help. Also, I'm planning to make some updates myself:

  1. Fix Yahoo's and Aol's CSS selectors because sometimes they produce incomplete URLs.
  2. Remove Searx or find another Searx instance; the main site doesn't seem to work anymore.
  3. Clean the code a little - put every search engine class in a separate file, use more descriptive names.

So, if you wait a couple of weeks, you should have better code to work with.

tasos-py commented 4 years ago

I updated the code, have fun with it!