istresearch / scrapy-cluster

This Scrapy project uses Redis and Kafka to create a distributed on demand scraping cluster.
http://scrapy-cluster.readthedocs.io/
MIT License
1.18k stars, 323 forks

subdomain based throttling #90

Closed softwarevamp closed 7 years ago

softwarevamp commented 7 years ago

I have requests like these:

http://list.domain.com/list.html?cat=<cat>
http://item.domain.com/<id>.html

Currently the throttle is domain-based, but I want it to be subdomain-based, because on large websites each subdomain is served by separate servers.

madisonb commented 7 years ago

Do you have any documentation showing that web scraping blockers like CloudFlare or Incapsula block based on subdomain? I have never heard of that, and I have even encountered cases where many sites and subdomains are all orchestrated under one blocker, so if you get banned for scraping one site too hard, the ban propagates and you are banned from the other sites as well.

Subdomains don't necessarily have to exist on different servers or use separate scraper-blocking software. Regardless, thanks to the TLDExtract library used in this project, this could be implemented as an optional flag.
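
To make the idea concrete, here is a rough sketch of how such a flag might pick the throttle key. It is not the project's actual code: the function names and the `per_subdomain` flag are hypothetical, and it only assumes that `tldextract.extract()` returns an object with `subdomain`, `domain`, and `suffix` fields.

```python
def throttle_key(subdomain, domain, suffix, per_subdomain=False):
    """Build the key that requests are grouped under for throttling.

    Hypothetical helper: the name and the per_subdomain flag are
    illustrative assumptions, not part of scrapy-cluster's API.
    """
    # Registered domain, e.g. "domain.com"
    registered = f"{domain}.{suffix}" if suffix else domain
    # Only include the subdomain when the flag is on and one exists
    if per_subdomain and subdomain:
        return f"{subdomain}.{registered}"
    return registered


def throttle_key_for_url(url, per_subdomain=False):
    """Split a URL with tldextract, then build the throttle key."""
    # Third-party dependency, imported lazily so the helper above
    # can be used without it
    import tldextract

    parts = tldextract.extract(url)
    return throttle_key(parts.subdomain, parts.domain, parts.suffix,
                        per_subdomain)
```

With the flag off, `http://list.domain.com/...` and `http://item.domain.com/...` would share the key `domain.com`; with it on, they would be throttled separately as `list.domain.com` and `item.domain.com`.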

madisonb commented 7 years ago

If you can provide an update here, that would be great; otherwise I am going to close this soon due to lack of interest and no supporting documentation that anti-scraping providers block based on subdomain.

madisonb commented 7 years ago

Closing.