Closed Jas0n99 closed 4 years ago
There is no robots.txt check yet. I'll have it added soon.

What you are essentially doing is running a DoS bot on any webserver that is not able to keep up with your rate of requests.
You can just crawl multiple sites in parallel if you need to go that fast. Running almost 100 parallel requests against a site that may well have to dynamically generate the content can easily shut it down.
And yes, the site should probably implement rate limiting, but not every website on the net is run by people with the resources/knowledge to do so, or may not do so for other reasons.
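A minimal sketch of the kind of per-host politeness delay being asked for here (the class name and the default delay are illustrative, not taken from any particular crawler):

```python
import time
from collections import defaultdict

class PerHostThrottle:
    """Enforce a minimum delay between requests to the same host,
    so that crawling many sites in parallel never hammers one of them."""

    def __init__(self, min_delay=1.0):
        self.min_delay = min_delay
        self.last_request = defaultdict(float)  # host -> timestamp of last request

    def wait(self, host):
        """Block until at least min_delay has passed since the last request to host."""
        elapsed = time.monotonic() - self.last_request[host]
        if elapsed < self.min_delay:
            time.sleep(self.min_delay - elapsed)
        self.last_request[host] = time.monotonic()
```

With something like this in place, a crawler can still run many requests concurrently across different hosts while each individual site sees at most one request per `min_delay` seconds.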
Friendly reminder - people are getting paid by other people to analyze why the heck their websites are getting shut down. So your bot is responsible for burning money - one of our customers is directly affected. Please reconsider your approach. You're hurting smaller businesses with this.
All crawling activities have been stopped until the load issue is fixed.
Short update: robots.txt is now checked before requesting `/` and any other URL that occurs on a website.

Crawling activities will be restored carefully and gradually. This issue will remain open for some time in case something goes wrong with this new functionality.
Well, it's been almost a month already with no apparent issues. Marking this issue as closed.
This is a small excerpt of my access log, where I found lots of concurrent requests in the exact same second:
https://gist.github.com/ciencia/dece9b00294468ef002171eb0e8d7a37
No robots.txt access.
Of course, I'm now blocking your bot from accessing our website, and reporting it to all sites that track abusive bots.
Experiencing the same issue as ciencia: loads of sudden concurrent requests, and robots.txt is not being accessed.
It doesn't appear that your crawler checks for a robots.txt before crawling a site. This is BAD practice if you want to be a 'legitimate' bot. Some sites do NOT want to be crawled by random bots, or there are pages you shouldn't be crawling (for various reasons)...
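For reference, Python's standard library already ships a robots.txt parser, so the missing check is cheap to add. A sketch (the helper name and the sample rules below are illustrative, not from the crawler's codebase):

```python
from urllib.robotparser import RobotFileParser

def allowed_by_robots(user_agent, url, robots_txt):
    """Check a URL against an already-fetched robots.txt body.

    In a real crawler you would fetch https://<host>/robots.txt once,
    cache the parsed result per host, and consult it before every request.
    """
    rp = RobotFileParser()
    rp.parse(robots_txt.splitlines())
    return rp.can_fetch(user_agent, url)

# Example rules a site might publish:
sample = "User-agent: *\nDisallow: /private/\n"
```

Any URL under `/private/` would then be skipped instead of fetched, which is exactly what site owners publishing such rules expect.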
You appear to have ZERO rate limiting for your crawl speed. That is ABUSE... I count 40 PAGES/sec request rate (and that doesn't include other content)...
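A token bucket is one common way to cap the overall request rate far below the 40 pages/sec observed here. This sketch uses made-up class and parameter names and is only meant to show the mechanism:

```python
import time

class TokenBucket:
    """Cap the request rate: tokens refill at `rate` per second, up to `capacity`.

    A request may proceed only when a token is available, so sustained
    throughput never exceeds `rate` requests per second.
    """

    def __init__(self, rate, capacity):
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity
        self.last = time.monotonic()

    def acquire(self):
        """Take one token if available; return False if the caller must back off."""
        now = time.monotonic()
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False
```

A crawler would call `acquire()` before each request and sleep briefly whenever it returns `False`, combining this global cap with a per-host delay.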