tb0hdan / domains

World’s single largest Internet domains dataset
https://domainsproject.org
BSD 3-Clause "New" or "Revised" License

If you are only searching domains, why is your crawler scraping pages??? #2

Closed: Jas0n99 closed this issue 4 years ago

Jas0n99 commented 4 years ago
  1. It doesn't appear that your crawler checks robots.txt before crawling a site. This is BAD practice if you want to be a 'legitimate' bot. Some sites do NOT want to be crawled by random bots, or there are pages you shouldn't be crawling (for various reasons)... (a minimal check is sketched after this list)

  2. You appear to have ZERO rate limiting on your crawl speed. That is ABUSE... I count a request rate of 40 pages/sec (and that doesn't include other content)...
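For reference, the missing robots.txt check from the first point could look something like the minimal Python sketch below; the user-agent string and URLs are placeholders, not the project's actual values:

```python
# Minimal sketch of a robots.txt check before fetching pages.
# The user-agent string and URLs are placeholders, not the project's actual values.
from urllib import robotparser
from urllib.parse import urljoin, urlparse

USER_AGENT = "example-crawler"  # hypothetical UA token

def allowed_by_robots(page_url: str, user_agent: str = USER_AGENT) -> bool:
    """Return True if robots.txt on the page's host permits fetching it."""
    root = "{0.scheme}://{0.netloc}/".format(urlparse(page_url))
    rp = robotparser.RobotFileParser()
    rp.set_url(urljoin(root, "robots.txt"))
    try:
        rp.read()  # fetches and parses robots.txt
    except OSError:
        return True  # unreachable robots.txt: policy decision, here we allow
    return rp.can_fetch(user_agent, page_url)

if __name__ == "__main__":
    print(allowed_by_robots("https://example.com/some/page"))
```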

tb0hdan commented 4 years ago
  1. Pages are scraped to collect links within the site itself.
  2. Indeed, there's no robots.txt check yet. I'll have it added soon.
  3. Adding rate limiting defeats the whole purpose of domain collection. Sites are indexed as quickly as possible and revisited no sooner than 1 month later.
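On point 3: a modest per-host delay does not have to reduce overall collection speed, as long as many different hosts are crawled in parallel. A minimal sketch of such a limiter follows; the 1-second delay and the URLs are illustrative assumptions, not the project's actual settings:

```python
# Sketch of a per-host limiter: enforce a minimum delay between requests
# to the same host. Hostnames and the delay value are illustrative only.
import time
from urllib.parse import urlparse

class PerHostLimiter:
    def __init__(self, min_delay: float = 1.0):
        self.min_delay = min_delay
        self._last_request = {}  # host -> monotonic timestamp of last request

    def wait(self, url: str) -> None:
        host = urlparse(url).netloc
        now = time.monotonic()
        last = self._last_request.get(host)
        if last is not None:
            remaining = self.min_delay - (now - last)
            if remaining > 0:
                time.sleep(remaining)
        self._last_request[host] = time.monotonic()

limiter = PerHostLimiter(min_delay=1.0)
for url in ["https://example.com/a", "https://example.org/b", "https://example.com/c"]:
    limiter.wait(url)  # only the second example.com request is delayed
    # fetch(url) would go here
```
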
taladar commented 4 years ago

What you are essentially doing is running a DoS bot on any webserver that is not able to keep up with your rate of requests.

sinesc commented 4 years ago

You can just crawl multiple sites in parallel if you need to go that fast. Running almost 100 parallel requests against a site that may well have to dynamically generate the content can easily shut it down.

And yes, the site should probably implement rate limiting, but not every website on the net is run by people with the resources or knowledge to do so, and some may choose not to for other reasons.
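A sketch of that idea: overall throughput comes from crawling many hosts concurrently, while a per-host semaphore caps how many requests hit any single site at once. The limit of 2 and the simulated fetch are assumptions for illustration, not the project's actual code:

```python
# Sketch: high overall throughput by crawling many hosts concurrently,
# while a per-host semaphore caps concurrent requests to any single host.
# The limit of 2 and the fake fetch are illustrative only.
import asyncio
from collections import defaultdict
from urllib.parse import urlparse

PER_HOST_LIMIT = 2
host_semaphores = defaultdict(lambda: asyncio.Semaphore(PER_HOST_LIMIT))

async def fetch(url: str) -> None:
    host = urlparse(url).netloc
    async with host_semaphores[host]:  # at most 2 requests in flight per host
        await asyncio.sleep(0.1)       # stand-in for the real HTTP request
        print("fetched", url)

async def main() -> None:
    urls = [f"https://example{i % 5}.com/page{i}" for i in range(50)]
    await asyncio.gather(*(fetch(u) for u in urls))  # many hosts in parallel

asyncio.run(main())
```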

stlvc commented 4 years ago

Friendly reminder - people are getting paid by other people to analyze why the heck their websites are getting shut down. So your bot is responsible for burning money - one of our customers is directly affected. Please reconsider your approach. You're hurting smaller businesses with this.

tb0hdan commented 4 years ago

All crawling activity has been stopped until the load issue is fixed.

tb0hdan commented 4 years ago

Short update:

Crawling activities will be restored carefully and gradually. This issue will remain open for some time in case something goes wrong with this new functionality.

tb0hdan commented 4 years ago

Well, it's been almost a month already with no apparent issues. Marking this issue as closed.

ciencia commented 3 years ago

This is a small excerpt of my access log where I've found lots of concurrent requests on the exact same second:

https://gist.github.com/ciencia/dece9b00294468ef002171eb0e8d7a37

No robots.txt access.

Of course, I'm now blocking your bot from accessing our website, and reporting it to all sites that track abusive bots.
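For anyone checking their own logs, here is a small sketch of how such same-second bursts can be counted from a combined-format access log. The filename, the log format, and the user-agent substring used as a filter are assumptions for illustration, not the bot's actual identifier:

```python
# Sketch: count requests per second in a combined-format access log,
# to spot bursts like the ones described above. The filename, the log
# format, and the user-agent substring are assumptions for illustration.
import re
from collections import Counter

LOG_LINE = re.compile(r'\[(?P<ts>[^\]]+)\].*"(?P<ua>[^"]*)"$')

per_second = Counter()
with open("access.log") as fh:
    for line in fh:
        m = LOG_LINE.search(line)
        if not m:
            continue
        ts = m.group("ts").split()[0]           # e.g. 10/Oct/2020:13:55:36
        if "domains" in m.group("ua").lower():  # crude bot filter, placeholder
            per_second[ts] += 1

# Print the busiest seconds first.
for ts, count in per_second.most_common(10):
    print(count, ts)
```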

aivarsi commented 3 years ago

Experiencing the same issue as ciencia: sudden bursts of concurrent requests, and robots.txt is not being accessed.