internetarchive / heritrix3

Heritrix is the Internet Archive's open-source, extensible, web-scale, archival-quality web crawler project.
https://heritrix.readthedocs.io/

archive web crawler - crawl speed #562

Closed solaceten closed 1 year ago

solaceten commented 1 year ago

Hi there

I am a server administrator. I have been finding that the web crawler for archive.org has been crawling websites on our servers at a very overwhelming rate - causing servers to become unstable and have very high server load.

I am wondering if your crawler honours a robots.txt directive or equivalent setting that would allow us to slow down its crawl rate (similar to the Google search engine crawl-rate control mentioned here).

Thank you for your assistance.

solaceten commented 1 year ago

Further to my previous messages, here is an example of your misbehaving bot trying to access plugin files, none of which actually exist.

Below is a small example of many thousands of hits, all in the space of a few minutes.

IP:
207.241.230.103 207.241.232.92 207.241.230.131 207.241.232.89

Apache Server Status for 127.0.0.1 (via 127.0.0.1)
Server Version: Apache/2.4.57 (cPanel) OpenSSL/1.1.1t mod_bwlimited/1.4
Server MPM: prefork
Server Built:

Current Time: Friday, 19-May-2023 14:27:49 NZST
Server load: 38.38
Total accesses: 31188 - Total Traffic: 773.9 MB - Total Duration: 61212801
CPU Usage: u2.37 s8.75 cu13602.8 cs2568.91 - 38.7% CPU load
.746 requests/sec - 19.0 kB/second - 25.4 kB/request - 1962.7 ms/request
52 requests currently being processed, 47 idle workers

Srv PID Acc M CPU SS Req Dur Conn Child Slot Client Protocol VHost Request
0-0 4881 0/18/2007 W 15.04 29 0 4353753 0.0 0.76 53.28 207.241.230.103 http/1.1 example.com:443 GET /wp-content/plugins/schema-and-structured-data-for-wp/admin
1-0 8023 0/11/1926 5.43 0 26476 3802995 0.0 0.25 46.69 207.241.232.92 http/1.1 example.com:443 GET /wp-content/plugins/ultimate-social-media-icons/js/shuffle/
2-0 10855 0/4/2002 2.03 0 33757 4027619 0.0 0.11 51.60 207.241.232.92 http/1.1 example.com:443 GET /wp-content/plugins/counter-number-showcase/assets/css/boot
3-0 11510 0/3/1831 1.46 0 2426 4368181 0.0 0.02 46.56 207.241.230.103 http/1.1 example.com:443 GET /wp-content/uploads/the-core-style.css?ver=1667594027 HTTP/
4-0 10858 0/4/1760 2.94 2 10255 3437994 0.0 0.08 53.47 207.241.230.131 http/1.1 example.com:443 GET /wp-content/plugins/unyson/framework/extensions/shortcodes/
5-0 3437 1/31/1768 W 17.35 35 0 3242961 6.7 0.86 48.85 207.241.230.103 http/1.1 example.com:443 GET /wp-content/plugins/tickera-event-ticketing-system/css/elem
6-0 10859 0/3/1731 1.49 0 43765 3219504 0.0 0.07 40.28 207.241.230.103 http/1.1 example.com:443 GET /wp-content/plugins/tickera-event-ticketing-system/css/font
7-0 3438 0/29/1470 18.61 1 10331 3748227 0.0 0.47 36.51 207.241.230.131 http/1.1 example.com:443 GET /wp-content/plugins/wp-data-access/assets/js/wpda_restapi.
8-0 11530 0/3/1463 1.45 0 1446 3189077 0.0 0.02 30.93 207.241.230.103 http/1.1 example.com:443 GET /wp-content/plugins/counter-number-showcase/assets/js/waypo
9-0 8044 0/11/1378 W 10.88 30 0 2774315 0.0 0.30 31.36 207.241.232.89 http/1.1 example.com:443 GET /wp-content/plugins/counter-number-showcase/assets/css/coun
10-0 11531 0/2/1596 _ 1.10 2 14395 2616619 0.0 0.02 45.35 207.241.230.131 http/1.1 example.com:443 GET /wp-content/plugins/animated-number-counters/assets/js/anc-
11-0 11543 0/0/1382 W 0.00 43 0 2836848 0.0 0.00 31.04 207.241.232.90 http/1.1 example.com:443 GET /wp-content/plugins/tickera-event-ticketing-system/css/fron
12-0 2286 0/62/1296 W 40.61 41 0 2375666 0.0 2.09 35.78 207.241.230.103 http/1.1 example.com:443 GET /wp-content/plugins/unyson/framework/extensions/shortcodes/
13-0 11544 0/0/1152 W 0.00 43 0 2068651 0.0 0.00 26.34 207.241.232.89 http/1.1 example.com:443 GET /wp-content/plugins/counter-number-showcase/assets/css/font
14-0 11545 0/0/1106 W 0.00 42 0 2123248 0.0 0.00 29.86 207.241.230.131 http/1.1 example.com:443 GET /wp-content/plugins/tickera-event-ticketing-system/css/elem
15-0 8047 0/10/874 W 2.97 45 0 1312237 0.0 0.11 19.19 207.241.232.89 http/1.1 example.com:443 GET /wp-content/plugins/wp-data-access/assets/css/wpda_public.c
16-0 11546 0/0/928 W 0.00 42 0 2016284 0.0 0.00 24.29 207.241.232.89 http/1.1 example.com:443 GET /wp-content/plugins/counter-number-showcase/assets/css/boot
17-0 11559 0/0/658 W 0.00 42 0 883667 0.0 0.00 12.52 207.241.230.103 http/1.1 example.com:443 GET /wp-content/plugins/counter-number-showcase/assets/css/font
18-0 11569 0/0/608 W 0.00 42 0 1074856 0.0 0.00 17.50 207.241.232.90 http/1.1 example.com:443 GET /wp-content/plugins/unyson/framework/extensions/shortcodes/
19-0 11570 0/0/491 W 0.00 42 0 738297 0.0 0.00 11.43 207.241.230.103 http/1.1 example.com:443 GET /wp-content/plugins/tickera-event-ticketing-system/css/font

solaceten commented 1 year ago

I am experiencing very high levels of traffic from your IPs.

207.241.230.103 207.241.232.92 207.241.230.131 207.241.232.90 207.241.232.89

I don't want to blacklist them, but will have no choice if we cannot find a resolution.

Time: Fri Jun 16 09:56:01 2023 +1200
1 Min Load Avg: 37.34
5 Min Load Avg: 12.08
15 Min Load Avg: 5.72
Running/Total Processes: 46/402

Other people are reporting you as malicious.

https://www.abuseipdb.com/check/207.241.232.90
https://www.abuseipdb.com/check/207.241.230.131
https://www.abuseipdb.com/check/207.241.230.103
https://www.abuseipdb.com/check/207.241.232.92
https://www.abuseipdb.com/check/207.241.232.89

Your developers need to sort this out.

anjackson commented 1 year ago

Please contact the Internet Archive directly, as this site is used for collaborative development, and they may miss complaints raised here.

See https://archive.org/about/contact.php or contact info@archive.org

solaceten commented 1 year ago

I already contacted them directly and they have not replied in many weeks.

solaceten commented 1 year ago

This is the last response I got... Each follow up I send receives nothing.

Support ID 827830

May 10, 2023, 17:23 PDT

I am a server administrator. I have been finding that the web crawler for archive.org has been crawling websites on our servers at a very overwhelming rate - causing servers to become unstable and have very high server load.

I am wondering if your crawler honours a robots.txt directive or equivalent setting that would allow us to slow down its crawl rate (similar to the Google search engine crawl-rate control mentioned here).

Thank you for your assistance. Sol

===

Patron Services Yellow (Internet Archive) Internet Archive support@archivesupport.zendesk.com May 17, 2023, 19:50 PDT

I am very sorry! Can you please tell me which server(s) this is affecting? I will see what we can do. Thanks!

Mark Graham, Director, the Wayback Machine at the Internet Archive

==

Thank you for your reply.

Actually we have many servers and have noticed it across multiple sites.

Is there a robots.txt directive or equivalent setting that would allow us to slow down your crawler's crawl rate (similar to the Google search engine crawl-rate control mentioned here)?

Thanks

====

Patron Services Yellow (Internet Archive) Internet Archive support@archivesupport.zendesk.com May 17, 2023, 21:27 PDT

No... I am very sorry but we don't support that. I wish we did!

Mark Graham, Director, the Wayback Machine at the Internet Archive
Patron Services Yellow

====

May 18, 2023, 4:31 PM

Hmmm well that's a bit of a concern.

OK we will have to develop a way to target your crawler and refuse connections after x amount of requests within x amount of time.

Do you have a list of crawler IP addresses or Host Name, User Agent or ASN numbers that your system uses?

Thanks

===

May 18, 2023, 4:31 PM

Further to my previous messages, here is an example of your misbehaving bot trying to access plugin files.

IP:
207.241.230.103 207.241.232.92 207.241.230.131 207.241.232.89

===

June 16, 2023, 4:31 PM

Hello again

I did not get any update or reply from you.

I am still experiencing very high levels of traffic from your IPs.

207.241.230.103 207.241.232.92 207.241.230.131 207.241.232.90 207.241.232.89

I don't want to blacklist them, but will have no choice if we cannot find a resolution.

Time: Fri Jun 16 09:56:01 2023 +1200
1 Min Load Avg: 37.34

====

June 27, 2023, 10:03 AM

I still have not received any response.

anjackson commented 1 year ago

Just to be clear: I don't work for the Internet Archive.

However, I do know they archive many thousands of sites every day. It is unlikely that they can track down any errant behaviour without more information from you. As they said:

Can you please tell me which server(s) this is affecting?

FWIW, in my experience, the most helpful thing is to have a snippet of your server logs that includes the hosts, URLs, user agents etc.
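
To illustrate the kind of summary that tends to help, here is a minimal Python sketch that counts requests per client IP and user agent. It assumes the logs are in Apache's combined log format; the `summarize` helper is hypothetical, and the sample lines (documentation-range IPs, a made-up user agent) are invented purely for illustration and are not taken from any real log.

```python
import re
from collections import Counter

# Assumption: Apache "combined" log format, i.e.
# ip ident user [time] "request" status bytes "referer" "user-agent"
COMBINED = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) \S+ '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)


def summarize(lines):
    """Count hits per (client IP, user agent); unparseable lines are skipped."""
    hits = Counter()
    for line in lines:
        match = COMBINED.match(line)
        if match:
            hits[(match["ip"], match["agent"])] += 1
    return hits


# Hypothetical sample lines, not real traffic.
SAMPLE = [
    '203.0.113.5 - - [19/May/2023:14:27:49 +1200] "GET /a.css HTTP/1.1" 404 512 "-" "example-crawler/1.0"',
    '203.0.113.5 - - [19/May/2023:14:27:50 +1200] "GET /b.js HTTP/1.1" 404 512 "-" "example-crawler/1.0"',
    '203.0.113.9 - - [19/May/2023:14:27:51 +1200] "GET / HTTP/1.1" 200 2048 "-" "Mozilla/5.0"',
    'not a log line',
]

if __name__ == "__main__":
    for (ip, agent), count in summarize(SAMPLE).most_common():
        print(f"{count:6d}  {ip}  {agent}")
```

Running something like this over the relevant time window gives a short list of hosts, user agents and request counts that can be attached to a report instead of raw log dumps.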

Failing that, I suggest you look at configuring your web server to use IP-based rate limiting, to keep the volume of requests at a level you are comfortable with.
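
As a rough illustration of the idea behind IP-based rate limiting (refusing a client once it exceeds N requests within a time window), here is a minimal Python sketch. In practice this would normally be configured in the web server or firewall rather than written by hand; the `SlidingWindowLimiter` class and the limits used below are hypothetical and chosen only to show the logic.

```python
from collections import defaultdict, deque


class SlidingWindowLimiter:
    """Allow at most max_requests per client IP within window_seconds."""

    def __init__(self, max_requests, window_seconds):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        # Per-IP queue of timestamps for requests that were allowed.
        self._hits = defaultdict(deque)

    def allow(self, client_ip, now):
        """Return True if the request at time `now` (seconds) may proceed."""
        hits = self._hits[client_ip]
        # Forget requests that have fallen out of the window.
        while hits and now - hits[0] >= self.window_seconds:
            hits.popleft()
        if len(hits) >= self.max_requests:
            return False  # over the limit: the caller would reject, e.g. with HTTP 429
        hits.append(now)
        return True


if __name__ == "__main__":
    import time

    limiter = SlidingWindowLimiter(max_requests=3, window_seconds=10.0)
    for _ in range(5):
        print(limiter.allow("203.0.113.5", time.monotonic()))
```

Passing the timestamp in explicitly keeps the limiter easy to test; a real deployment would also need to expire idle IPs so the table does not grow without bound.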

solaceten commented 1 year ago

OK thanks. I did email them with lots of server logs - I just snipped it above for brevity.....

They are extremely poor at responding. I suspect this is simply too hard for them.

We already have rate-based limiting and that works fine, but it is a poor show when service providers are attacking servers. So I reached out to them for two reasons: 1) to ask if they knew it was happening and 2) to see if they could slow it down.

Fail on both counts.....