mitchellkrogza / apache-ultimate-bad-bot-blocker

Apache Block Bad Bots, (Referer) Spam Referrer Blocker, Vulnerability Scanners, Malware, Adware, Ransomware, Malicious Sites, Wordpress Theme Detectors and Fail2Ban Jail for Repeat Offenders

Internet Archive / Wayback Machine #143

Open cxtal opened 4 years ago

cxtal commented 4 years ago

Hello! Since 2017 the Wayback Machine has ignored robots.txt entirely, so there is currently neither an opt-in nor an opt-out. That will probably lead to complications down the line with EU privacy law, Article 13, and numerous US and international privacy and copyright laws.

There is a typical shrink-wrap claim along the lines of "if you do not want your content to be seen, then do not put it online" which sounds like something ripped out of The Purge movies (TL;DR if I lick it, it's mine).

Could the "good bot" be turned into a "bad bot" now?

I'm quite happy with preservation myself, but having no choice at all sounds a bit dire. This makes pirates look like baby hamsters by comparison. Either that, or we need websites to pirate each other's content more, until every website has the exact same content as every other website.

mitchellkrogza commented 4 years ago

Add the line below to your blacklist-user-agents.conf include file and reload nginx. I have them blocked across all my sites, but the blocker ships with them set to "rate limited" by default, which I think should be reviewed.

"~*archive.org_bot" 3;

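For readers unfamiliar with the syntax, here is a minimal Python sketch of how a rule like the one above is evaluated. It assumes the line lives inside an nginx `map` keyed on the User-Agent header, where `~*` marks a case-insensitive regex and the first matching regex wins. The names `RULES`, `DEFAULT_VALUE` and `classify_user_agent` are hypothetical, the default of 0 is an assumption, and reading 3 as "blocked" is inferred from this thread rather than from the blocker's source.

```python
import re

# Hypothetical rule table mirroring the include-file line above.
# In an nginx map, "~*" marks a case-insensitive regular expression.
RULES = [
    (re.compile(r"archive.org_bot", re.IGNORECASE), 3),
]

# Assumption: the value returned when no rule matches.
DEFAULT_VALUE = 0


def classify_user_agent(user_agent: str) -> int:
    """Return the value of the first rule whose regex is found in the User-Agent."""
    for pattern, value in RULES:
        if pattern.search(user_agent):
            return value
    return DEFAULT_VALUE


if __name__ == "__main__":
    # Illustrative User-Agent strings, not taken from real logs.
    for ua in ("Mozilla/5.0 (compatible; archive.org_bot)", "Mozilla/5.0 Firefox/115.0"):
        print(ua, "->", classify_user_agent(ua))
```

Note that the dot in the pattern is unescaped, so it matches any single character, not only a literal ".". That is harmless here, but worth knowing when writing stricter rules.
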
mitchellkrogza commented 4 years ago

If your content is online, anyone can see, copy or scrape it. That's one reason this blocker exists: you get to choose who can or cannot scrape or even see your content.

cxtal commented 4 years ago

@mitchellkrogza Thank you for the configuration change!

Yes, perhaps the default rules for the archive.org bot should be revised. Before 2017, archive.org had an opt-out system: you could just edit robots.txt and all the content would be gone. The authors at the cited link lean towards a more technical argument, explaining that robots.txt itself is a bad mechanism. However, I am not contesting the technical claims about robots.txt, but rather the lack of (at least) an opt-out system.

The statement "if you do not like your content to be seen, then do not put it online" is a non sequitur (and beside the point anyway, since "seen" does not imply "own"). Using the exact same reasoning and shuffling the predicates, one could say: "if you do not want your computer to ever get a virus, then do not plug it into the wall socket".

Obviously, there are IP / privacy concerns that can be explained by common sense alone (regardless of whether the law covers them, and it does), especially post-2017, when the original claim does not hold:

and so on and so forth.

That statement should be wrapped in concrete, thrown in some pit and then nuked. It's just silly at this point; perhaps even worthy of becoming a meme.