cxtal opened 4 years ago
Add the line below to your blacklist-user-agents.conf include file and reload nginx. I have them blocked across all my sites, but the blocker's defaults only mark them as "rate limited", which I think should be reviewed.
"~*archive.org_bot" 3;
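For context, a sketch of how that entry might sit in the include file; the path below is the blocker's usual install location, and the comment follows this thread's convention that a value of 3 denies the matching agent outright (adjust both if your setup differs):

```nginx
# /etc/nginx/bots.d/blacklist-user-agents.conf
# (assumed default install path; adjust to your setup)
# Entries here override the blocker's shipped defaults.
# Per this thread, a value of 3 blocks the user agent outright
# instead of the default "rate limited" treatment.
"~*archive.org_bot"     3;
```

After editing, verify the configuration and reload with `nginx -t && nginx -s reload` (or `systemctl reload nginx` on systemd systems).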
If your content is online, anyone can see, copy, or scrape it. That is one reason this blocker exists: you get to choose who can or cannot scrape, or even see, your content.
@mitchellkrogza Thank you for the configuration change!
Yes, perhaps the archive.org bot default rules should be revised. Before 2017, archive.org had an opt-out system: you could simply edit robots.txt and all of your archived content would be removed. The authors at the cited link lean towards a more technical argument, namely that robots.txt itself is a bad mechanism. However, I am not contesting the technical claims about robots.txt, but rather the lack of (at least) an opt-out system.
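For illustration, the pre-2017 opt-out was as simple as a robots.txt rule; `ia_archiver` is the user-agent string the Wayback Machine is widely reported to have honored at the time (since 2017 this has no effect on archive.org):

```
# robots.txt at the site root
# Pre-2017 this removed the site from the Wayback Machine;
# archive.org has ignored it since.
User-agent: ia_archiver
Disallow: /
```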
The statement "if you do not like your content to be seen, then do not put it online" is a non-sequitur (and beside the point, since "seen" does not imply "owned"). Using the exact same reasoning and shuffling the predicates, one could say: "if you do not want your computer to ever get a virus, then do not plug it into the wall socket".
Obviously, there are intellectual-property and privacy concerns that can be explained by common sense alone (regardless of whether the law covers them or not, and it does), especially post-2017, when the original claim no longer holds; and so on and so forth.
That statement should be wrapped in concrete, thrown into a pit, and nuked. It is just silly at this point; perhaps even worthy of becoming a meme.
Hello! Since 2017 the Wayback Machine has ignored any robots.txt configuration, so there is at this time neither an opt-in nor an opt-out. This will probably lead to complications down the line with EU privacy law and Article 13, as well as numerous US and international privacy and copyright laws.
There is a typical shrink-wrap claim along the lines of "if you do not want your content to be seen, then do not put it online", which sounds like something ripped out of The Purge movies (TL;DR: if I lick it, it's mine).
Could the "good bot" be turned into a "bad bot" now?
I'm quite happy with preservation myself, but having no choice at all sounds a bit dire. This makes pirates look like baby hamsters by comparison. Either that, or websites need to start pirating each other's content until every website carries exactly the same content as every other.