mitchellkrogza / nginx-ultimate-bad-bot-blocker

Nginx Block Bad Bots, Spam Referrer Blocker, Vulnerability Scanners, User-Agents, Malware, Adware, Ransomware, Malicious Sites, with anti-DDOS, Wordpress Theme Detector Blocking and Fail2Ban Jail for Repeat Offenders

NOTE > [LAION / COMMON CRAWL] Are AI Crawlers Blocked? #515

Closed mitchellkrogza closed 1 year ago

mitchellkrogza commented 1 year ago

A question on many people's minds these days is whether their photos and data are being crawled by bots that gather training data for AI datasets.

LAION builds its datasets from Common Crawl data, which is gathered by the CCBot user agent (Common Crawl Bot). This blocker has blocked CCBot since inception.

Regardless of the version number CCBot reports, this blocker still blocks it, because I don't use version numbers when detecting bots and never will.
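To illustrate what version-independent detection means, here is a minimal Python sketch. It is not the blocker's actual rule: the pattern, the `is_blocked` helper and the sample User-Agent strings are assumptions made up for this example. The idea is simply to match the bot's name as a whole word and ignore whatever follows it.

```python
import re

# Hypothetical pattern: match the bot name as a whole word, case-insensitively,
# and ignore any version string that follows it.
CCBOT_PATTERN = re.compile(r"\bCCBot\b", re.IGNORECASE)


def is_blocked(user_agent: str) -> bool:
    """Return True if the User-Agent names CCBot, whatever version it reports."""
    return CCBOT_PATTERN.search(user_agent) is not None


# Illustrative User-Agent strings; the version numbers are invented.
samples = [
    "CCBot/2.0 (https://commoncrawl.org/faq/)",
    "CCBot/9.9",
    "Mozilla/5.0 (X11; Linux x86_64)",
]

for ua in samples:
    print(ua, "->", "blocked" if is_blocked(ua) else "allowed")
```

Because the version never takes part in the match, a new CCBot release needs no rule change.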

So for those worried about protecting their content: yes, the bad bot blocker does block it.

Is it foolproof?

Unfortunately, nothing is. Others who may be training AI datasets can easily change their bot's User-Agent name and bypass the current blocks.

This is why it's important to monitor your logs (see instructions). When you suddenly see a lot of new or suspicious-looking traffic, check your logs to see what that crawler or IP did on your site. If it was up to no good, report the user agent or range of IP addresses here so it can be added to the blocker.
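As a rough sketch of that kind of log check, the Python below tallies requests per User-Agent and per client IP. It assumes nginx's default "combined" log format, where the User-Agent is the last double-quoted field on each line. The `tally` helper and the sample log lines are hypothetical and are not part of the blocker or its instructions.

```python
import re
from collections import Counter

# Assumes the "combined" log format: the client IP comes first, and the
# referrer and User-Agent are the last two double-quoted fields.
LOG_LINE = re.compile(r'^(?P<ip>\S+) .* "(?P<referer>[^"]*)" "(?P<agent>[^"]*)"$')


def tally(lines):
    """Count requests per User-Agent and per client IP, skipping unparsable lines."""
    agents, ips = Counter(), Counter()
    for line in lines:
        match = LOG_LINE.match(line.strip())
        if match:
            agents[match.group("agent")] += 1
            ips[match.group("ip")] += 1
    return agents, ips


# Hypothetical sample lines; real input would come from your access log.
sample = [
    '203.0.113.7 - - [01/Jan/2024:00:00:01 +0000] "GET /img/1.jpg HTTP/1.1" 200 512 "-" "ExampleImageBot/1.0"',
    '203.0.113.7 - - [01/Jan/2024:00:00:02 +0000] "GET /img/2.jpg HTTP/1.1" 200 640 "-" "ExampleImageBot/1.0"',
    '198.51.100.4 - - [01/Jan/2024:00:00:03 +0000] "GET / HTTP/1.1" 200 1024 "-" "Mozilla/5.0 (X11; Linux x86_64)"',
]

agents, ips = tally(sample)
for agent, count in agents.most_common(10):
    print(count, agent)
for ip, count in ips.most_common(10):
    print(count, ip)
```

To run it against real traffic, pass an open file handle for your access log to `tally` instead of `sample`. The log's location depends on your setup, and a custom `log_format` would need a different pattern. An unfamiliar name or address at the top of either list is the one to investigate.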

I run a very image-heavy website with over 28,000 images (currently), and I've done many lookups on "Have I Been Trained" with no results. That means the works on this website, and those of all the artists it represents, have (thus far) not been gobbled up by an AI bot.

So if you are concerned about protecting your works, it's imperative to ALWAYS keep an eye on what is visiting your website, and to report suspicious visitors here for addition so others can also benefit.

FOR THOSE WHO WANT TO ALLOW IT

Just add CCBot to your custom whitelist: https://github.com/mitchellkrogza/nginx-ultimate-bad-bot-blocker/blob/master/bots.d/blacklist-user-agents.conf