A question on many people's minds these days is the crawling of their photos and data by bots that collect material for AI training datasets.
LAION builds its datasets from Common Crawl data, which is gathered by the CCBot user agent (Common Crawl Bot), and CCBot has been blocked by this blocker since inception.
Regardless of the version number used by CCBot, this blocker still blocks it, as I don't use version numbers when detecting bots and never will.
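To illustrate why the version number does not matter, here is a minimal Python sketch of name-only matching. It is not the blocker's actual rule set: the pattern `CCBOT_PATTERN` and the helper `is_blocked` are hypothetical names made up for this example, and the sketch simply assumes a case-insensitive whole-word match on the bot name.

```python
import re

# Hypothetical pattern for illustration: match the bot name as a whole word,
# ignoring case and ignoring whatever version string follows it.
CCBOT_PATTERN = re.compile(r"\bCCBot\b", re.IGNORECASE)


def is_blocked(user_agent: str) -> bool:
    """Return True if the User-Agent string contains the bot name."""
    return CCBOT_PATTERN.search(user_agent) is not None


# Made-up sample User-Agent strings with different (or no) version numbers.
samples = [
    "CCBot/2.0 (https://commoncrawl.org/faq/)",
    "CCBot/3.1",
    "Mozilla/5.0 (compatible; ccbot/9.9)",
    "Mozilla/5.0 (X11; Linux x86_64) Firefox/120.0",
]

for ua in samples:
    print(is_blocked(ua), ua)
```

Because the match is on the name alone, a new CCBot release with a different version string would still be caught without any change to the rule.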
So for those worrying about protecting their content: yes, the bad bot blocker does it.
Is it foolproof?
Unfortunately, nothing is. Others who may be building AI training datasets can easily change their bot's User-Agent name and bypass the current blocks.
This is why it's important to monitor your logs (see instructions). When you suddenly see a lot of new traffic, or suspicious-looking traffic, look at your logs to see what that crawler or IP did on your site. If it was up to no good, report the user agent or range of IP addresses here for addition.
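As a rough starting point for that kind of log check, the Python sketch below counts requests per User-Agent and per IP address. It assumes your access log uses nginx's default "combined" format; the `summarize` function, the `LOG_LINE` pattern, and the sample log lines are all made up for illustration, so adjust them to your own setup.

```python
import re
from collections import Counter

# Assumes nginx "combined" log format:
# IP - user [time] "request" status bytes "referer" "user-agent"
LOG_LINE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[[^\]]*\] "[^"]*" \d{3} \S+ "[^"]*" "(?P<ua>[^"]*)"'
)


def summarize(lines):
    """Count requests per User-Agent and per IP; skip lines that don't parse."""
    agents, ips = Counter(), Counter()
    for line in lines:
        match = LOG_LINE.match(line)
        if match is None:
            continue
        agents[match.group("ua")] += 1
        ips[match.group("ip")] += 1
    return agents, ips


# Made-up sample lines; in practice pass an open log file instead.
SAMPLE = [
    '203.0.113.7 - - [10/Jan/2024:10:00:00 +0000] "GET /img/1.jpg HTTP/1.1" 200 5120 "-" "ExampleScraper/1.0"',
    '203.0.113.7 - - [10/Jan/2024:10:00:01 +0000] "GET /img/2.jpg HTTP/1.1" 200 4096 "-" "ExampleScraper/1.0"',
    '198.51.100.4 - - [10/Jan/2024:10:00:05 +0000] "GET / HTTP/1.1" 200 1024 "-" "Mozilla/5.0 (X11; Linux x86_64)"',
    'this line is not in combined format',
]

agents, ips = summarize(SAMPLE)
print("Top user agents:", agents.most_common(5))
print("Top IP addresses:", ips.most_common(5))
```

To run it against a real log, open the file (for example `open("/var/log/nginx/access.log")`, if that is where your logs live) and pass the file object to `summarize`. An unfamiliar User-Agent or IP near the top of either list is a good candidate for a closer look.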
I run a very image-heavy website with over 28,000 images (currently), and I've done many lookups on "Have I Been Trained" with no results, meaning the works on this website, and those of all the artists it represents, have (thus far) not been gobbled up by an AI bot.
So if you are concerned about protecting your works, it's imperative to keep an eye (ALWAYS) on what is visiting your website and to report it here for addition so others can also benefit.
FOR THOSE WHO WANT TO ALLOW IT
Just add CCBot to your custom whitelist: https://github.com/mitchellkrogza/nginx-ultimate-bad-bot-blocker/blob/master/bots.d/blacklist-user-agents.conf