Open mitchellkrogza opened 3 years ago
Dumb computers treat Googlebot and tobegoGtlo the same because they all consist of letters.
I’d say “nearly impossible in traditional ways.” But let me try:
Example: 1I4oK7gTbyej
User agents usually don’t have a word that goes numbers–letters–numbers. Because these strings are random, they have unusual arrangements of characters. When you see 1I4, you flag it as a bad one. You could try numbers–letters–numbers–letters as well to minimize false positives.
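A rough sketch of that numbers–letters–numbers–letters check in Python, in case it helps anyone experiment. The pattern, the token separators, and the `looks_random` name are my own assumptions, not something this blocker ships, and I have not run it against real logs, so expect false positives on unusual build IDs.

```python
import re

# Assumed pattern: digits, letters, digits, letters in a row inside one token.
ALTERNATING = re.compile(r"\d+[A-Za-z]+\d+[A-Za-z]+")

# Assumed token separators for a User-Agent string.
SEPARATORS = re.compile(r"[\s/;()]+")


def looks_random(user_agent: str) -> bool:
    """Return True if any token contains a digit-letter-digit-letter run."""
    return any(ALTERNATING.search(token) for token in SEPARATORS.split(user_agent))


if __name__ == "__main__":
    for ua in ("1I4oK7gTbyej", "Googlebot/2.1", "tobegoGtlo"):
        print(ua, looks_random(ua))
```

Note that a letters-only string such as tobegoGtlo is not caught by this check.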
Names in user agents are usually easy for humans to pronounce, because they are meant to be. Calculate the pronounceability of each word with some linguistic magic, and flag those that are difficult to pronounce as bad ones.
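For the pronounceability idea, a very crude stand-in for the "linguistic magic" could be the longest run of characters with no vowel in it. This is only a sketch under my own assumptions: the vowel set (with y counted as a vowel), the threshold of 3, and both function names are made up for illustration.

```python
# Assumption: treat y as a vowel; digits and consonants both count as "unvoiced".
VOWELS = set("aeiouy")


def longest_unvoiced_run(word: str) -> int:
    """Length of the longest run of characters that are not vowels."""
    longest = current = 0
    for ch in word.lower():
        if ch in VOWELS:
            current = 0
        else:
            current += 1
            longest = max(longest, current)
    return longest


def hard_to_pronounce(word: str, max_run: int = 3) -> bool:
    """Flag a word whose longest vowel-free run exceeds max_run (assumed threshold)."""
    return longest_unvoiced_run(word) > max_run


if __name__ == "__main__":
    for word in ("1I4oK7gTbyej", "Googlebot", "tobegoGtlo"):
        print(word, longest_unvoiced_run(word), hard_to_pronounce(word))
```

With these settings 1I4oK7gTbyej is flagged and Googlebot is not, but tobegoGtlo also passes, so a heuristic this simple does not tell a shuffled name from a real one.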
Train an AI with good user agents and bad ones, over and over, and let it magically flag bad ones.
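To make the "train it on good and bad ones" idea concrete, here is a toy sketch, nowhere near a real model. It counts transitions between character classes (digit, upper, lower, other) in known-good and known-bad tokens, applies add-one smoothing, and picks whichever label gives the higher log-likelihood. The training lists are tiny and illustrative: the good names are just common user agent tokens I picked, and the only bad sample is the one from this thread.

```python
import math
from collections import Counter

N_TRANSITIONS = 16  # 4 character classes x 4 character classes


def char_class(ch: str) -> str:
    """Map a character to one of four coarse classes."""
    if ch.isdigit():
        return "d"
    if ch.isupper():
        return "U"
    if ch.islower():
        return "l"
    return "o"


def transitions(token: str) -> list:
    """Adjacent character-class pairs, e.g. 'Go' gives [('U', 'l')]."""
    classes = [char_class(c) for c in token]
    return list(zip(classes, classes[1:]))


def train(tokens) -> Counter:
    """Count class transitions over a list of example tokens."""
    counts = Counter()
    for token in tokens:
        counts.update(transitions(token))
    return counts


def log_likelihood(token: str, counts: Counter) -> float:
    """Add-one smoothed log-likelihood of the token's transitions."""
    total = sum(counts.values())
    return sum(
        math.log((counts[tr] + 1) / (total + N_TRANSITIONS))
        for tr in transitions(token)
    )


# Illustrative training data only (assumption, far too small for real use).
GOOD = train(["Googlebot", "bingbot", "Mozilla", "Chrome", "Safari", "Firefox", "curl"])
BAD = train(["1I4oK7gTbyej"])


def classify(token: str) -> str:
    """Return 'bad' if the bad model explains the token better, else 'good'."""
    return "bad" if log_likelihood(token, BAD) > log_likelihood(token, GOOD) else "good"


if __name__ == "__main__":
    for token in ("Googlebot", "1I4oK7gTbyej", "tobegoGtlo"):
        print(token, classify(token))
```

Even this toy version labels tobegoGtlo as good, because its letters-only shape looks like the good examples.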
But none of this stops attackers with sufficient malicious intent from easily bypassing all of these elaborate filters. Would you introduce new filters whenever they lengthen it, shorten it, stick to letters only, mimic other good bots, or put Fortune 500 brand names in their UA?
Once randomized, the strings need not stay fixed; they can become whatever it takes to evade your filters.
It’s beyond this blocker’s job and not worth the processing power needed. I’d rather focus on blocking branded bad bots.
I gave up on this. Most seem to have stopped anyway, and anyone can just masquerade as Mozilla/5.0.
Seeing a lot of these in my logs.
Anyone with a good regex pattern to catch these without causing any false positives elsewhere?