Extend bot rules - Githubissues

jocel1 commented 1 year ago

simplify / use more generic bot rules
add extra bots (ia_archiver, gtmetrix, lighthouse)

dridi commented 1 year ago

Bonjour @jocel1,

Since your change is doing two distinct things, I would rather see two commits. There's also no explanation or justification for why we should generalize certain rules. Not being a historical maintainer of this project, I can't tell why choices were made and whether it's a good idea to challenge them.

One thing you could do, for example, is share a list of user agents so we can add test coverage and make sure we don't break previous expectations.

jocel1 commented 1 year ago

Hi @Dridi!

The main reason to add "google" is to cover the Google AdSense user agent: Mediapartners-Google. I also checked that Google Pixel devices don't have "google" in their user agent, but we could perhaps add just this one (Mediapartners-Google) instead.
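To illustrate the point about "google", here is a minimal Python sketch. It assumes the rule is a case-insensitive substring match, and Python's `re` module only stands in for the regex engine Varnish uses. The Pixel user agent below is an assumed, illustrative string and is not taken from this thread.

```python
import re

# Hypothetical generic rule: case-insensitive match on "google".
generic_google = re.compile(r"(?i)google")

# AdSense user agent named in the comment above.
ADSENSE_UA = "Mediapartners-Google"

# Assumed example of a Pixel browser user agent (illustrative only).
PIXEL_UA = (
    "Mozilla/5.0 (Linux; Android 13; Pixel 7) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/116.0.0.0 Mobile Safari/537.36"
)

adsense_is_bot = generic_google.search(ADSENSE_UA) is not None
pixel_is_bot = generic_google.search(PIXEL_UA) is not None

print("AdSense matched:", adsense_is_bot)
print("Pixel matched:", pixel_is_bot)
```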

For spider, I often discover new bots like:

user-agent: Mozilla/5.0 (Linux; Android 5.0) AppleWebKit/537.36 (KHTML, like Gecko) Mobile Safari/537.36 (compatible; Bytespider; https://zhanzhang.toutiao.com/)

user-agent: Mozilla/5.0 (compatible; seoscanners.net/1; +spider@seoscanners.net)

user-agent: CheckMarkNetwork/1.0 (+http://www.checkmarknetwork.com/spider.html)

so having a generic "spider" rule was easier, and it seems to be as safe as "bot".
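As a quick check of the generic "spider" idea, the sketch below runs a case-insensitive `spider` pattern against the three user agents quoted above. It is written in Python for illustration only. The pattern is an assumption about how such a rule could look, not the actual rule from the change.

```python
import re

# The three user agents quoted in the comment above.
SPIDER_UAS = [
    "Mozilla/5.0 (Linux; Android 5.0) AppleWebKit/537.36 (KHTML, like Gecko) "
    "Mobile Safari/537.36 (compatible; Bytespider; https://zhanzhang.toutiao.com/)",
    "Mozilla/5.0 (compatible; seoscanners.net/1; +spider@seoscanners.net)",
    "CheckMarkNetwork/1.0 (+http://www.checkmarknetwork.com/spider.html)",
]

# Hypothetical generic rule: case-insensitive match on "spider".
generic_spider = re.compile(r"(?i)spider")

# One boolean per user agent, in the same order as SPIDER_UAS.
matched = [generic_spider.search(ua) is not None for ua in SPIDER_UAS]

for ua, hit in zip(SPIDER_UAS, matched):
    print(hit, ua)
```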

ia_archiver is a common bot: https://user-agents.net/string/ia-archiver

I also changed facebook to match user-agent: facebookcatalog/1.0

For the last one, (?i)(web)crawler, the syntax suggests that (?i)(web)?crawler was intended, to match for example:

user-agent: Mozilla/5.0 (compatible; BLEXBot/1.0; +http://webmeup-crawler.com/)
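The difference between the two patterns can be sketched in Python, assuming Python's `re` behaves like the regex engine used by Varnish for these simple expressions. With `(web)crawler` the literal "web" must directly precede "crawler", so the BLEXBot string, which contains "webmeup-crawler", is not matched. With `(web)?crawler` the prefix is optional and the plain "crawler" is found.

```python
import re

# User agent quoted in the comment above.
BLEXBOT_UA = "Mozilla/5.0 (compatible; BLEXBot/1.0; +http://webmeup-crawler.com/)"

current = re.compile(r"(?i)(web)crawler")    # "web" is required
proposed = re.compile(r"(?i)(web)?crawler")  # "web" is optional

current_match = current.search(BLEXBOT_UA)
proposed_match = proposed.search(BLEXBOT_UA)

print("current pattern matched:", current_match is not None)
print("proposed pattern matched:", proposed_match is not None)
```

Note that this particular user agent would also be caught by a generic "bot" rule because of "BLEXBot", so the example only demonstrates the behavior of the crawler pattern itself.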

For gtmetrix / lighthouse, I don't know if we should treat them as bots or not. Perhaps we could create a new category for them, like "synthetic-bot"? (We could also add "Synthetic" to that category to match Dynatrace.)

varnishcache / varnish-devicedetect

Extend bot rules #48