Open jocel1 opened 1 year ago
Hello @jocel1,
Since your change is doing two distinct things, I would rather see two commits. There's also no explanation or justification for why we should generalize certain rules. Not being a historical maintainer of this project, I can't tell why choices were made and whether it's a good idea to challenge them.
One thing you could do, for example, is share a list of user agents so we can add test coverage and make sure we don't break previous expectations.
Hi @Dridi!
For the first one: (?i)(ads|google|bing|msn|yandex|baidu|ro|career|seznam|)bot
is strictly equivalent to (?i)bot
since the alternation ends with an empty "|" branch, so the group can also match the empty string.
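A quick way to sanity-check that equivalence is sketched below. It uses Python's re module only as a stand-in for whatever regex engine the project actually runs, and the sample strings are illustrative assumptions, not a real test corpus.

```python
import re

# Rule as quoted above, and its proposed simplification.
LONG = re.compile(r"(?i)(ads|google|bing|msn|yandex|baidu|ro|career|seznam|)bot")
SHORT = re.compile(r"(?i)bot")

# Hypothetical sample strings, chosen only for illustration.
samples = [
    "Googlebot/2.1 (+http://www.google.com/bot.html)",
    "Mozilla/5.0 (compatible; BLEXBot/1.0; +http://webmeup-crawler.com/)",
    "SomeUnknownBot/0.1",
    "Mozilla/5.0 (X11; Linux x86_64; rv:120.0) Gecko/20100101 Firefox/120.0",
    "curl/8.0.1",
]


def same_verdict(ua: str) -> bool:
    """Return True when both rules agree on whether `ua` is a bot."""
    return bool(LONG.search(ua)) == bool(SHORT.search(ua))


for ua in samples:
    print(same_verdict(ua), ua)
```

Agreement on a handful of strings is not a proof, but the empty branch means any string containing "bot" already satisfies the longer rule, so the two cannot disagree.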
The main reason to add "google" is to cover the Google AdSense user agent: Mediapartners-Google. I also checked that Google Pixels don't have "google" in their user agent, but we could perhaps add only Mediapartners-Google instead.
For spider, I often discover new bots like:
user-agent: Mozilla/5.0 (Linux; Android 5.0) AppleWebKit/537.36 (KHTML, like Gecko) Mobile Safari/537.36 (compatible; Bytespider; https://zhanzhang.toutiao.com/)
user-agent: Mozilla/5.0 (compatible; seoscanners.net/1; +spider@seoscanners.net)
user-agent: CheckMarkNetwork/1.0 (+http://www.checkmarknetwork.com/spider.html)
so having a generic "spider" rule was easier, and it seems to be as safe as "bot".
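As a rough illustration, the three user agents above can be checked against a generic rule. The pattern (?i)spider is my assumption of what the generic rule would look like, Python's re module is only a stand-in for the project's regex engine, and the browser string is a hypothetical negative example.

```python
import re

# Assumed form of the generic rule discussed above.
SPIDER = re.compile(r"(?i)spider")

# User agents quoted in this comment.
user_agents = [
    "Mozilla/5.0 (Linux; Android 5.0) AppleWebKit/537.36 (KHTML, like Gecko) "
    "Mobile Safari/537.36 (compatible; Bytespider; https://zhanzhang.toutiao.com/)",
    "Mozilla/5.0 (compatible; seoscanners.net/1; +spider@seoscanners.net)",
    "CheckMarkNetwork/1.0 (+http://www.checkmarknetwork.com/spider.html)",
]

# Hypothetical ordinary browser, which should not be flagged.
browser = "Mozilla/5.0 (X11; Linux x86_64; rv:120.0) Gecko/20100101 Firefox/120.0"

spider_hits = [bool(SPIDER.search(ua)) for ua in user_agents]
print(spider_hits)
print(bool(SPIDER.search(browser)))
```

A list like this could also serve as the test coverage requested earlier in the thread.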
ia_archiver is a common bot: https://user-agents.net/string/ia-archiver
I also changed facebook to match
user-agent: facebookcatalog/1.0
For the last one: (?i)(web)crawler, the syntax suggests that (?i)(web)?crawler was intended, to match for example:
user-agent: Mozilla/5.0 (compatible; BLEXBot/1.0; +http://webmeup-crawler.com/)
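The difference between the two spellings can be seen on that user agent. This is only a sketch, with Python's re module standing in for the project's regex engine.

```python
import re

# Current rule: "web" is mandatory, so only "webcrawler" matches.
STRICT = re.compile(r"(?i)(web)crawler")
# Suggested rule: "web" is optional, so a bare "crawler" matches too.
OPTIONAL = re.compile(r"(?i)(web)?crawler")

ua = "Mozilla/5.0 (compatible; BLEXBot/1.0; +http://webmeup-crawler.com/)"

print(bool(STRICT.search(ua)))    # False: the string has "webmeup-crawler", not "webcrawler"
print(bool(OPTIONAL.search(ua)))  # True: "crawler" alone is enough
```

With the optional group, the capture around "web" no longer changes what matches, so (?i)crawler would give the same verdicts.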
For gtmetrix / lighthouse, I don't know whether we should treat them as bots or not. Perhaps we could create a new category for them, like "synthetic-bot"? (We could also add "Synthetic" to it, to match Dynatrace as well.)