varnishcache / varnish-devicedetect

VCL based device detection for Varnish Cache.

Extend bot rules #48

Open jocel1 opened 1 year ago

jocel1 commented 1 year ago
dridi commented 1 year ago

Bonjour @jocel1,

Since your change is doing two distinct things, I would rather see two commits. There's also no explanation or justification for why we should generalize certain rules. Not being a historical maintainer of this project, I can't tell why choices were made and whether it's a good idea to challenge them.

One thing you could do for example is share a list of user agents to add test coverage, to make sure we don't break previous expectations.

jocel1 commented 1 year ago

Hi @Dridi!

For the first one: (?i)(ads|google|bing|msn|yandex|baidu|ro|career|seznam|)bot is strictly equivalent to (?i)bot, since the group ends with an empty alternative after the last "|".
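A minimal sketch of that equivalence, using Python's re module as a stand-in for the PCRE engine Varnish uses (an assumption; the two engines agree on this simple syntax). The sample user agents and the helper name same_verdict are chosen here for illustration and are not part of the project's test data:

```python
import re

# Pattern quoted in the comment above, and its proposed simplification.
ORIGINAL = re.compile(r"(?i)(ads|google|bing|msn|yandex|baidu|ro|career|seznam|)bot")
SIMPLIFIED = re.compile(r"(?i)bot")

# Illustrative sample user agents (not taken from the project's tests).
SAMPLES = [
    "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)",
    "Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)",
    "Mozilla/5.0 (compatible; BLEXBot/1.0; +http://webmeup-crawler.com/)",
    "Mozilla/5.0 (X11; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/115.0",
]


def same_verdict(user_agent: str) -> bool:
    """Return True when both patterns agree on whether the user agent matches."""
    return bool(ORIGINAL.search(user_agent)) == bool(SIMPLIFIED.search(user_agent))


for ua in SAMPLES:
    # Print whether the two patterns agree, then the shared match verdict.
    print(same_verdict(ua), bool(SIMPLIFIED.search(ua)))
```

Because the last alternative in the group is empty, the group can always match zero characters, so any string containing "bot" (in any case) satisfies both patterns.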

The main reason to add "google" is to cover the Google AdSense user agent: Mediapartners-Google. I also checked that Google Pixel devices don't have "google" in their user agent, but we could perhaps add just this one user agent instead.

For spider, I often discover new bots like Mozilla/5.0 (Linux; Android 5.0) AppleWebKit/537.36 (KHTML, like Gecko) Mobile Safari/537.36 (compatible; Bytespider; https://zhanzhang.toutiao.com/), Mozilla/5.0 (compatible; seoscanners.net/1; +spider@seoscanners.net) or CheckMarkNetwork/1.0 (+http://www.checkmarknetwork.com/spider.html), so a generic "spider" rule was easier to maintain, and it seems to be as safe as "bot".
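As a quick check, the three user agents quoted above can be run against a generic rule. This is only a sketch: the pattern (?i)spider is an assumed form of the proposed rule, the helper name matches_spider is hypothetical, and Python's re module stands in for Varnish's regex engine:

```python
import re

# Assumed form of the generic rule discussed above.
SPIDER = re.compile(r"(?i)spider")

# The three user agents quoted in the comment above.
USER_AGENTS = [
    "Mozilla/5.0 (Linux; Android 5.0) AppleWebKit/537.36 (KHTML, like Gecko) "
    "Mobile Safari/537.36 (compatible; Bytespider; https://zhanzhang.toutiao.com/)",
    "Mozilla/5.0 (compatible; seoscanners.net/1; +spider@seoscanners.net)",
    "CheckMarkNetwork/1.0 (+http://www.checkmarknetwork.com/spider.html)",
]


def matches_spider(user_agent: str) -> bool:
    """Return True when the user agent contains "spider" in any letter case."""
    return SPIDER.search(user_agent) is not None


for ua in USER_AGENTS:
    print(matches_spider(ua), ua)
```

All three match through different parts of the string (the product token, an e-mail address, and a URL path), which is the reason a substring rule covers them without per-bot entries.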

ia_archiver is a common bot: https://user-agents.net/string/ia-archiver

I also changed facebook to match user-agent: facebookcatalog/1.0

For the last one: the syntax of (?i)(web)crawler suggests that (?i)(web)?crawler was intended, to match for example:

user-agent: Mozilla/5.0 (compatible; BLEXBot/1.0; +http://webmeup-crawler.com/)
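The difference between the two forms can be sketched as follows, again with Python's re module standing in for Varnish's regex engine (an assumption). The names STRICT and OPTIONAL are introduced here for illustration only:

```python
import re

# Current rule: "web" is a mandatory group, so only "webcrawler" matches.
STRICT = re.compile(r"(?i)(web)crawler")
# Suggested rule: "web" is optional, so a bare "crawler" matches too.
OPTIONAL = re.compile(r"(?i)(web)?crawler")

# User agent quoted in the comment above.
BLEXBOT = "Mozilla/5.0 (compatible; BLEXBot/1.0; +http://webmeup-crawler.com/)"

print(STRICT.search(BLEXBOT) is not None)    # "webmeup-crawler" is not "webcrawler"
print(OPTIONAL.search(BLEXBOT) is not None)  # the bare "crawler" is enough
```

Note that for a plain yes/no match, (?i)(web)?crawler accepts the same strings as (?i)crawler; the optional group only changes what is captured.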

For GTmetrix / Lighthouse, I don't know whether we should treat them as bots or not. Perhaps we could create a new category for them, like "synthetic-bot"? (We could also add "Synthetic" to that category, to match Dynatrace as well.)