Closed Zodiac1978 closed 1 year ago
Additionally, we could make these identifiers filterable, so advanced users could extend the list based on their own usage and experience.
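A filterable list could look roughly like the sketch below. It is written in Python for brevity and is not Statify's PHP code; `DEFAULT_BOT_IDENTIFIERS`, `add_identifier_filter`, `get_bot_identifiers` and `is_bot` are hypothetical names, and the callback registry only loosely mimics a WordPress-style filter hook.

```python
from typing import Callable, List

# Hypothetical default list; the real identifiers would come from the plugin.
DEFAULT_BOT_IDENTIFIERS: List[str] = ["bot", "crawl", "spider", "slurp"]

# Registered callbacks: each receives the current list and returns a new one.
_filters: List[Callable[[List[str]], List[str]]] = []


def add_identifier_filter(callback: Callable[[List[str]], List[str]]) -> None:
    """Register a callback that may extend or reduce the identifier list."""
    _filters.append(callback)


def get_bot_identifiers() -> List[str]:
    """Return the default identifiers after all registered filters have run."""
    identifiers = list(DEFAULT_BOT_IDENTIFIERS)
    for callback in _filters:
        identifiers = callback(identifiers)
    return identifiers


def is_bot(user_agent: str) -> bool:
    """Case-insensitive substring check against the filtered list."""
    ua = user_agent.lower()
    return any(identifier in ua for identifier in get_bot_identifiers())


# An advanced user extends the list based on their own experience.
add_identifier_filter(lambda ids: ids + ["facebookexternalhit"])
```

With this shape the defaults stay small, and site-specific additions live outside the plugin.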
Stumbled upon another short variant for bots:
lighthouse|bot|google|baidu|bing|msn|duckduckbot|teoma|slurp|yandex|crawler|spider|robot|crawling
`bot` would already include `duckduckbot` and `robot`. And you could summarize `crawler` and `crawling` to `crawl`.
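The redundancy can be checked with a small sketch. It is written in Python only for illustration, the sample user agent strings are made up, and the case-insensitive flag is an assumption about how the pattern is applied.

```python
import re

# The pattern as quoted above.
LONG = re.compile(
    r"lighthouse|bot|google|baidu|bing|msn|duckduckbot|teoma|slurp|yandex"
    r"|crawler|spider|robot|crawling",
    re.IGNORECASE,
)

# Consolidated variant: "duckduckbot" and "robot" both contain "bot",
# and "crawl" covers "crawler" as well as "crawling".
SHORT = re.compile(
    r"lighthouse|bot|google|baidu|bing|msn|teoma|slurp|yandex|crawl|spider",
    re.IGNORECASE,
)

# Made-up user agent strings, for illustration only.
SAMPLES = [
    "DuckDuckBot/1.1",
    "SomeRobot/2.0",
    "ExampleCrawler/1.0",
    "crawling-agent",
    "Chrome-Lighthouse",
    "Mozilla/5.0 (X11; Linux x86_64) Firefox/100.0",
]


def same_verdict(user_agent: str) -> bool:
    """True if both patterns agree on whether the string looks like a bot."""
    return bool(LONG.search(user_agent)) == bool(SHORT.search(user_agent))


for ua in SAMPLES:
    print(ua, bool(SHORT.search(ua)), same_verdict(ua))
```

The shorter pattern is slightly broader than the original: it also matches strings such as `crawls`, which the longer one would not.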
Yes, I know. These are findings on the web. `lighthouse` would also match `chrome-lighthouse`, etc.
This needs consolidation and a decision. I am just sharing other projects' solutions to bot detection.
For example: Alexa will be shut down on 1st May 2022 (https://support.alexa.com/hc/en-us/articles/4410503838999), so `ia_archiver` no longer seems relevant and does not need to be added.
More user agent strings sorted by software name: https://developers.whatismybrowser.com/useragents/explore/software_name/
Using a Composer package like https://github.com/JayBizzle/Crawler-Detect sounds like a good idea to me. What do the others think? @2ndkauboy @krafit @pfefferle @stklcode
Implemented Composer Autoload for JayBizzle/Crawler-Detect and replaced bot detection in class-statify-frontend.php with CrawlerDetect function.
Pull request: #247
Added to the 2.0.0 milestone because the composer package needs PHP 5.3 and we are on PHP 5.2 currently.
At the moment we use a few strings to detect crawlers from the user agent string:
https://github.com/pluginkollektiv/statify/blob/667518428b30b0522367fb2c955d1913e1ef672f/inc/class-statify-frontend.php#L222-L236
We could add some more strings, like `seo`, `crawling` and `chrome-lighthouse` (borrowed from Koko Analytics): https://github.com/ibericode/koko-analytics/blob/18716dc9156a83e72b2967cec6dee8ce9acfdbe9/assets/src/js/script.js#L53
Looking at the 10 biggest crawlers, I think we get almost all of them. But Alexa is missing with their `ia_archiver`. Maybe something like `fetcher` and `scraper` too ... `facebookexternalhit` could be another candidate.

Or we could take the big step and use a third-party library like https://github.com/JayBizzle/Crawler-Detect to detect crawlers.
Another one (in JSON) would be https://github.com/monperrus/crawler-user-agents/blob/master/crawler-user-agents.json
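A JSON list like this could be turned into a single matcher, roughly as sketched below. This is a Python illustration, not plugin code. It assumes each entry carries a `pattern` field holding a regular expression, the inline excerpt only stands in for the real (much longer) file, and `build_matcher` and `is_crawler` are hypothetical names.

```python
import json
import re

# Illustrative excerpt only, shaped like entries in crawler-user-agents.json.
# In practice the file would be read from disk instead of being inlined.
SAMPLE_JSON = r"""
[
  {"pattern": "Googlebot\\/"},
  {"pattern": "bingbot"},
  {"pattern": "ia_archiver"}
]
"""


def build_matcher(entries):
    """Join all `pattern` fields into one alternation and compile it once."""
    combined = "|".join(f"(?:{entry['pattern']})" for entry in entries)
    return re.compile(combined)


CRAWLER_RE = build_matcher(json.loads(SAMPLE_JSON))


def is_crawler(user_agent: str) -> bool:
    """True if any pattern from the list occurs in the user agent string."""
    return CRAWLER_RE.search(user_agent) is not None


print(is_crawler("Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"))
```

Compiling one combined expression avoids looping over every entry per request, at the cost of having to rebuild the matcher whenever the list is updated.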