pluginkollektiv / statify

Statify – statistics plugin for WordPress
https://wordpress.org/plugins/statify/
GNU General Public License v3.0

Optimize bot detection #217

Closed Zodiac1978 closed 1 year ago

Zodiac1978 commented 3 years ago

At the moment we match a few substrings against the user agent string to detect crawlers:

https://github.com/pluginkollektiv/statify/blob/667518428b30b0522367fb2c955d1913e1ef672f/inc/class-statify-frontend.php#L222-L236

We could add some more strings, like seo, crawling and chrome-lighthouse (borrowed from Koko Analytics):

https://github.com/ibericode/koko-analytics/blob/18716dc9156a83e72b2967cec6dee8ce9acfdbe9/assets/src/js/script.js#L53

Looking at the 10 biggest crawlers, I think we catch almost all of them, but Alexa's ia_archiver is missing.

Maybe something like fetcher and scraper too ...

facebookexternalhit could be another candidate.
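
For illustration, a minimal sketch of how such an extended substring list could be checked. The helper name `example_is_bot` is hypothetical, and the list only contains the candidates named in this issue; it is not Statify's actual list or implementation.

```php
<?php
/**
 * Hypothetical helper: returns true if the user agent contains any of the
 * given identifiers (case-insensitive substring match).
 */
function example_is_bot( $user_agent, array $identifiers ) {
	foreach ( $identifiers as $identifier ) {
		if ( false !== stripos( $user_agent, $identifier ) ) {
			return true;
		}
	}
	return false;
}

// Candidate identifiers mentioned in this issue; not Statify's actual list.
$identifiers = array( 'seo', 'crawling', 'chrome-lighthouse', 'fetcher', 'scraper', 'facebookexternalhit' );

var_dump( example_is_bot( 'facebookexternalhit/1.1 (+http://www.facebook.com/externalhit_uatext.php)', $identifiers ) );
var_dump( example_is_bot( 'Mozilla/5.0 (X11; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/115.0', $identifiers ) );
```

Short identifiers such as seo can also match unrelated substrings, so each candidate would need checking against real traffic before being added.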

Or we could take the big step and use a third-party library like https://github.com/JayBizzle/Crawler-Detect to detect crawlers.

Another one (in JSON) would be https://github.com/monperrus/crawler-user-agents/blob/master/crawler-user-agents.json
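
A rough sketch of how a JSON list like this could be consumed. It assumes each entry is an object with a `pattern` key holding a regular expression; the two inline entries are illustrative samples in that assumed shape, not a copy of the real file, and `example_matches_crawler_list` is a hypothetical helper.

```php
<?php
// Inline sample in the assumed shape of crawler-user-agents.json.
$json = <<<'JSON'
[
    {"pattern": "Googlebot\\/"},
    {"pattern": "bingbot"}
]
JSON;

/**
 * Hypothetical helper: returns true if any entry's pattern matches the user agent.
 * Assumes the patterns do not contain the "#" delimiter used here.
 */
function example_matches_crawler_list( $user_agent, array $entries ) {
	foreach ( $entries as $entry ) {
		if ( 1 === preg_match( '#' . $entry['pattern'] . '#', $user_agent ) ) {
			return true;
		}
	}
	return false;
}

$entries = json_decode( $json, true );

var_dump( example_matches_crawler_list( 'Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)', $entries ) );
var_dump( example_matches_crawler_list( 'Mozilla/5.0 (X11; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/115.0', $entries ) );
```

Running one `preg_match` per entry on every request could get expensive with a long list, so a real integration would probably combine or cache the patterns.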

Zodiac1978 commented 2 years ago

Additionally, we could make these identifiers filterable. Advanced users could then extend the list based on their own usage and experience.
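
As a sketch of what that could look like with the usual WordPress filter mechanism: the hook name `statify__bot_identifiers`, the helper names, and the default list below are all hypothetical. The small `add_filter()` / `apply_filters()` stand-ins only exist so the snippet runs outside WordPress.

```php
<?php
// Minimal stand-ins so the sketch runs outside WordPress; inside WordPress
// the real add_filter() / apply_filters() would be used instead.
if ( ! function_exists( 'add_filter' ) ) {
	$GLOBALS['example_filters'] = array();

	function add_filter( $hook, $callback ) {
		$GLOBALS['example_filters'][ $hook ][] = $callback;
	}

	function apply_filters( $hook, $value ) {
		$callbacks = isset( $GLOBALS['example_filters'][ $hook ] ) ? $GLOBALS['example_filters'][ $hook ] : array();
		foreach ( $callbacks as $callback ) {
			$value = call_user_func( $callback, $value );
		}
		return $value;
	}
}

// Plugin side: expose the identifier list through a (hypothetical) filter.
function example_get_bot_identifiers() {
	$defaults = array( 'bot', 'crawl', 'spider' ); // Illustrative defaults only.
	return apply_filters( 'statify__bot_identifiers', $defaults );
}

// Site side: an advanced user appends an identifier of their own.
function example_add_identifiers( $identifiers ) {
	$identifiers[] = 'facebookexternalhit';
	return $identifiers;
}
add_filter( 'statify__bot_identifiers', 'example_add_identifiers' );

print_r( example_get_bot_identifiers() );
```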

Zodiac1978 commented 2 years ago

Stumbled upon another short variant for bots: lighthouse|bot|google|baidu|bing|msn|duckduckbot|teoma|slurp|yandex|crawler|spider|robot|crawling

MatzeKitt commented 2 years ago

bot would already include duckduckbot and robot. And you could collapse crawler and crawling into crawl.

Zodiac1978 commented 2 years ago

Yes, I know. These are findings on the web. lighthouse would also match chrome-lighthouse, etc.

This needs consolidation and a decision. I am just sharing other projects' solutions to bot detection.
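
To make the overlap concrete, a small sketch comparing the pattern as found with a consolidated variant. The consolidated pattern is only one possible reduction, and the sample user agents are simplified, partly made-up strings.

```php
<?php
// Pattern as found on the web, and a variant with the redundant alternatives removed.
$original     = '/lighthouse|bot|google|baidu|bing|msn|duckduckbot|teoma|slurp|yandex|crawler|spider|robot|crawling/i';
$consolidated = '/lighthouse|bot|google|baidu|bing|msn|teoma|slurp|yandex|crawl|spider/i';

// Simplified sample user agents; "ExampleRobot" and "ExampleCrawling" are made up.
$samples = array(
	'DuckDuckBot/1.1; (+http://duckduckgo.com/duckduckbot.html)',
	'Mozilla/5.0 (compatible) Chrome-Lighthouse',
	'ExampleRobot/1.0',
	'ExampleCrawling/2.0',
	'Mozilla/5.0 (X11; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/115.0',
);

foreach ( $samples as $user_agent ) {
	$before = (bool) preg_match( $original, $user_agent );
	$after  = (bool) preg_match( $consolidated, $user_agent );
	// Both columns should agree for every sample.
	printf( "%-5s %-5s %s\n", var_export( $before, true ), var_export( $after, true ), $user_agent );
}
```

Note that crawl is slightly broader than crawler|crawling, since it also matches a bare "crawl".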

Zodiac1978 commented 2 years ago

For example: Alexa will be shut down on 1 May 2022 (https://support.alexa.com/hc/en-us/articles/4410503838999), so ia_archiver no longer seems relevant and does not need to be added.

Zodiac1978 commented 2 years ago

More user agent strings sorted by software name: https://developers.whatismybrowser.com/useragents/explore/software_name/

florianbrinkmann commented 1 year ago

Using a Composer package like https://github.com/JayBizzle/Crawler-Detect sounds like a good idea to me. What do the others think? @2ndkauboy @krafit @pfefferle @stklcode

00Sleepy commented 1 year ago

Implemented Composer autoloading for JayBizzle/Crawler-Detect and replaced the bot detection in class-statify-frontend.php with the CrawlerDetect function.

Pull request: #247

florianbrinkmann commented 1 year ago

Added to the 2.0.0 milestone because the Composer package needs PHP 5.3 and we are currently on PHP 5.2.