silentsoft / hits

:chart_with_upwards_trend: Hit Counter for Your GitHub or Any Kind of Website You Want.
https://hits.sh
MIT License
100 stars · 12 forks

Bots detection #20

Open nikolaydubina opened 1 week ago

nikolaydubina commented 1 week ago

Is there a way (or any ideas) to detect real humans vs. bots that crawl webpages?

silentsoft commented 1 week ago

While filtering based on the User-Agent in request headers is possible, it can be easily bypassed. In your case, how about using a service like Google Analytics alongside Hits?
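The User-Agent filtering mentioned above could look roughly like this minimal sketch (the token list is illustrative, not what Hits actually uses):

```python
import re

# Illustrative substrings that commonly appear in crawler User-Agent
# headers; real lists are much longer and trivially evaded, since any
# client can send whatever User-Agent it wants.
BOT_PATTERN = re.compile(
    r"bot|crawler|spider|slurp|curl|wget|python-requests",
    re.IGNORECASE,
)

def looks_like_bot(user_agent: str) -> bool:
    """Heuristic: True if the User-Agent matches a known bot token."""
    return bool(BOT_PATTERN.search(user_agent or ""))
```

This catches the honest crawlers that identify themselves, which is exactly why it is easy to bypass.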

nikolaydubina commented 1 week ago

I am using hits to avoid Google Analytics :P

reasons

nikolaydubina commented 1 week ago

What if we assume that crawlers are well-behaved and do not want to bypass protections?

e.g. robots.txt is respected by major crawlers

Is anything like this possible? Are there any well-known protocols/standards for talking to or detecting crawlers? (One way I can imagine: put a reverse proxy at the hits.sh end with a robots.txt that disallows going further, and then somehow make another HTTP request to your backend, now without the bots that stopped at the robots.txt block.)
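The assumption above is that well-behaved crawlers consult robots.txt before fetching. A minimal sketch with Python's standard `urllib.robotparser` (the `Disallow: /` rule for hits.sh is hypothetical):

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt that a reverse proxy at hits.sh could serve;
# a well-behaved crawler would stop here and never hit the backend.
ROBOTS_TXT = """\
User-agent: *
Disallow: /
"""

def allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)
```

Of course, this only filters crawlers that honor robots.txt; it does nothing against misbehaving bots.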

silentsoft commented 1 week ago

That sounds interesting. But if I do that, the Hits site might no longer be indexed by search engines.. 😥 In that case, I could limit the rule to '*.svg' requests only. But some users might want to use Hits to track download counts (for instance, increasing the hit count when a file is downloaded from a page), and I don't think I can confirm whether such a request comes from a bot or not 😂

silentsoft commented 1 week ago

Hmm.. Adding a parameter to the .svg request, like https://hits.sh/github.com/silentsoft.svg?blockBots, might be a possible solution!

nikolaydubina commented 1 week ago

Something like this will probably work.

Instead of a 304, a meta HTML tag could also be used: https://www.w3.org/TR/WCAG20-TECHS/H76.html

(screenshot attached)
nikolaydubina commented 1 week ago

Some other methods people have suggested:

(screenshot attached)