silentsoft / hits

:chart_with_upwards_trend: Hit Counter for Your GitHub or Any Kind of Website You Want.
https://hits.sh
MIT License
100 stars · 12 forks

Bots detection #20

Open nikolaydubina opened 1 week ago

nikolaydubina commented 1 week ago

Is there a way (or any ideas) to detect real humans vs. bots that crawl webpages?

silentsoft commented 1 week ago

While filtering based on the User-Agent in request headers is possible, it can be easily bypassed. In your case, how about using a service like Google Analytics alongside Hits?
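The User-Agent filtering mentioned above could look roughly like this minimal sketch (the token list is illustrative, not what Hits actually uses):

```python
import re

# Illustrative substrings that commonly appear in crawler User-Agent
# headers; real lists are much longer and trivially evaded, since any
# client can send whatever User-Agent it wants.
BOT_PATTERN = re.compile(
    r"bot|crawler|spider|slurp|curl|wget|python-requests",
    re.IGNORECASE,
)

def looks_like_bot(user_agent: str) -> bool:
    """Heuristic: True if the User-Agent matches a known bot token."""
    return bool(BOT_PATTERN.search(user_agent or ""))
```

This catches the honest crawlers that identify themselves, which is exactly why it is easy to bypass.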

nikolaydubina commented 1 week ago

I am using hits to avoid Google Analytics :P

reasons

nikolaydubina commented 1 week ago

What if we assume that crawlers are well-behaved and do not want to bypass protections?

e.g. robots.txt is respected by major crawlers

Is anything like this possible? Are there any well-known protocols/standards for talking to or detecting crawlers? (One way I can imagine: put a reverse proxy at the hits.sh end with a robots.txt that disallows going further, and then somehow make another HTTP request to your backend, now without the bots that stopped at the robots.txt block.)
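The assumption above is that well-behaved crawlers consult robots.txt before fetching. A minimal sketch with Python's standard `urllib.robotparser` (the `Disallow: /` rule for hits.sh is hypothetical):

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt that a reverse proxy at hits.sh could serve;
# a well-behaved crawler would stop here and never hit the backend.
ROBOTS_TXT = """\
User-agent: *
Disallow: /
"""

def allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)
```

Of course, this only filters crawlers that honor robots.txt; it does nothing against misbehaving bots.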

silentsoft commented 1 week ago

That sounds interesting. But if I do that, the Hits site might no longer be indexed by search engines.. 😥 In that case, I could limit the rule to '*.svg' requests only. But some users might want to use Hits to track download counts (for instance, increasing the hit count when a file is downloaded from a page), and I don't think I can confirm whether such a request comes from a bot or not 😂

silentsoft commented 1 week ago

Hmm.. Adding a parameter to the .svg request, like https://hits.sh/github.com/silentsoft.svg?blockBots, might be a possible solution!

nikolaydubina commented 1 week ago

Something like this will probably work.

Instead of a 304, a meta HTML tag could also be used: https://www.w3.org/TR/WCAG20-TECHS/H76.html

(screenshot attached)
nikolaydubina commented 1 week ago

Some other methods people have suggested:

(screenshot attached)