monperrus / crawler-user-agents

Syntactic patterns of HTTP user-agents used by bots / robots / crawlers / scrapers / spiders. pull-request welcome :star:
MIT License

Disambiguate http clients from crawlers/bots #374

Open srstsavage opened 1 month ago

srstsavage commented 1 month ago

I was surprised to find HTTP clients like python-requests, Go-http-client, wget, curl, etc. included in the crawler list. While I understand that these tools can be abused, in our case a large portion of our legitimate web traffic comes from API requests made with HTTP clients like these.

For now I think I'll need to create an overriding allow list of patterns and remove matches from agents.Crawlers before processing, but it would be great to be able to disambiguate client tools/libraries based on a field in crawler-user-agents.json. Maybe just an is_client boolean, or a more generic tags string array which could contain client or similar? Any thoughts?
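As a rough sketch of the workaround described above, here is what an overriding allow list could look like in Python. The pattern names in `CLIENT_PATTERNS` and the inline sample entries are illustrative; a real implementation would load the full crawler-user-agents.json (whose entries do carry a `pattern` field) instead of the tiny sample below.

```python
import re

# Tiny sample in the shape of crawler-user-agents.json entries
# (normally you would load the full file with json.load).
crawlers = [
    {"pattern": "Googlebot"},
    {"pattern": "python-requests"},
    {"pattern": "^curl"},
]

# Overriding allow list: HTTP-client patterns we do NOT want
# treated as crawlers (names here are illustrative).
CLIENT_PATTERNS = {"python-requests", "^curl", "Go-http-client", "[wW]get"}

# Drop allow-listed entries before building the matching regex.
filtered = [c for c in crawlers if c["pattern"] not in CLIENT_PATTERNS]
regex = re.compile("|".join(c["pattern"] for c in filtered))

def is_crawler(user_agent: str) -> bool:
    """True if the user agent matches a non-client crawler pattern."""
    return regex.search(user_agent) is not None
```

With the sample above, Googlebot still matches while python-requests and curl fall through to normal API traffic.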

srstsavage commented 1 month ago

I'm sure I missed a few, but it looks like the list isn't too long:

aiohttp
Apache-HttpClient
^curl
Go-http-client
http_get
httpx
libwww-perl
node-fetch
okhttp
python-requests
Python-urllib
[wW]get

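For reference, the patterns listed above can be combined into a single regex to detect client traffic, taking the entries verbatim from the list (the helper name `is_http_client` is just illustrative):

```python
import re

# The client patterns listed above, joined into one alternation.
CLIENT_RE = re.compile(
    "|".join([
        "aiohttp", "Apache-HttpClient", "^curl", "Go-http-client",
        "http_get", "httpx", "libwww-perl", "node-fetch", "okhttp",
        "python-requests", "Python-urllib", "[wW]get",
    ])
)

def is_http_client(user_agent: str) -> bool:
    """True if the user agent looks like a generic HTTP client/library."""
    return CLIENT_RE.search(user_agent) is not None
```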
monperrus commented 1 month ago

Completely see your point. I like the idea of having optional tags:

"tags": ["generic-client"]

Would you do a pull-request? Thanks!
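To illustrate how the proposed optional tags could be consumed, here is a minimal sketch. Note the `tags` field and the `generic-client` value follow the suggestion in this thread and are not yet part of crawler-user-agents.json; the sample entries are illustrative.

```python
import json

# Hypothetical entries once the proposed optional "tags" field exists.
entries = json.loads("""
[
  {"pattern": "Googlebot", "url": "http://www.google.com/bot.html"},
  {"pattern": "python-requests", "tags": ["generic-client"]}
]
""")

# Keep only entries not tagged as generic clients; entries without
# a "tags" field are treated as crawlers, preserving current behavior.
crawlers_only = [
    e for e in entries if "generic-client" not in e.get("tags", [])
]
```

Because untagged entries default to crawler status, existing consumers of the JSON file would be unaffected by the new field.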