monperrus / crawler-user-agents

Syntactic patterns of HTTP user-agents used by bots / robots / crawlers / scrapers / spiders. pull-request welcome :star:
MIT License

Suggestion: test against false positives #350

Closed starius closed 3 months ago

starius commented 3 months ago

Context

I use the package to distinguish crawlers from human users in an HTTP server. The goal is to prevent crawlers from "spoiling" one-time links shared in Discord and similar chats, which request every link posted to a chat in order to build a preview. Because the link is one-time, the crawler's request consumes it, so it no longer opens when the human user clicks it. I solved this by blocking crawler access to such links. For more details, see https://github.com/starius/pasta/issues/8
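The blocking logic above can be sketched roughly as follows. This is a minimal illustration, not the actual pasta implementation; the pattern list is a stand-in for the full `crawler-user-agents.json`, and the handler names are hypothetical:

```python
import re
from http.server import BaseHTTPRequestHandler

# Assumption: a couple of entries standing in for the full pattern list
# shipped in crawler-user-agents.json.
CRAWLER_PATTERNS = [re.compile(p) for p in (r"Discordbot", r"Slackbot")]

def is_crawler(user_agent: str) -> bool:
    """Return True if the User-Agent matches any crawler pattern."""
    return any(p.search(user_agent) for p in CRAWLER_PATTERNS)

class OneTimeLinkHandler(BaseHTTPRequestHandler):
    """Hypothetical handler serving one-time links."""

    def do_GET(self):
        ua = self.headers.get("User-Agent", "")
        if is_crawler(ua):
            # A chat preview bot would consume the one-time link;
            # refuse it so the human recipient can still open the link.
            self.send_error(403, "crawlers may not open one-time links")
            return
        self.send_response(200)
        self.end_headers()
        self.wfile.write(b"secret content; link is now invalidated")
```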

Danger of false positives

If a legitimate browser sends a User-Agent that accidentally matches one of the patterns, the user won't be able to access the link, because the site will treat the request as originating from a crawler.

I would guess that other uses of this package would also benefit from minimizing false positives.

Proposed solution

Let's add a CI test that runs the most common User-Agents through the patterns and fails if any of them matches. The list of User-Agents can be loaded from https://github.com/microlinkhq/top-user-agents/tree/master/src. If somebody adds a pattern that matches any of them, this will be detected early and prevented. Likewise, if a popular browser starts sending a User-Agent that accidentally matches one of the existing patterns, the test will fail.
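The proposed check could look something like the sketch below. This is an illustration under assumptions: both lists here are tiny inline samples, whereas the real test would load the full pattern list from `crawler-user-agents.json` and the full User-Agent list from the top-user-agents repository:

```python
import re

# Assumption: sample of regex patterns from crawler-user-agents.json.
CRAWLER_PATTERNS = [
    r"Googlebot",
    r"bingbot",
    r"Slurp",
]

# Assumption: sample of popular browser User-Agents; the real test would
# load the list published by microlinkhq/top-user-agents.
TOP_USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 "
    "(KHTML, like Gecko) Version/17.0 Safari/605.1.15",
]

def false_positives(patterns, user_agents):
    """Return (pattern, user_agent) pairs where a browser UA matches a crawler pattern."""
    compiled = [re.compile(p) for p in patterns]
    return [
        (c.pattern, ua)
        for c in compiled
        for ua in user_agents
        if c.search(ua)
    ]

if __name__ == "__main__":
    hits = false_positives(CRAWLER_PATTERNS, TOP_USER_AGENTS)
    # CI fails (non-zero exit) if any browser UA matches a crawler pattern.
    assert not hits, f"false positives detected: {hits}"
    print("no false positives")
```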

monperrus commented 3 months ago

Excellent idea! Looking forward to the PR.

monperrus commented 3 months ago

closed by #348