monperrus / crawler-user-agents

Syntactic patterns of HTTP user-agents used by bots / robots / crawlers / scrapers / spiders. pull-request welcome :star:
MIT License
1.13k stars · 242 forks

Add multiple bots found in the logs #332

Closed · fekir closed 9 months ago

monperrus commented 10 months ago

Thanks! CI is failing, though.

fekir commented 10 months ago

I've removed some bots (I'll eventually re-add them in a separate PR). How should I handle, for example, the following error?

```
ValueError: Pattern 'deadlinkchecker' is a subset of 'LinkChecker'
```

It is not a subset: the casing is different, and they are different bots, so the url fields should differ.
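
For context, a minimal sketch of the kind of subset check that could produce this error, assuming (as the message suggests) that patterns are compared case-insensitively; the check and variable names are illustrative, not the project's actual CI code:

```python
import re

existing = "LinkChecker"
candidate = "deadlinkchecker"

# Case-insensitively, 'LinkChecker' matches the substring 'linkchecker'
# inside 'deadlinkchecker', so every user agent the candidate pattern
# matches is already matched by the existing one.
if re.search(existing, candidate, re.IGNORECASE):
    raise ValueError(f"Pattern {candidate!r} is a subset of {existing!r}")
```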

monperrus commented 9 months ago

That's a lot of new bots, thanks! What was your strategy for identifying them?

fekir commented 9 months ago

I grepped case-insensitively for the literals "bot", "spider", and "crawler" after applying all your regexes.
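
A sketch of that strategy, assuming the repository's crawler-user-agents.json layout (a list of objects with a "pattern" field) and a hypothetical access.log containing one user-agent string per line:

```python
import json
import re

# Compile every existing pattern from the project's JSON file.
with open("crawler-user-agents.json") as f:
    patterns = [re.compile(entry["pattern"]) for entry in json.load(f)]

# Literals to grep for, case-insensitively.
literals = re.compile(r"bot|spider|crawler", re.IGNORECASE)

with open("access.log") as log:  # hypothetical input
    for ua in (line.strip() for line in log):
        if any(p.search(ua) for p in patterns):
            continue  # already covered by an existing pattern
        if literals.search(ua):
            print(ua)  # candidate for a new entry
```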

monperrus commented 9 months ago

Simple and effective.

monperrus commented 9 months ago

> How should I handle, for example, the following error?

The bot is already matched by the superset regexp, so we're safe.

fekir commented 9 months ago

> > How should I handle, for example, the following error?
>
> The bot is already matched by the superset regexp, so we're safe.

Ah, I see: I should execute all the regexes case-insensitively.
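
A quick check of that conclusion (the user-agent string is invented for illustration): with re.IGNORECASE, the existing 'LinkChecker' pattern already matches a 'deadlinkchecker' user agent, which is why the separate entry is redundant:

```python
import re

ua = "Mozilla/5.0 deadlinkchecker/1.0"

assert re.search("LinkChecker", ua, re.IGNORECASE)  # matched case-insensitively
assert not re.search("LinkChecker", ua)             # a case-sensitive search misses it
```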