monperrus / crawler-user-agents

Syntactic patterns of HTTP user agents used by bots / robots / crawlers / scrapers / spiders. Pull requests welcome :star:
MIT License
1.19k stars 254 forks

Add Golang package #348

Closed starius closed 7 months ago

starius commented 7 months ago

The Go package embeds the JSON file with the patterns using Go's go:embed feature, so it stays in sync with the JSON file automatically; no manual updates of the Go package are needed.

The JSON file is parsed when the Go package is loaded and exposed in the API as a Go slice of type Crawler. The functions IsCrawler and MatchingCrawlers check whether a User-Agent string belongs to a crawler. They use the go-re2 library to run the regexps, which is much faster than the standard library regexp engine. I implemented MatchingCrawlers in a smart way to improve performance: I combine the regexps into a binary tree and use it when searching. Since RE2 runs one large combined regexp faster than each regexp individually, this brings a speed-up.
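The binary-tree idea can be sketched as follows. This is a minimal illustration using the standard library regexp engine rather than go-re2, with a tiny made-up pattern list; names like `build` and `matching` are illustrative, not the PR's actual identifiers. Each node stores the union of all patterns below it, so a non-matching subtree is pruned with a single regexp test.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// node is one vertex of the binary tree. Every node stores the union
// regexp of all patterns below it; leaves also keep the individual regexps.
type node struct {
	union       *regexp.Regexp
	left, right *node
	leaf        []*regexp.Regexp
}

// build combines the patterns into a binary tree of union regexps.
func build(patterns []string, leafSize int) *node {
	n := &node{union: regexp.MustCompile("(?:" + strings.Join(patterns, ")|(?:") + ")")}
	if len(patterns) <= leafSize {
		for _, p := range patterns {
			n.leaf = append(n.leaf, regexp.MustCompile(p))
		}
		return n
	}
	mid := len(patterns) / 2
	n.left = build(patterns[:mid], leafSize)
	n.right = build(patterns[mid:], leafSize)
	return n
}

// matching returns the patterns that match ua, skipping whole subtrees
// whose union regexp does not match.
func (n *node) matching(ua string, out []string) []string {
	if !n.union.MatchString(ua) {
		return out
	}
	if n.leaf != nil {
		for _, re := range n.leaf {
			if re.MatchString(ua) {
				out = append(out, re.String())
			}
		}
		return out
	}
	return n.right.matching(ua, n.left.matching(ua, out))
}

func main() {
	tree := build([]string{"Googlebot", "AhrefsBot", "bingbot", "DuckDuckBot"}, 2)
	fmt.Println(tree.matching("Mozilla/5.0 (compatible; Googlebot/2.1)", nil))
	// → [Googlebot]
}
```

With RE2's linear-time engine the union test at each inner node is cheap regardless of how many patterns it combines, which is where the speed-up comes from.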

I also added a GitHub workflow that runs the tests and benchmarks of the Go package on each push.

To achieve the best performance possible in functions IsCrawler and MatchingCrawlers, install C++ RE2 into your system:

sudo apt-get install libre2-dev

and pass the build tag: -tags re2_cgo

starius commented 7 months ago

GitHub Actions workflow: https://github.com/starius/crawler-user-agents/actions/workflows/golang.yml Please enable it in the repo.

monperrus commented 7 months ago

Thanks a lot for the great contribution!

I've asked for PR approval by Go experts.

starius commented 7 months ago

Added an example of Go program and fixed copy-paste in Go benchmark.

monperrus commented 7 months ago

Thanks a lot @starius, I really appreciate it.

We really worry about software supply chain security for crawler-user-agents (cc/ @ericcornelissen @javierron), and we would like to keep minimal external dependencies.

In particular, I'd like to remove the dependencies on stretchr/testify and on tetratelabs/wazero (an entire runtime).

If this means moving from wasilibs/go-re2 to the Go standard regexp engine, we should probably do it.

What do you think?

starius commented 7 months ago

Thank you for the feedback!

I removed stretchr/testify; it was only used in tests.

I acknowledge the problems with wazero and re2; I just hit a crash in re2 related to wazero. I switched back to the Go standard regexp engine. It turned out not to be that bad if the regexps are checked one by one rather than as one combined regexp for all patterns: one IsCrawler call takes 66 microseconds on an Intel(R) Core(TM) i7-7820HQ CPU @ 2.90GHz.
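The one-by-one approach is straightforward; here is a minimal sketch with the standard regexp package, using a tiny illustrative subset of the pattern list (the function name is hypothetical):

```go
package main

import (
	"fmt"
	"regexp"
)

// A tiny illustrative subset of the crawler patterns, precompiled once.
var patterns = []*regexp.Regexp{
	regexp.MustCompile("Googlebot"),
	regexp.MustCompile("Ahrefs(Bot|SiteAudit)"),
	regexp.MustCompile("bingbot"),
}

// isCrawler checks the patterns one by one. With the standard library's
// backtracking-free but alternation-sensitive engine, many small regexps
// can beat one giant alternation of all patterns.
func isCrawler(ua string) bool {
	for _, re := range patterns {
		if re.MatchString(ua) {
			return true
		}
	}
	return false
}

func main() {
	fmt.Println(isCrawler("AhrefsSiteAudit/6.1"))
	// → true
}
```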

starius commented 7 months ago

I pushed another commit to check against false positives. It fixes https://github.com/monperrus/crawler-user-agents/issues/350

monperrus commented 7 months ago

great, many thanks @starius

monperrus commented 7 months ago

Hi @starius

An afterthought from @javierron: the way the combined regex is written, we still need to do n regex matches when matching against the two depth=1 nodes (and then some more). Maybe a trie-based join approach would be better?

WDYT?

starius commented 7 months ago

Hi @monperrus !

Using a trie looks good to me!

The only trie implementation in the Go standard library I am aware of is https://pkg.go.dev/strings#NewReplacer We can make a replacer that replaces all the patterns with an empty string and run it over the user agent; if the string changes, something matched. For MatchingCrawlers we can replace each pattern with some unique prefix followed by the crawler ID and then extract it.
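The marker trick for MatchingCrawlers could look something like this. A hedged sketch: the function name, marker format, and pattern list are all illustrative, and it assumes the marker byte sequence never occurs in a real user agent.

```go
package main

import (
	"fmt"
	"strings"
)

// matchingIDs replaces every literal pattern with a unique marker that
// embeds the pattern's index, then scans the result for markers to
// recover which crawlers matched. strings.NewReplacer matches all
// patterns in a single pass (internally it builds a trie).
func matchingIDs(patterns []string, ua string) []int {
	const marker = "\x00id:" // assumed never to occur in user agents
	pairs := make([]string, 0, 2*len(patterns))
	for i, p := range patterns {
		pairs = append(pairs, p, fmt.Sprintf("%s%d;", marker, i))
	}
	out := strings.NewReplacer(pairs...).Replace(ua)

	var ids []int
	for _, part := range strings.Split(out, marker)[1:] {
		var id int
		fmt.Sscanf(part, "%d;", &id) // the index sits right after the marker
		ids = append(ids, id)
	}
	return ids
}

func main() {
	patterns := []string{"Googlebot", "AhrefsBot", "bingbot"}
	fmt.Println(matchingIDs(patterns, "Mozilla/5.0 (compatible; bingbot/2.0)"))
	// → [2]
}
```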

The problem is that some regexps are not just search strings but actually use regexp syntax, e.g. "Ahrefs(Bot|SiteAudit)", "AdsBot-Google([^-]|$)", "S[eE][mM]rushBot" etc. Some of them can be expanded into a series of literal strings, e.g. "Ahrefs(Bot|SiteAudit)" => "AhrefsBot", "AhrefsSiteAudit", and added to the trie as separate items. The small minority of complex patterns can still be checked as regexps.
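A hybrid matcher along these lines might be sketched as below. The split between literals and complex patterns, and the `isCrawler` name, are illustrative assumptions; only a handful of patterns from the discussion are shown.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// Simple patterns expanded into literal strings for the trie-backed
// Replacer; "Ahrefs(Bot|SiteAudit)" becomes two separate literals.
var literals = []string{
	"Googlebot",
	"AhrefsBot", "AhrefsSiteAudit",
}

// The minority of genuinely regexp patterns stay as compiled regexps.
var complexPatterns = []*regexp.Regexp{
	regexp.MustCompile("AdsBot-Google([^-]|$)"),
	regexp.MustCompile("S[eE][mM]rushBot"),
}

var replacer = func() *strings.Replacer {
	pairs := make([]string, 0, 2*len(literals))
	for _, l := range literals {
		pairs = append(pairs, l, "") // replace each literal with ""
	}
	return strings.NewReplacer(pairs...)
}()

// isCrawler first tries the cheap single-pass literal match, then falls
// back to the few real regexps.
func isCrawler(ua string) bool {
	if replacer.Replace(ua) != ua { // the string changed: a literal matched
		return true
	}
	for _, re := range complexPatterns {
		if re.MatchString(ua) {
			return true
		}
	}
	return false
}

func main() {
	fmt.Println(isCrawler("AhrefsSiteAudit/6.1"), isCrawler("SEMrushBot/1.0"))
	// → true true
}
```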

starius commented 7 months ago

@monperrus See https://github.com/monperrus/crawler-user-agents/pull/353