starius closed this 7 months ago
GitHub Actions workflow: https://github.com/starius/crawler-user-agents/actions/workflows/golang.yml Please enable it in the repo.
Thanks a lot for the great contribution!
I've asked for PR approval by Go experts.
Added an example Go program and fixed a copy-paste error in the Go benchmark.
Thanks a lot @starius, I really appreciate it.
We really care about software supply chain security for crawler-user-agents (cc/ @ericcornelissen @javierron), and we would like to keep external dependencies minimal.
In particular, I'd like to remove the dependencies on stretchr/testify and tetratelabs/wazero (an entire runtime).
If this means moving from wasilibs/go-re2 to the Go standard regexp, we probably have to do this.
What do you think?
Thank you for feedback!
I removed stretchr/testify; it was only used in tests.
I acknowledge the problems with wazero and re2. I just caught a crash in re2 related to wazero! I switched back to the Go standard regexp. It turned out to be not as bad if the regexps are checked one by one rather than combined into one regexp for all patterns. One IsCrawler call takes 66 microseconds on an Intel(R) Core(TM) i7-7820HQ CPU @ 2.90GHz.
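The one-by-one approach can be sketched with the standard library regexp package. This is a minimal illustration with a hand-picked sample of patterns; isCrawler here is a hypothetical helper, not the package's exported function:

```go
package main

import (
	"fmt"
	"regexp"
)

// A small sample of patterns; the real package loads the full
// crawler-user-agents JSON list.
var patterns = []*regexp.Regexp{
	regexp.MustCompile(`Googlebot`),
	regexp.MustCompile(`Ahrefs(Bot|SiteAudit)`),
	regexp.MustCompile(`S[eE][mM]rushBot`),
}

// isCrawler checks each compiled pattern one by one, rather than
// matching a single giant regexp combining all patterns.
func isCrawler(userAgent string) bool {
	for _, re := range patterns {
		if re.MatchString(userAgent) {
			return true
		}
	}
	return false
}

func main() {
	fmt.Println(isCrawler("Mozilla/5.0 (compatible; Googlebot/2.1)")) // true
	fmt.Println(isCrawler("Mozilla/5.0 (Windows NT 10.0)"))           // false
}
```

Compiling each pattern once up front and looping over the compiled regexps avoids the pathological behavior the standard engine shows on one huge alternation.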
I pushed another commit to check against false positives. It fixes https://github.com/monperrus/crawler-user-agents/issues/350
great, many thanks @starius
Hi @starius
An afterthought from @javierron: the way the regex is written, we still need to do n regex matches when matching against the two depth=1 nodes (and then some more). Maybe a trie-based join approach would be better?
WDYT?
Hi @monperrus !
Using a trie looks good to me!
The only trie implementation in the Go standard library I am aware of is https://pkg.go.dev/strings#NewReplacer
We can make a replacer that replaces all the patterns with an empty string and run it over the user agent. If the string changes, something matched. For MatchingCrawlers we can replace with a unique prefix followed by the crawler ID and then extract it.
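The marker idea can be sketched like this. The "\x00<id>\x00" encoding and the matchingCrawlers helper are made up for illustration; only literal (non-regexp) patterns work this way:

```go
package main

import (
	"fmt"
	"strings"
)

// Map each literal pattern to a unique marker carrying the crawler ID.
// strings.NewReplacer builds a trie internally, so all patterns are
// searched in one pass over the user agent.
var replacer = strings.NewReplacer(
	"Googlebot", "\x00googlebot\x00",
	"AhrefsBot", "\x00ahrefsbot\x00",
	"bingbot", "\x00bingbot\x00",
)

// matchingCrawlers returns the IDs of all patterns found in the user
// agent: if the replacer changed the string, something matched, and
// the markers tell us which patterns.
func matchingCrawlers(userAgent string) []string {
	replaced := replacer.Replace(userAgent)
	if replaced == userAgent {
		return nil
	}
	var ids []string
	for {
		start := strings.IndexByte(replaced, '\x00')
		if start < 0 {
			break
		}
		rest := replaced[start+1:]
		end := strings.IndexByte(rest, '\x00')
		if end < 0 {
			break
		}
		ids = append(ids, rest[:end])
		replaced = rest[end+1:]
	}
	return ids
}

func main() {
	fmt.Println(matchingCrawlers("Mozilla/5.0 (compatible; Googlebot/2.1)"))
}
```

A real implementation would need a marker byte guaranteed not to occur in user agents; NUL is a plausible but unverified choice here.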
The problem is that some regexps are not just search strings but actually use regexp syntax, e.g. "Ahrefs(Bot|SiteAudit)", "AdsBot-Google([^-]|$)", "S[eE][mM]rushBot", etc.
Some of them can be turned into a series of strings, e.g. "Ahrefs(Bot|SiteAudit)" => "AhrefsBot", "AhrefsSiteAudit", and added to the trie as separate items. The small minority of complex patterns can be checked as regexps.
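The split into expandable and truly complex patterns could look like this. The expansion is hand-written for illustration (a real implementation could walk the parse tree from regexp/syntax), and isCrawler is a hypothetical helper:

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// Simple alternations expanded into plain strings, suitable for a trie
// or plain substring search.
var literals = []string{
	"AhrefsBot",       // from "Ahrefs(Bot|SiteAudit)"
	"AhrefsSiteAudit", // from "Ahrefs(Bot|SiteAudit)"
}

// Truly complex patterns stay as regexps and are checked separately.
var complexPatterns = []*regexp.Regexp{
	regexp.MustCompile(`AdsBot-Google([^-]|$)`),
}

func isCrawler(userAgent string) bool {
	for _, lit := range literals {
		if strings.Contains(userAgent, lit) {
			return true
		}
	}
	for _, re := range complexPatterns {
		if re.MatchString(userAgent) {
			return true
		}
	}
	return false
}

func main() {
	fmt.Println(isCrawler("AhrefsSiteAudit/6.1"))
	fmt.Println(isCrawler("AdsBot-Google (+http://www.google.com/adsbot.html)"))
}
```

This keeps the fast path (literal lookup) for the vast majority of patterns and falls back to the regexp engine only for the handful that need it.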
@monperrus See https://github.com/monperrus/crawler-user-agents/pull/353
The Go package embeds the JSON file with patterns using Go's go:embed feature, so the package stays in sync with the JSON file automatically; no manual updates are needed.
The JSON file is parsed when the package is loaded and exposed in the API as a Go slice of type Crawler. The functions IsCrawler and MatchingCrawlers check whether a User-Agent string belongs to a crawler. They use the go-re2 library to run the regexps, which is much faster than the standard library regexp engine. I implemented MatchingCrawlers in a smart way to improve performance: the regexps are combined into a binary tree that is used during the search. Since RE2 runs faster on one large regexp than on each regexp individually, this brings a speed-up.
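The binary-tree idea can be sketched as follows. The names node, build, and matching are hypothetical, and the standard regexp package stands in for go-re2; the point is that a non-matching inner node lets us skip its whole subtree:

```go
package main

import (
	"fmt"
	"regexp"
)

// node is one vertex of a binary tree of combined regexps: an inner
// node's regexp is the alternation of all patterns in its subtree, so
// the subtree can be skipped when that combined regexp does not match.
type node struct {
	re          *regexp.Regexp
	left, right *node
	id          int // leaf: index of the original pattern; -1 otherwise
}

func build(patterns []string, ids []int) *node {
	combined := "(?:" + patterns[0] + ")"
	for _, p := range patterns[1:] {
		combined += "|(?:" + p + ")"
	}
	n := &node{re: regexp.MustCompile(combined), id: -1}
	if len(patterns) == 1 {
		n.id = ids[0]
		return n
	}
	mid := len(patterns) / 2
	n.left = build(patterns[:mid], ids[:mid])
	n.right = build(patterns[mid:], ids[mid:])
	return n
}

// matching descends only into subtrees whose combined regexp matches
// and collects the IDs of the matching leaves.
func matching(n *node, ua string, out *[]int) {
	if !n.re.MatchString(ua) {
		return
	}
	if n.left == nil {
		*out = append(*out, n.id)
		return
	}
	matching(n.left, ua, out)
	matching(n.right, ua, out)
}

func main() {
	patterns := []string{`Googlebot`, `bingbot`, `Ahrefs(Bot|SiteAudit)`, `S[eE][mM]rushBot`}
	root := build(patterns, []int{0, 1, 2, 3})
	var out []int
	matching(root, "Mozilla/5.0 (compatible; bingbot/2.0)", &out)
	fmt.Println(out)
}
```

With an engine like RE2, whose matching time does not blow up on large alternations, the root check costs roughly as much as a single pattern, so non-crawler user agents are rejected after one match attempt.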
I also provided a GitHub workflow to run tests and benchmarks of the Go package on each push.
To achieve the best possible performance in the functions IsCrawler and MatchingCrawlers, install C++ RE2 into your system and pass the build tag:
-tags re2_cgo