ua-parser / uap-go

Go implementation of ua-parser

Allow a regular expression package to be replaced by another package. #87

Open aka-ao opened 6 months ago

aka-ao commented 6 months ago

Related to PR #59, the regexp package is very slow.

I suggest creating an interface and changing the library so that regular-expression processing can be injected externally as a dependency.
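To illustrate the suggestion, here is a minimal sketch of what such an injection point could look like. The `Matcher` and `Compiler` names are hypothetical, not part of the uap-go API; the standard library's `*regexp.Regexp` already satisfies this interface, and an alternative engine would provide its own `Compiler`.

```go
package main

import (
	"fmt"
	"regexp"
)

// Matcher is a hypothetical interface the library could depend on instead
// of using the regexp package directly.
type Matcher interface {
	MatchString(s string) bool
	FindStringSubmatch(s string) []string
}

// Compiler turns a pattern into a Matcher, letting callers inject a
// different regex engine that honors the same contract.
type Compiler func(pattern string) (Matcher, error)

// StdCompile adapts the standard library's regexp package to the interface.
func StdCompile(pattern string) (Matcher, error) {
	return regexp.Compile(pattern) // *regexp.Regexp implements Matcher
}

func main() {
	var compile Compiler = StdCompile
	m, err := compile(`Firefox/(\d+)`)
	if err != nil {
		panic(err)
	}
	ua := "Mozilla/5.0 Firefox/120.0"
	fmt.Println(m.MatchString(ua))            // true
	fmt.Println(m.FindStringSubmatch(ua)[1])  // 120
}
```

As dgoldstein0 notes below, the catch with this design is that different engines implement slightly different regex flavors, so behavior could silently diverge depending on which engine is injected.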

dgoldstein0 commented 6 months ago

Since that PR, we've added LRU caching to the library, which makes frequently seen user agents parse almost instantly. Is the performance with that really a problem?

I'd be open to considering faster regex implementations if it gives a large benefit, though I'd think it'd make more sense if the regex engine we use was not exposed for dependency injection - if any regex engine is missing features we need or implements a slightly different flavor of regex which ends up having observable behavior differences, that would be problematic.

masklinn commented 1 month ago

> Since that PR, we've added LRU caching to the library, which makes frequently seen user agents parse almost instantly. Is the performance with that really a problem?

While that is true, it only helps for user agents which have been seen before and are still in the cache by their second hit. FWIW, a while back a Dailymotion employee kindly provided a sample of their access logs[^dailymats], which should be realistic[^1]: it has ~75k entries, of which ~20k are unique, with a very long tail of "one-hit wonders", user agents seen only once and never again (this is relevant because, as I learned, LRU handles one-hit wonders poorly at low cache sizes). On that file, when I investigated different cache algorithms I got the following hit rates at cache size 1000 (which I understand is what you're using):

- belady[^belady] (1000): 60% hit rate
- lru (1000): 37% hit rate
- s3fifo (1000): 50% hit rate
- sieve (1000): 48% hit rate

So even with a cache, the regex parsing performance can be quite impactful.

With that said, rather than switching the regex library, what I would recommend is looking into whether somebody has implemented FilteredRE2 in Go: regexes.yaml has a lot of regexes (more than 600 device parsers as of 0.18), so a faster regex engine will not make that much of a difference in the end, I think.

What FilteredRE2 does is prefilter the regexes, then try only those which have a chance of matching (using something like Aho-Corasick to quickly eliminate regexes that cannot match because they require literal strings not present in the input). Needing to try only 10-15 regexes is a much more significant gain than trying 600 regexes a bit faster.

Another thing you may want to look at (though this one you will really have to benchmark for Go specifically, as regexp may or may not have this issue) is using Regexp.Match first, and only performing capture (retrieving the submatches) on matches: in both re2 and rust-regex the overhead of capture is very large (2-3x the cost of a match check), so it only takes 1-3 failures for "n match checks followed by one capture" to outperform "n captures". For this reason FilteredRE2 does not support capturing at all; after prefiltering it will only PartialMatch the selected regexes: https://github.com/google/re2/blob/main/re2/filtered_re2.cc#L98-L122

[^1]: although obviously it is realistic traffic for Dailymotion; a different site can have just as realistic traffic with a different pattern

[^belady]: Bélády's algorithm is a non-practical cache algorithm which can see into the future and has perfect knowledge of which entries to reject, evict, or accept; it is essentially the best that can be done at that cache size, making it a useful point of reference