monperrus / crawler-user-agents

Syntactic patterns of HTTP user-agents used by bots / robots / crawlers / scrapers / spiders. pull-request welcome :star:
MIT License

golang: speedup using TRIE #353

Open starius opened 2 months ago

starius commented 2 months ago

Most patterns are plain search strings containing no special regexp symbols. Some use ^ and $, which can be emulated in a plain-text search by attaching these characters to the ends of the text itself, so that they match as ordinary characters. Additionally, some patterns contain (xx|yy) or [xY] constructs, which expand to several plain strings. Only a few rare patterns require real regexp matching.

I've applied these simplifications. There are two tables: one maps a pattern to the list of plain search strings it expands to, while the other maps each rare pattern that needs real regexp matching to a specific string whose presence in the text signals a possible match. That specific string tells us when it is worth running the regexp.
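A rough sketch of what the two tables look like (the entries below are made up for illustration; the real tables in the PR cover the actual crawler-user-agents patterns):

```go
package main

import "fmt"

// pattern2literals: a pattern that expands to a finite set of
// plain search strings maps to that set.
var pattern2literals = map[string][]string{
	`(AhrefsBot|AhrefsSiteAudit)`: {"AhrefsBot", "AhrefsSiteAudit"},
}

// pattern2mainLiteral: a pattern that truly needs a regexp maps to
// a literal that occurs in every match, used as a cheap pre-filter
// before running the regexp.
var pattern2mainLiteral = map[string]string{
	`Googlebot-[a-zA-Z]+`: "Googlebot-",
}

func main() {
	fmt.Println(pattern2literals[`(AhrefsBot|AhrefsSiteAudit)`])
	fmt.Println(pattern2mainLiteral[`Googlebot-[a-zA-Z]+`])
}
```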

Each search string is replaced with a random 16-character hex string (to prevent accidental or intentional collisions with real input), followed by a label: "-" for simple search strings, "*" for rare cases requiring a regexp, and then the pattern index encoded in "%05d" format.

All replacements are performed with a single strings.Replacer, which is built on a trie and is therefore very fast. The random hex string is then searched for in the output of the replacement. If it is absent, there is no match. If it is present, it is either a definite match (for simple search-string labels) or a potential match (for regexp labels); in the latter case, the corresponding regexp is run on the text to verify the match.

Benchmark comparison:

$ benchstat old.txt new.txt
goos: linux
goarch: amd64
pkg: github.com/monperrus/crawler-user-agents
cpu: Intel(R) Core(TM) i7-7820HQ CPU @ 2.90GHz
                           │    old.txt    │               new.txt               │
                           │    sec/op     │   sec/op     vs base                │
IsCrawlerPositive-2          71.384µ ±  7%   1.535µ ± 3%  -97.85% (p=0.000 n=10)
MatchingCrawlersPositive-2   70.597µ ±  2%   1.586µ ± 1%  -97.75% (p=0.000 n=10)
IsCrawlerNegative-2          71.072µ ± 11%   1.747µ ± 4%  -97.54% (p=0.000 n=10)
MatchingCrawlersNegative-2   67.978µ ±  1%   1.723µ ± 2%  -97.47% (p=0.000 n=10)
geomean                       70.24µ         1.645µ       -97.66%

                           │    old.txt    │                 new.txt                 │
                           │      B/s      │      B/s       vs base                  │
IsCrawlerPositive-2          2.112Mi ±  7%   98.205Mi ± 3%  +4548.98% (p=0.000 n=10)
MatchingCrawlersPositive-2   2.131Mi ±  2%   95.029Mi ± 1%  +4358.39% (p=0.000 n=10)
IsCrawlerNegative-2          2.055Mi ± 10%   83.528Mi ± 4%  +3964.27% (p=0.000 n=10)
MatchingCrawlersNegative-2   2.146Mi ±  1%   84.710Mi ± 2%  +3847.78% (p=0.000 n=10)
geomean                      2.111Mi          90.14Mi       +4170.39%

The new implementation is over 40 times faster!

monperrus commented 2 months ago

thanks @starius

@javierron are you able to code-review this one?

starius commented 2 months ago

Hi @javierron !

Thank you very much for the review! I addressed the feedback, see my comments.

> Maybe the solution is to determine at initialization if a pattern is either literal or regex. I believe this would simplify analyzePattern

Can you elaborate on this, please? analyzePattern already determines automatically whether a pattern is a literal:

// LiteralPrefix returns a literal that must begin any match;
// complete is true when that literal is the entire pattern.
prefix, complete := re.LiteralPrefix()
if complete {
  return []string{prefix}, nil
}

(I put this check after the lookup in the pattern2literals table, because the lookup is faster than compiling the regexp; this way no regexp is compiled for patterns covered by pattern2literals.)

We still need both tables, pattern2literals and pattern2mainLiteral, because we have to do something with non-literal regexps. Maybe pattern2literals could be generated automatically, but such code would be quite complicated, so generating such a small table by hand seems not worth automating. As for pattern2mainLiteral, I don't know how to automate its creation even in theory: we would need to find a long enough literal that occurs in every match of the regexp, which requires in-depth analysis of the regexp's structure. And that hardcoded table is even smaller...

Do you have ideas how to do it in an elegant way?

monperrus commented 2 months ago

@starius thanks for the updates

@javierron let me know how we should proceed thanks.

javierron commented 2 months ago

Hi @starius Thanks for the response.

I mean, we could have two sets: one set of literals, checked with the trie solution; and another set of actual patterns, checked with a regex chain (regex1|regex2|...|regexn). This would avoid the issue of having to update the program every time a new crawler with a non-literal pattern is added.

@monperrus Does that make any sense?

starius commented 1 month ago

Hi @javierron @monperrus !

I think it is possible to implement func analyzePattern without hardcoded tables. Go has the package regexp/syntax, which provides introspection of regular expressions. I think it can be used both to convert a pattern into the list of all strings it can match (what pattern2literals does) and to extract the longest literal that occurs in every match of a regexp (what pattern2mainLiteral does). I'll try to implement this in a few days and update the PR.