ua-parser / uap-python

Python implementation of ua-parser
Apache License 2.0
561 stars 152 forks source link

Investigate skipping `re` entirely in `re2` resolver #208

Closed masklinn closed 6 months ago

masklinn commented 6 months ago

As it turns out re2.Filter compiles the patterns individually internally, and provides access to those. This means it ought be possible to extract helpers or adapt its match object in order to extract / substitute from the re2 match objects and never have to compile/match re regexes.

It's not like the re2 resolver needs to further stomp the basic resolver, but it does not hurt.

masklinn commented 6 months ago

Right so did a quick POC, helped a lot by the fact that the python bindings for re2 expose _Regexp and _Match objects which are exact matches for the stdlib, so converting the regex matching / data extraction from re to re2 was just a matter of copying some code from the matchers and doing some replacements.

However somewhat shockingly re2's regexps are nearly 100% slower than re's. re2's main claim to fame is to work in linear time and the doc specifically notes that optimistic backtracking engines will often be faster, but it's still surprising to have backtracking win out, by that much, even though re2 had already precompiled all the regexes and stuff. I guess it's possible the regex of uap-core are pretty much designed for backtracking engines, or at least have never really taken non-backtracking engines in consideration.

Here's the bench summary without and with, the -re columns are the original using re for data extraction, while the ones without use re2 for data extraction

re2-re re2-lru-re re2-s3fifo-re re2-sieve-re re2 re2-lru re2-s3fifo re2-sieve
10 33 33 32 33 59 62 62 60
20 33 32 31 31 59 62 61 60
50 33 32 30 29 59 60 57 57
100 33 31 27 27 59 60 52 53
200 33 29 25 25 59 56 49 48
500 33 25 22 21 59 49 41 41
1000 33 22 19 19 59 42 37 36
2000 33 18 16 16 59 35 29 30
5000 33 15 14 13 59 27 25 25