mozbugbox opened 9 years ago
Hmm, why do you want to reduce the regex size? It may be better to do the opposite - the larger the regex the better because it means more work is done in C land.
Note that the recommended regex engine (the one users should really run in production) is the pyre2 wrapper for the re2 library; it has no problems with large regexes. For regexes with many ORs (e.g. foo|bar|baz), re2's matching time does not grow with the number of alternatives: the engine checks all the variants at the same time, unlike stdlib's re module, which checks each alternative individually.
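A minimal sketch of the point above, using only stdlib re (the pattern words are made up for illustration). With pyre2 installed, the same pattern could run on re2's DFA, which doesn't slow down as alternatives are added:

```python
import re

# Build a large alternation, roughly like the combined adblock regex
# (hypothetical alternatives; a real filter list is far larger).
words = ["foo%d" % i for i in range(10000)] + ["target"]
pattern = re.compile("|".join(words))

# stdlib re is a backtracking engine: it tries alternatives one by one,
# so match time grows with the number of ORs in the pattern.
assert pattern.search("xx target yy")

# With pyre2 installed, "import re2 as re" gives the same compile/search
# API but matches all alternatives simultaneously via a DFA.
```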
Because re2 is not in the Python standard library, it is less cross-platform and harder for an end user to install. If re is fast enough, there's no need to depend on re2.
I agree that using stdlib re would be better.
But if, as you said, "the giant regex will be cut into 1/3 of the original size", stdlib re probably won't be fast enough. I've observed up to a 1000x speed difference between stdlib re and re2 for this huge regex. If we assume re works in O(N) with respect to the length of the regex, then cutting the regex to 1/3 of its original size only makes matching 3x faster - that is not enough. We could end up with a more complex implementation which is still ~300x slower than the existing re2-based one.
This is only speculation, and maybe I'm wrong.
On performance: for a url that matches a rule with a domain name, matching is reduced to urlparse-ing the url, one hash lookup in the dict, and a much shorter regex combining most likely fewer than 10 rules - compared to the likely 20,000 combined rules for EasyList, without any hash lookup.
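A minimal sketch of that lookup path, assuming a hypothetical per-domain index (the dict name, domain, and rule fragments here are invented for illustration, not the project's actual structure):

```python
import re
from urllib.parse import urlparse

# Hypothetical index: domain -> compiled regex of the few rules
# anchored to that domain. In practice each value would combine
# fewer than 10 rules, per the estimate above.
rules_by_domain = {
    "ads.example.com": re.compile(r"/banner|/track"),
}

def matches(url):
    domain = urlparse(url).netloc
    rx = rules_by_domain.get(domain)   # O(1) hash lookup
    return bool(rx and rx.search(url)) # short regex, if any

print(matches("http://ads.example.com/banner.gif"))  # True
print(matches("http://example.org/page.html"))       # False
```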
For rules without a domain name, the speed-up will not be as large. From my testing with "ad" and "/ad" in the url, the combined rule length can be reduced from 20,000 to 2,000+ joined rules. Of course these kinds of rules are the most annoying ones, because false positives have to be matched against every regex part of the rules.
With re2 it doesn't matter how many rules are combined in a regex. I like re2 because it feels right - a DFA is the right algorithm / data structure for this task; hash tables + NFA-based regexes are workarounds. It is a shame that re2 is so hard to install and the pyre2 PyPI release is so broken.
Oh, I was wrong about a url that matches rules with a domain. If it does not match any of the rules for that domain, it then has to be tested against the rules without domain info.
And the last update to the broken pyre2 on PyPI was 2011-11-15. :(
Most filter rules are domain-specific. If we put filters starting with "|", "||", "@@||" into their own lists of rules keyed by domain, the giant regex will be cut to 1/3 of its original size. The domain-specific rules will be quite short.
Each domain can then be put into a dict mapping to the respective filter rules.
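A rough sketch of that partitioning step, assuming EasyList-style rule text (the prefix regex and helper name are hypothetical, and a real parser would also handle "@@|" and separator characters like "^"):

```python
import re

def partition_rules(rules):
    """Split rules into a domain-keyed dict and a generic list."""
    by_domain = {}   # domain -> list of rules anchored to that domain
    generic = []     # rules with no domain anchor
    for rule in rules:
        # "||domain" and "@@||domain" rules are anchored to a domain.
        m = re.match(r"(@@)?\|\|([\w.-]+)", rule)
        if m:
            by_domain.setdefault(m.group(2), []).append(rule)
        else:
            generic.append(rule)
    return by_domain, generic

by_domain, generic = partition_rules(["||ads.example.com/banner", "/ad_box_"])
print(sorted(by_domain))  # ['ads.example.com']
print(generic)            # ['/ad_box_']
```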
And most of the remaining rules have the string "ad" in them. If we separate the remaining rules by whether a rule contains "ad", we can have a much smaller regex for urls without "ad".
Since most false-positive urls don't match a domain and have no "ad" in them, matching false positives should be much faster.
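The two-bucket split could look something like this sketch (the rule strings and function name are made up; real filter rules are not plain substrings, so re.escape is only a stand-in):

```python
import re

# Hypothetical buckets: rules whose text contains "ad" vs. the rest.
ad_rules = ["/ad_box_", "/adframe."]
other_rules = ["/banner_", "/popup_"]

rx_ad = re.compile("|".join(map(re.escape, ad_rules)))
rx_other = re.compile("|".join(map(re.escape, other_rules)))

def is_blocked(url):
    # Cheap substring pre-filter: urls without "ad" skip the big
    # "ad" regex entirely and only face the much smaller one.
    if "ad" in url and rx_ad.search(url):
        return True
    return bool(rx_other.search(url))

print(is_blocked("http://x.com/page"))      # False
print(is_blocked("http://x.com/ad_box_1"))  # True
```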