scrapinghub / adblockparser

Python parser for Adblock Plus filters
MIT License

Group filters into sub groups #2

Open mozbugbox opened 9 years ago

mozbugbox commented 9 years ago

Most filter rules are domain-specific. If we put filters starting with "|", "||" or "@@||" into their own lists of rules keyed by domain, the giant regex will be cut to about 1/3 of its original size. The domain-specific rules will be quite short.

Each domain can then be put into a dict mapping it to its respective filter rules.

And most of the remaining rules have the string "ad" in them. If we separate the remaining rules by whether a rule contains "ad", we can use a much smaller regex for URLs without "ad".

domain_rules = {
    "host1": merged_regex,
    "host2": merged_regex,
    ...
}
rules_with_ad = merged_regex
rules_without_ad = merged_regex

Since most false-positive URLs don't match any domain and have no "ad" in them, rejecting those URLs should be much faster.
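For illustration, a minimal sketch of how such a dispatch could look (the names mirror the pseudocode above; the placeholder patterns and the exact-host lookup are assumptions, real "||" rules also match subdomains):

import re
from urllib.parse import urlparse

# Hypothetical pre-built structures, one merged regex per bucket:
domain_rules = {
    "host1.example": re.compile(r"/banner/|/promo/"),
    "host2.example": re.compile(r"/track/pixel"),
}
rules_with_ad = re.compile(r"/ads?/|ad_banner")         # domain-less rules containing "ad"
rules_without_ad = re.compile(r"analytics\.js|beacon")  # remaining domain-less rules

def should_block(url):
    """Dispatch a URL to the smallest applicable merged regex."""
    host = urlparse(url).netloc
    domain_re = domain_rules.get(host)                  # O(1) hash lookup
    if domain_re is not None and domain_re.search(url):
        return True
    # Domain-less rules: URLs containing "ad" must be checked against both
    # buckets; URLs without "ad" can skip the large "ad" bucket entirely.
    if "ad" in url and rules_with_ad.search(url):
        return True
    return bool(rules_without_ad.search(url))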

kmike commented 9 years ago

Hmm, why do you want to reduce the regex size? It may be better to do the opposite - the larger the regex the better because it means more work is done in C land.

Note that the recommended regex engine (the one users should really run in production) is the pyre2 wrapper for the re2 library; it has no problems with large regexes. For regexes with many ORs (e.g. foo|bar|baz) re2 is not O(N) in the number of alternatives - the engine checks all the variants at the same time, unlike stdlib's re module, which checks each alternative individually.
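For illustration, a common way to prefer pyre2 when it is installed and fall back to stdlib re otherwise (a sketch, not adblockparser's actual code; the patterns are placeholders):

# Prefer the re2 bindings when available, fall back to stdlib re.
try:
    import re2 as re_engine    # pyre2 exposes an re-compatible interface
except ImportError:
    import re as re_engine

# One combined alternation: re2 walks all branches in a single pass,
# while stdlib re backtracks through the alternatives one by one.
patterns = [r"/ads/", r"banner\.gif", r"/track\.js"]   # placeholder rules
combined = re_engine.compile("|".join(patterns))

print(bool(combined.search("http://example.com/ads/1x1.gif")))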

mozbugbox commented 9 years ago

Because re2 is not in the Python standard library, it is less cross-platform and harder for an end user to install. If re is fast enough, there's no need to depend on re2.

kmike commented 9 years ago

I agree that using stdlib re would be better.

But if, as you said, "the giant regex will be cut into 1/3 of the original size", stdlib re probably won't be fast enough. I've observed up to a 1000x speed difference between stdlib re and re2 for this huge regex. If we assume re works in O(N) with respect to the length of the regex, then cutting the regex to 1/3 of its original size only makes matching about 3x faster - that is not enough. We could end up with a more complex implementation which is still ~300x slower than the existing re2-based one.

This is only speculation, and maybe I'm wrong.
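A quick synthetic way to check that assumption with stdlib re (placeholder rules rather than EasyList; absolute numbers will vary):

import re
import timeit

def build_combined_regex(n):
    # n literal alternatives that never match the test URL
    return re.compile("|".join("/banner%d/" % i for i in range(n)))

url = "http://example.com/some/ordinary/page.html"
for n in (20000, 20000 // 3):
    rx = build_combined_regex(n)
    seconds = timeit.timeit(lambda: rx.search(url), number=100)
    print("%6d alternatives: %.3f s" % (n, seconds))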

mozbugbox commented 9 years ago

On performance: for a URL which matches a rule with a domain name, the matching is reduced to a urlparse call, a hash lookup in the dict, and a much shorter regex of most likely fewer than 10 combined rules - compared to the likely 20,000 combined rules for EasyList without a hash lookup.

For rules without a domain name the speedup will not be as large. By testing for "ad" and "/ad" in the URL, the combined rule length can be cut from 20,000 to 2,000+ joined rules. Of course these are the most annoying rules, because a false positive has to be checked against every part of the combined regex.
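At build time, that split could be as simple as the following sketch (raw_rules is a hypothetical list of rule fragments already translated to regex syntax):

import re

def partition_rules(raw_rules):
    """Split domain-less rule fragments by whether they contain 'ad'.
    A rule whose literal text contains 'ad' can never match a URL that
    lacks 'ad', so such URLs only ever hit the smaller combined regex."""
    with_ad = [r for r in raw_rules if "ad" in r]
    without_ad = [r for r in raw_rules if "ad" not in r]
    return (re.compile("|".join(with_ad)),
            re.compile("|".join(without_ad)))

# Hypothetical fragments already converted from Adblock syntax:
raw_rules = [r"/ads/", r"ad_banner\.", r"analytics\.js", r"/track/pixel"]
rules_with_ad, rules_without_ad = partition_rules(raw_rules)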

kmike commented 9 years ago

With re2 it doesn't matter how many rules are combined in a regex. I like re2 because it feels right - a DFA is the right algorithm / data structure for this task; hash tables + NFA-based regexes are workarounds. It is a shame that re2 is so hard to install and the pyre2 PyPI release is so broken.

mozbugbox commented 9 years ago

Oh, I was wrong about a URL which matches rules with a domain. If it does not match any of the rules for that domain, it still has to be tested against the rules without domain info.

mozbugbox commented 9 years ago

And the last update to the broken pyre2 on PyPI was 2011-11-15. :(