scrapinghub / adblockparser

Python parser for Adblock Plus filters
MIT License

rule with single negated domain not matched correctly #1

Closed nbraem closed 9 years ago

nbraem commented 9 years ago

A rule with a single negated domain, such as "adv$domain=~example.com", is very common but is not matched correctly.
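For reference, the behaviour I would expect from the `domain` option can be sketched as a small standalone check. This is only an illustration of the intended semantics, not adblockparser's code: `domain_option_allows` is a hypothetical helper, and it assumes that a domain entry also covers its subdomains and that exclusions always win (it ignores any precedence between nested domains).

```python
def domain_option_allows(option_value, document_domain):
    """Return True if a rule with ``domain=<option_value>`` applies on
    ``document_domain``. Hypothetical helper, simplified semantics."""
    included, excluded = [], []
    for entry in option_value.split("|"):
        if entry.startswith("~"):
            excluded.append(entry[1:])  # negated domain: rule must not apply here
        else:
            included.append(entry)

    def covers(rule_domain):
        # Assumption: an entry covers the domain itself and its subdomains.
        return (document_domain == rule_domain
                or document_domain.endswith("." + rule_domain))

    if any(covers(d) for d in excluded):
        return False
    # With only negated entries, the rule applies everywhere else.
    return not included or any(covers(d) for d in included)
```

With this reading, `domain_option_allows("~example.com", "otherdomain.com")` is true and `domain_option_allows("~example.com", "example.com")` is false, which is what the failing test cases assert.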

Here's a snippet that makes the tests fail when added to "test/test_parsing.py":

    "adv$domain=~example.com": [
        ("http://example.net/adv", {'domain': 'otherdomain.com'}, True),
        ("http://somewebsite.com/adv", {'domain': 'example.com'}, False),
    ],

nbraem commented 9 years ago

How about encoding the options in the regex itself? When matching, you could build a single string that includes the specified options and then match it against one big regex that covers all the rules.

Here's what I mean for the rule "adv$domain=~example.com":

    import re

    # matches
    re.match(r'u<.*adv.*>d<.*?(?<!example\.com)>', 'u<http://example.net/adv>d<otherdomain.com>')
    # does not match
    re.match(r'u<.*adv.*>d<.*?(?<!example\.com)>', 'u<http://example.net/adv>d<example.com>')

kmike commented 9 years ago

@nbraem thanks for the report, I'll take a look.

> How about adding the options to the regex?

What are the benefits of creating such regexes?

nbraem commented 9 years ago

The benefit is that you don't need additional logic to check the options rule by rule.
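To make that concrete, here is a rough sketch of how several rules could be folded into one pattern so that a single `match` call covers both the URL and the domain option. The helper names (`rule_to_pattern`, `compile_rules`, `encode`) are made up for illustration, the `u<...>d<...>` encoding is the one from my earlier example, and only a single negated `domain=~...` option is handled.

```python
import re

def rule_to_pattern(rule):
    """Translate 'text$domain=~a.com' into a regex over 'u<url>d<domain>'.
    Hypothetical helper: supports plain text rules and one negated domain."""
    text, _, options = rule.partition("$")
    url_part = ".*" + re.escape(text) + ".*"
    domain_part = ".*"  # no domain option: any domain is accepted
    if options.startswith("domain=~"):
        excluded = re.escape(options[len("domain=~"):])
        # Negative lookbehind: the domain field must not end with the excluded domain.
        domain_part = ".*?(?<!" + excluded + ")"
    return "u<" + url_part + ">d<" + domain_part + ">"

def compile_rules(rules):
    # One alternation for all rules, so matching needs no per-rule option logic.
    return re.compile("|".join("(?:" + rule_to_pattern(r) + ")" for r in rules))

def encode(url, domain):
    # Build the single string that carries both the URL and the option value.
    return "u<{}>d<{}>".format(url, domain)

combined = compile_rules(["adv$domain=~example.com", "banner"])
blocked = combined.match(encode("http://example.net/adv", "otherdomain.com")) is not None
allowed = combined.match(encode("http://example.net/adv", "example.com")) is None
```

One caveat with this sketch: the lookbehind is a plain suffix check, so a domain like "notexample.com" would be excluded as well. A real implementation would need a stricter boundary.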

kmike commented 9 years ago

Sorry for the long delay, and thanks for the bug report! It should be fixed now.