serverless-dns / blocklists

An opinionated collection of blocklists for RethinkDNS.
https://rethinkdns.com/configure
Mozilla Public License 2.0
85 stars 25 forks source link

Incorrect behavior from ABP regex #48

Open afontenot opened 2 years ago

afontenot commented 2 years ago

When using the EasyPrivacy list, the resulting domain set includes soundcloud.com, which is incorrect.

This is the result of the rule: ||soundcloud.com^$ping

Test case:


import re

# https://github.com/serverless-dns/blocklists/blob/f57b1d/download.py#L206
rexpr = re.compile(r"^(\|\||[a-zA-Z0-9])([a-zA-Z0-9][a-zA-Z0-9-_.]+)((\^[a-zA-Z0-9\-\|\$\.\*]*)|(\$[a-zA-Z0-9\-\|\.])*|(\\[a-zA-Z0-9\-\||\^\.]*))$", re.M)

txt = "||soundcloud.com^$ping"

print(re.search(rexpr, txt).groups())

Result:

('||', 'soundcloud.com', '^$ping', '^$ping', None, None)
ignoramous commented 2 years ago

Thanks.

This is the result of the rule: ||soundcloud.com^$ping

Taking domains out of ABP was always going to be tricky. I don't know why we do it... but here we are. I am unfamiliar with the ABP format (only Santosh has worked on this code), and so don't mind me asking... what'd you expect in the final entry be when something like ||soundcloud.com^$ping exists in the original list?

afontenot commented 2 years ago

If I understand what this code is supposed to be doing correctly, I'd expect the output to be blank, because the regex shouldn't be matching this filter at all. This code is supposed to extract only cases where the entire domain is supposed to be blocked, but ABP filters support much more.

So for example, the regex (correctly) does not match the following filter from the same list:

||discovery.com^*/events

The soundcloud filter is supposed to block only ping requests, i.e. requests that originate from <a ping or Navigator.sendBeacon(). A domain-based blocklist can't block these requests, and so the domain should be left out of the list entirely.

An important corner case is what you should do with ^$third-party blocklist entries (which are also currently included because of the regex). Technically, these domains are not supposed to be blocked if the user visits them directly. But the overwhelming majority seem to be tracking domains that it's hard to imagine anyone ever visiting deliberately.

I think I lean towards leaving them out and anticipating users who want to block domains like this to use a more thorough, domain-oriented blocklist.

ignoramous commented 1 year ago

@afontenot Is this a desirable fix?

import re

rexpr = re.compile(r"^(\|\||[a-zA-Z0-9])([a-zA-Z0-9][a-zA-Z0-9-_.]+)((\^[a-zA-Z0-9\-\|\$\.\*]*)|(\$[a-zA-Z0-9\-\|\.])*|(\\[a-zA-Z0-9\-\||\^\.]*))$", re.M)

txt = "||soundcloud.com^$ping"
# or: ||soundcloud.com^

# ('||', 'soundcloud.com', '^$ping', '^$ping', None, None)
# or: ('||', 'soundcloud.com', '^', '^', None, None)
g = re.search(rexpr, txt).groups()

if g is not None and len(g) >= 2:
    if len(g) >= 3 and g[3] is not None and len(g[3]) > 1:
        # is an abp url entry
        continue
    else:
        # there's but just the domain name
        domain = g[1]
else:
    # no matches
    continue
afontenot commented 1 year ago

@ignoramous That does look like it would solve the problem, but if I understand the intent correctly and the limitations of domain based blocking versus the ABP syntax, I don't see why you wouldn't just drastically simplify the regex like so:

import re

rexpr = re.compile(r"^\|\|[^/\n]+\^$", re.M)

txt = """||soundcloud.com^$ping
||soundcloud.com^
||domain1.com^
.soundcloud.com^
||soundcloud.com/path/^
||subdomain.domain2.com^"""

assert rexpr.findall(txt) == ['||soundcloud.com^', '||domain1.com^', '||subdomain.domain2.com^']

Seems to work okay. Maybe there are some corner cases I'm not considering?

ignoramous commented 1 year ago

I don't see why you wouldn't just drastically simplify the regex like so

LGTM. Thanks. Want to send a pull request? (:

The current regex, I beleive, also checks if the entries are valid abp syntax. Though, I'm all for simplicity.

ignoramous commented 1 year ago

Added: https://github.com/serverless-dns/blocklists/commit/de0db78de262a2ca5d588e415523617562ec726e#diff-0754e2389b0a4982ce0106c260cf218699e6f2061105043cc24acc7398d1d38bR273

Hopefully, it works just fine.

Goes without saying: Thanks a bunch for your inputs, appreciate it (:

ignoramous commented 1 year ago

Reopening, since no matter the regex I try, some list or the other trips it up. In the interim, I've replaced one or two abp lists with their hostfile equivalent.

afontenot commented 1 year ago

If you have an example of a broken filter, I could have a look at it.

ignoramous commented 1 year ago

I think I am done with the blocklist rewrite now and can focus on tests.

If you have an example of a broken filter, I could have a look at it.

There's a bunch that break. I tried iterating upon the abp regex, but my python (and regex) skills eventually failed me as the failures kept appearing for different abp files. I am going to automate a test for this. Give me a few days, please.

ignoramous commented 4 days ago

A user writes,

Reproduce:

Expected:

Actual:

Presumed reason: The block list includes only one related entry:

I assume the ping option of this entry is misinterpreted and incorrectly blocks the entire domain.

Workaround: Trusting the individual domains sometimes helps, but sometimes doesn't. Sometimes the request appears in the log as maybe blocked, which I don't know the meaning of.

I'm not familiar with block list syntax, but the related ping type option is described here: https://help.adblockplus.org/hc/en-us/articles/360062733293-How-to-write-filters#type-options

Suggested change:


From: https://github.com/celzero/rethink-app/issues/1713#issuecomment-2390998554

boernsen-development commented 4 days ago

Thanks for forwarding.

I would like to add that I think using pure hostlist equivalents for blocklists (as you attempted already, @ignoramous), if existing, would be preferable as this avoids dealing with unsupported syntax at all. I don't know whether these exist for the block lists currently provided though.

For the EasyPrivacy list, I expected to find it here, but I couldn't figure out which one it is or if there is any.

Anyway, for future support of custom block lists (if planned), it would still be an advantage if only pure domains would be imported.