Open afontenot opened 2 years ago
Thanks.
This is the result of the rule: ||soundcloud.com^$ping
Taking domains out of ABP was always going to be tricky. I don't know why we do it... but here we are. I am unfamiliar with the ABP format (only Santosh has worked on this code), and so don't mind me asking... what'd you expect in the final entry be when something like ||soundcloud.com^$ping
exists in the original list?
If I understand what this code is supposed to be doing correctly, I'd expect the output to be blank, because the regex shouldn't be matching this filter at all. This code is supposed to extract only cases where the entire domain is supposed to be blocked, but ABP filters support much more.
So for example, the regex (correctly) does not match the following filter from the same list:
||discovery.com^*/events
The soundcloud filter is supposed to block only ping requests, i.e. requests that originate from <a ping
or Navigator.sendBeacon()
. A domain-based blocklist can't block these requests, and so the domain should be left out of the list entirely.
An important corner case is what you should do with ^$third-party
blocklist entries (which are also currently included because of the regex). Technically, these domains are not supposed to be blocked if the user visits them directly. But the overwhelming majority seem to be tracking domains that it's hard to imagine anyone ever visiting deliberately.
I think I lean towards leaving them out and anticipating users who want to block domains like this to use a more thorough, domain-oriented blocklist.
@afontenot Is this a desirable fix?
import re
rexpr = re.compile(r"^(\|\||[a-zA-Z0-9])([a-zA-Z0-9][a-zA-Z0-9-_.]+)((\^[a-zA-Z0-9\-\|\$\.\*]*)|(\$[a-zA-Z0-9\-\|\.])*|(\\[a-zA-Z0-9\-\||\^\.]*))$", re.M)
txt = "||soundcloud.com^$ping"
# or: ||soundcloud.com^
# ('||', 'soundcloud.com', '^$ping', '^$ping', None, None)
# or: ('||', 'soundcloud.com', '^', '^', None, None)
g = re.search(rexpr, txt).groups()
if g is not None and len(g) >= 2:
if len(g) >= 3 and g[3] is not None and len(g[3]) > 1:
# is an abp url entry
continue
else:
# there's but just the domain name
domain = g[1]
else:
# no matches
continue
@ignoramous That does look like it would solve the problem, but if I understand the intent correctly and the limitations of domain based blocking versus the ABP syntax, I don't see why you wouldn't just drastically simplify the regex like so:
import re
rexpr = re.compile(r"^\|\|[^/\n]+\^$", re.M)
txt = """||soundcloud.com^$ping
||soundcloud.com^
||domain1.com^
.soundcloud.com^
||soundcloud.com/path/^
||subdomain.domain2.com^"""
assert rexpr.findall(txt) == ['||soundcloud.com^', '||domain1.com^', '||subdomain.domain2.com^']
Seems to work okay. Maybe there are some corner cases I'm not considering?
I don't see why you wouldn't just drastically simplify the regex like so
LGTM. Thanks. Want to send a pull request? (:
The current regex, I beleive, also checks if the entries are valid abp
syntax. Though, I'm all for simplicity.
Hopefully, it works just fine.
Goes without saying: Thanks a bunch for your inputs, appreciate it (:
Reopening, since no matter the regex I try, some list or the other trips it up. In the interim, I've replaced one or two abp
lists with their hostfile
equivalent.
If you have an example of a broken filter, I could have a look at it.
I think I am done with the blocklist rewrite now and can focus on tests.
If you have an example of a broken filter, I could have a look at it.
There's a bunch that break. I tried iterating upon the abp
regex, but my python (and regex) skills eventually failed me as the failures kept appearing for different abp
files. I am going to automate a test for this. Give me a few days, please.
A user writes,
Reproduce:
Expected:
Actual:
*.wo-cloud.com
requests getting blocked.Presumed reason: The block list includes only one related entry:
||wo-cloud.com^$ping
I assume the ping
option of this entry is misinterpreted and incorrectly blocks the entire domain.
Workaround:
Trusting the individual domains sometimes helps, but sometimes doesn't. Sometimes the request appears in the log as maybe blocked
, which I don't know the meaning of.
I'm not familiar with block list syntax, but the related ping type option is described here: https://help.adblockplus.org/hc/en-us/articles/360062733293-How-to-write-filters#type-options
Suggested change:
From: https://github.com/celzero/rethink-app/issues/1713#issuecomment-2390998554
Thanks for forwarding.
I would like to add that I think using pure hostlist equivalents for blocklists (as you attempted already, @ignoramous), if existing, would be preferable as this avoids dealing with unsupported syntax at all. I don't know whether these exist for the block lists currently provided though.
For the EasyPrivacy list, I expected to find it here, but I couldn't figure out which one it is or if there is any.
Anyway, for future support of custom block lists (if planned), it would still be an advantage if only pure domains would be imported.
When using the EasyPrivacy list, the resulting domain set includes
soundcloud.com
, which is incorrect.This is the result of the rule:
||soundcloud.com^$ping
Test case:
Result: