[Closed] Yuki2718 closed this issue 4 years ago
Really sorry, I've made it in wrong place. Can you move this to uBlock issues?
Concerning the ones with an underscore (`_`), this was fixed in 1.28.0; see https://github.com/gorhill/uBlock/commit/01b1ed9a982965378d732ab0cb4bcd68727fe910#comments (I will reproduce it here since I had a hard time finding that comment):
Sorry, bad commit message -- many obvious English typos.
Additionally, I couldn't remember what I meant to mention in the commit message, and as is often the case I remembered not long after I pushed the commit to GitHub, so here:
I found a long-standing issue in how some static network filters were previously erroneously parsed: those which start with an underscore and which uBO mistook for pure hostname filters when they were not. Examples from EasyList:
```
_468.gif
_468.htm
_728.htm
_ads.cgi
_ads.html
_adverts.js
_rebid.js
```
The above filters were obviously not meant to be parsed as pure hostname filters. This has been fixed in the above commit: a filter starting with an underscore (a valid hostname character) will no longer be considered a "pure hostname" filter. Before the fix, the filters above ended up being stored in an HNTrie, meaning they would never match as intended by the filter authors.
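A minimal sketch of the corrected classification, in Python. This is illustrative only, not uBO's actual code; the function name and the accepted character set are assumptions:

```python
import re

# Illustrative sketch (assumed names/charset, not uBO's actual code):
# classify a static network filter pattern. A pattern made only of
# hostname characters but *starting* with an underscore, like
# EasyList's "_468.gif", must not be classified as a pure hostname.
def is_pure_hostname(pattern):
    # Pattern must consist solely of valid hostname characters.
    if not re.fullmatch(r'[0-9a-z._-]+', pattern):
        return False
    # The fix described above: underscore is a valid hostname character,
    # but a pattern starting with one is a plain substring pattern.
    return not pattern.startswith('_')

print(is_pure_hostname('example.com'))  # True
print(is_pure_hostname('_ads.cgi'))     # False
```

With this check, `_468.gif` and friends fall through to the generic pattern-matching code path instead of being stored in the hostname trie.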
Another issue was the incorrect parsing of some hosts files, for example:
https://raw.githubusercontent.com/lennylxx/ipv6-hosts/master/hosts
Specifically, lines with `##` were parsed as cosmetic filters. This has also been fixed in the above commit: instances of `##` with a space afterward will be parsed as comments.
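The hosts-file fix can be sketched as follows. This is assumed, simplified logic, not uBO's actual parser: `##` normally introduces a cosmetic filter (`hostnames##selector`), but `##` followed by a space cannot be a valid cosmetic filter, so such a line is treated as a comment instead:

```python
# Illustrative sketch (assumed logic, not uBO's actual parser) of
# classifying a single filter-list line.
def classify_line(line):
    line = line.strip()
    # '!' comments (ABP syntax) and '# ' comments (hosts-file syntax).
    if line.startswith('!') or line.startswith('# '):
        return 'comment'
    pos = line.find('##')
    if pos != -1:
        # The fix described above: '## ' cannot be a cosmetic filter.
        if line[pos + 2:pos + 3] == ' ':
            return 'comment'   # e.g. "## Google Services" in a hosts file
        return 'cosmetic'      # e.g. "example.com##.ad-banner"
    return 'network'

print(classify_line('## Google Services'))  # comment
print(classify_line('example.com##.ad'))    # cosmetic
```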
For the cases without underscores, this is by design; see https://github.com/gorhill/uBlock/wiki/Static-filter-syntax#hosts-files:
> So in uBO, any pattern which can be wholly read as a valid hostname will be assumed to be equivalent to a filter of the form `||example.com^`.
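The documented equivalence can be illustrated like this. The helper name is hypothetical and the hostname check is deliberately coarse:

```python
import re

# Sketch of the equivalence documented above (hypothetical helper):
# a pattern that reads wholly as a valid hostname is treated as if it
# were written with the hostname-anchor syntax ||hostname^.
def as_canonical_filter(pattern):
    is_hostname_like = re.fullmatch(r'[0-9a-z_-]+(\.[0-9a-z_-]+)*', pattern)
    # Per the fix above, a leading underscore disqualifies the pattern
    # from being read as a pure hostname.
    if is_hostname_like and not pattern.startswith('_'):
        return '||' + pattern + '^'
    return pattern

print(as_canonical_filter('example.com'))  # ||example.com^
print(as_canonical_filter('_ads.cgi'))     # _ads.cgi (left as a plain pattern)
```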
@gorhill Thx, that cleared up most of the issue. One thing I'm still not convinced about: I can hardly believe `ads` or `imgcache` can be valid hostnames. Would there be any problem in making uBO not treat a word as a hostname if it is 1) a single word without a period, and 2) not included in the public suffix list?
> 2) not included in the public suffix list?
By PSL rules, any single word is a public suffix, even when not on the list: https://publicsuffix.org/list/
The PSL algorithm says:
- If no rules match, the prevailing rule is "*".
So `ads` and `imgcache` can be TLDs, as per the above rule.
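The quoted PSL rule can be sketched in a few lines. The rule set here is a tiny illustrative subset (a real implementation would load the full list from publicsuffix.org, and would also handle wildcard and exception rules, which are omitted here):

```python
# Sketch of the Public Suffix List lookup rule quoted above, with a
# tiny assumed rule subset. The key point: "If no rules match, the
# prevailing rule is '*'", i.e. the rightmost label is a public suffix.
PSL_RULES = {'com', 'co.uk', 'github.io'}   # illustrative subset only

def public_suffix(domain):
    labels = domain.split('.')
    # Try the longest candidate suffix first.
    for i in range(len(labels)):
        candidate = '.'.join(labels[i:])
        if candidate in PSL_RULES:
            return candidate
    # "If no rules match, the prevailing rule is '*'."
    return labels[-1]

print(public_suffix('example.com'))   # 'com'  (explicit rule)
print(public_suffix('ads'))           # 'ads'  (wildcard fallback)
print(public_suffix('imgcache'))      # 'imgcache'
```

This is why a bare word like `ads` reads as a valid public suffix, and hence a valid hostname, under the algorithm.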
I think trying to change a behavior that has been in place for years, since the beginning really, is a bad idea. Filters such as `ads` and `imgcache` typically do not occur in filter lists because, as per ABP filter syntax, `ads` is equivalent to `*ads*`, which would be an inefficient, untokenizable filter, and filter list maintainers know to avoid that sort of filter.
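The tokenization point can be sketched as follows. This is an assumed, simplified model, not uBO's actual tokenizer: a pattern is "tokenizable" when it contains a literal run of characters not adjacent to a wildcard, letting the engine index the filter under that run and test it only against URLs containing it:

```python
import re

# Illustrative sketch (assumed, simplified -- not uBO's tokenizer):
# find a literal run usable as an indexing token. A bare word such as
# "ads" is equivalent to "*ads*", whose only run touches wildcards on
# both sides, so the filter would have to be evaluated on every request.
def good_token(pattern):
    for m in re.finditer(r'[0-9a-z]+', pattern):
        before = pattern[m.start() - 1] if m.start() > 0 else ''
        after = pattern[m.end()] if m.end() < len(pattern) else ''
        if before != '*' and after != '*':
            return m.group()
    return None

print(good_token('/banner/*/ads/'))  # 'banner'
print(good_token('*ads*'))           # None: untokenizable
```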
If you really want to use `ads` as an anywhere-pattern in your custom filters, I was going to suggest you use `/\bads\b/`, since regex syntax allows you to explicitly declare word boundaries; but I found out uBO does not deal with `\b` when it tries to extract a token from a regex. I will change the code to fix this.
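The fix being described might behave along these lines (this is an assumption, not the actual change): `\b` is a zero-width word-boundary assertion, so it should not prevent the literal run next to it from serving as an indexing token:

```python
import re

# Speculative sketch (assumed, simplified) of extracting an indexing
# token from the body of a regex filter such as /\bads\b/.
def token_from_regex(body):
    # Replace zero-width \b assertions with a neutral sentinel so the
    # literal run next to them no longer looks adjacent to regex syntax.
    cleaned = body.replace(r'\b', '\x00')
    # Accept a literal run only when it is not adjacent to remaining
    # regex metacharacters (a very coarse check, for illustration).
    for m in re.finditer(r'[0-9a-z]+', cleaned):
        before = cleaned[m.start() - 1] if m.start() > 0 else '\x00'
        after = cleaned[m.end()] if m.end() < len(cleaned) else '\x00'
        if before in '\x00/' and after in '\x00/':
            return m.group()
    return None

print(token_from_regex(r'\bads\b'))  # 'ads'
print(token_from_regex(r'ads\d'))    # None: run is adjacent to \d
```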
> By PSL rules, any single word is a public suffix, even when not on the list. https://publicsuffix.org/list/
Thanks, didn't know that.
> I think trying to change a behavior that has been in place for years, since the beginning really, is a bad idea. Filters such as `ads` and `imgcache` typically do not occur in filter lists, because as per ABP filter syntax, `ads` is equivalent to `*ads*`, which would be an inefficient, untokenizable filter and filter list maintainers know to avoid that sort of filters. If you really want to use `ads` as an anywhere-pattern in your custom filters, I was going to suggest you use `/\bads\b/`, since regex syntax allows you to explicitly declare word boundaries, but I found out uBO does not deal with `\b` when it tries to extract a token from a regex. I will change the code to fix this.
All right, I don't need such a rule myself, but I occasionally see such rules in minor (often low-quality) lists. Good to know this led to an improvement in uBO.
```
grep -P '^(\d+\.){3}\d+ [a-z0-9]+(\s|$)' * > ../word.txt
```
URL(s) where the issue occurs
https://www.shush.se/
(I forgot the actual URL where I found the issue, but found this one for demonstration. The issue was anyway about my custom filter.)
Describe the issue
Although special characters such as `?` or `.` do not compose a word, omitting them from a rule leads to no matching.
For the above example URL, temporarily add
and see if requests to `shush.se/_ads.js?` are blocked. It won't be blocked; however, any of these block:
To eliminate the possibility that it's specific to `ads`, I also tested these on the same URL; they didn't block the corresponding requests:
but these did:
So apparently a leading `/` is different from `_`, and the former can't be omitted if the trailing special character was omitted. A question may be whether this actually matters. EasyList/EasyPrivacy usually keep trailing special characters, but I could find some exceptions, e.g.
though IDK whether the special characters are attached in the corresponding requests.
Screenshot(s)
Versions
Settings
Notes