Open Rob--W opened 1 year ago
Chrome's current implementation is hostile for anything but trivial expressions because its internal limit manifests randomly, see this example:
regexFilter: Array(111).fill('x').join('')
works,regexFilter: Array(112).fill('x').join('')
doesn't
A better approach might be to limit the average compilation + matching
time by testing each regexFilter against a set of 100-1000 URLs when the extension is installed/reloaded. The URLs may be picked from browser history + randomly generated. The results should be weighted by an internal CPU performance test to conform the results to a known scale.
It'd be also helpful if we can specify both urlFilter and regexFilter so that urlFilter
is used as a simple check before proceeding to regexFilter.
About the spec of regexpFilter
, I find that the biggest issue is that, while WebExtension a JavaScript in a JavaScript world, the regexpFilter
use a different syntax, with the limitation mentioned that are only documented in C++ code and rely on a specific implementation of the regexp engine.
Let's see why this is a problem:
isRegexpSupported
, that just translate the opaque requirement to YES or NO. Note that point 2 above isn't covered by this, because this API is in the WebExtenstion API.If you ask me, the only sensible solution here would be to make the regexpFilter
be a JavaScript regular expression.
It'd be also helpful if we can specify both urlFilter and regexFilter so that
urlFilter
is used as a simple check before proceeding to regexFilter.
This has been requested before, e.g. at crbug.com/1249915. I'm not opposed, but support for it would not make it easier to arrive at a decision for the desired syntax of regexFilter
. The constrains for regexFilter
are largely influenced by performance considerations. From that perspective:
urlFilter
is easier to optimize than regexFilter
, due to the limited syntax of urlFilter
compared to the expressive syntax of regexFilter
. Supporting them together will only improve performance if the urlFilter is sufficiently specific that it would eliminate DNR rules from matching attempts. E.g. urlFilter: "*"
or urlFilter: "http"
may as well be omitted because they match any input. On the other hand, if the typical usage is more like urlFilter: "||example.com", regexFilter: "somecomplexregexhere"
, then it could improve the runtime performance of DNR+regexFilter.
- How do you handle the situation where an extension developer want to provide users with tools to validate DNR rules? There is no RE2 engine in the browser. There is just an API
isRegexpSupported
, that just translate the opaque requirement to YES or NO. Note that point 2 above isn't covered by this, because this API is in the WebExtenstion API.
The isRegexpSupported
method returns whether the regexFilter
were to be accepted. This method itself is helpful, along with other APIs that extensions can use to validate DNR rules (declarativeNetRequest.onRuleMatchedDebug
, declarativeNetRequest.testMatchOutcome
, declarativeNetRequest.getMatchedRules
). Extensions wishing to validate complete DNR rules (not just regexFilter
) can use declarativeNetRequest.updateSessionRules
, which rejects with error messages if the input is considered invalid.
These validation tools being part of the extension API is a non-issue. The lack of clarity on the syntax of regexFilter
is the problem here - hence this issue.
If you ask me, the only sensible solution here would be to make the
regexpFilter
be a JavaScript regular expression.
I can see how attractive that is to extension developers and filter list maintainers, because that format is somewhat well-documented and generally well understood. That advantage in the API design should also be weighted against the runtime impact when the API is used in practice. Besides memory cost, even seemingly small regular expressions can result in catastrophic performance (ReDoS).
If we are able to come up with a subset of regular expressions that list authors (and other DNR extensions) actually need, then we can try to reach consensus on the desired syntax that are supported by the DNR API across all browsers. A limited target is easier to optimize for than something as broad as "JS regex". For example, unicode
does not make any sense in regexFilter
, because a normalized URL cannot have non-ASCII characters. Unicode characters in canonical URLs are punycode-encoded (hosts) or URL-encoded.
@gorhill shared a list of regular expressions that are incompatible with Chrome's current declarativeNetRequest regular expression handling: https://github.com/uBlockOrigin/uBOL-issues/issues/15#issuecomment-1332760988
The rejected regexes are reported at the dev tools console of uBO Lite when installing or reloading the extension. I am not at the desktop computer at the moment.
@dotproto Here are the regexes currently rejected by the DNR engine at runtime -- none of these have lookaround part since such regex filters are discarded at compile time:
(https?:\/\/)104\.154\..{100,} (https?:\/\/)104\.197\..{100,} (https?:\/\/)104\.198\..{100,} (https?:\/\/)130\.211\..{100,} (https?:\/\/)142\.91\.159\..{100,} (https?:\/\/)213\.32\.115\..{100,} (https?:\/\/)216\.21\..{100,} (https?:\/\/)217\.182\.11\..{100,} (https?:\/\/)51\.195\.31\..{100,} ^https?:\/\/.*\.(club|news|live|online|store|tech|…pace|network|live|work|systems|ml|world|life)\/.* ^https?:\/\/.*\/[a-z0-9A-Z_]{2,15}\.(php|jx|jsx|1ph|jsf|jz|jsm|j$) ^https:\/\/[a-z]{2,14}\.wp\.pl\/[a-zA-Z0-9_-]{200,915}$ ^https?:\/\/www\.myservices\.equifax\.ca\/TSPD\/[0-9a-f]{30,} \/[0-9a-z]{8,10}\?shu=[0-9a-z]{150,} \/(?:[0-9a-z]{7,25}-){9,13}[0-9a-z]{10,15}\/(?:[0-9a-z]+\/)+index\.php ^https:\/\/[0-9a-z]{3,}\.[-a-z]{10,}\.(?:li[fv]e|top|xyz)\/[a-z]{8}\/\?utm_campaign=\w{40,} \/[A-Z]\/[-0-9a-z]{5,}\.com\/(?:[0-9a-f]{2}\/){3}[0-9a-f]{32}\.js$ \.com\/[-_0-9a-zA-Z]{4,}\/[-\/_0-9a-zA-Z]{25,}$ ^https?:\/\/.*\/sw\.js\?[a-zA-Z0-9%]{50,} ^https?:\/\/.*\.(club|bid|biz|xyz|site|pro|info|on…|space|network|live|systems|ml|world|life|co)\/.* ^https?:\/\/.*(com|net|top|xyz)\/(bundle|warning|s…vicon|star|header)\.(png|css)\?[A-Za-z0-9]{30,}.* ^https?:\/\/[a-z]{8,15}\.top\/[-a-z]{4,}\.css\?aHR0c[\/0-9a-zA-Z]{33,}=?=?$ ^https?:\/\/[a-z]{8,15}\.xyz\/[-a-z]{4,}\.css\?aHR0c[\/0-9a-zA-Z]{33,}=?=?$ ^https?:\/\/[a-z]{8,15}\.top\/[a-z]{4,}\.png\?aHR0c[\/0-9a-zA-Z]{33,}=?=?$ ^https?:\/\/[a-z]{8,15}\.xyz\/[a-z]{4,}\.png\?aHR0c[\/0-9a-zA-Z]{33,}=?=?$ ^https?:\/\/(?:www\.|[0-9a-z]{7,10}\.)?[-0-9a-z]{5,}\.com\/(?:[0-9a-f]{2}\/){2,3}[0-9a-f]{32}\.js ^https:\/\/[0-9a-f]{10}\.[0-9a-f]{10}\.com\/[0-9a-f]{32}\.js$ ^https?:\/\/[-a-z]{6,16}\.com?\/[a-d][-\.\/_A-Za-z…_0-9a-zA-Z][-\.\/_A-Za-z][-\/_0-9a-zA-Z]{22,162}$ ^https?:\/\/[-a-z]{6,16}\.info\/[a-d][-\.\/_A-Za-z…_0-9a-zA-Z][-\.\/_A-Za-z][-\/_0-9a-zA-Z]{22,162}$ ^https?:\/\/[-a-z]{6,16}\.pro\/[a-d][-\.\/_A-Za-z]…_0-9a-zA-Z][-\.\/_A-Za-z][-\/_0-9a-zA-Z]{22,162}$ ^https?:\/\/[-a-z]{6,16}\.xyz\/[a-d][-\.\/_A-Za-z]…_0-9a-zA-Z][-\.\/_A-Za-z][-\/_0-9a-zA-Z]{22,162}$ ^https:\/\/cdn\.jsdelivr\.net\/npm\/[-a-z_]{4,22}@latest\/dist\/script\.min\.js$ (https?:\/\/)\w{30,}\.me\/\w{30,}\. ^https?:\/\/s3\.*.*\.amazonaws\.com\/[a-f0-9]{45,}\/[a-f,0-9]{8,10}$ ^https?:\/\/[0-9a-f]{50,}\.s3\.amazonaws\.com\/[0-9a-f]{10}$ ^https?:\/\/s3\.us-east-1\.amazonaws\.com\/[0-9a-f]{50,}\/[0-9a-f]{10}$ ^https?:\/\/[a-z]{8,15}\.top(\/(?:\d{1,5}|0NaN|art…|pages?|static|view|web|wiki)){1,4}(?:\.html|\/)$ ^https?:\/\/[a-z]{8,15}\.xyz(\/(?:\d{1,5}|0NaN|art…|pages?|static|view|web|wiki)){1,4}(?:\.html|\/)$ ^https?:\/\/[-0-9a-z]{5,}\.com\/[0-9a-z]{8,10}\?key=[0-9a-f]{32}$
Here a full list of unsupported RegExps (with errors codes from Chrome) from our AdGuard filters:
We set some limits (chosen empirically) for RegExps groups and maximum length here to try to prevent memory overflow.
It looks like last report was a bit out of date, so I double checked it in the latest version of chrome (version 110.0.5481.96 arm64) with our latest filters. Here it is:
First of all, these are regular expressions with negative lookahead.
Some of the rules can be rewritten using excludedRequestDomains
Most of the rules cannot be converted and it’s unclear how to rewrite them in order to fix the error.
We thought there might be some sort of global flag "/g" for the RE2 syntax, because it seems that in the PCRE engine the global flag allows you to increase the range group by several orders of magnitude, but we didn't find a global flag here.
In the case of Safari there are many more invalid regular expressions in the filters.
The most painful one is that Safari does not support {n,m}
quantifiers and |
.
Maybe instead of discussing just Chrome, we could use Safari as a base line and demonstrate what else we would like to have and why?
Here's how negative lookahead is typically used.
A rule would look like this:
/^(?!.*(allowed.com|allowed2.com)/$domain=website.com
This use case is covered by excludedRequestDomains
.
Here's a typical rule for this use case:
/yenihaberden.com\/d\/other\/(?!yeni-haber-youtube)/$domain=yenihaberden.com
It's tempting to rewrite it into a pair of rules (blocking + unblocking):
||yenihaberden.com/d/other/
@@||yenihaberden.com/d/other/yeni-haber-youtube/
However, it is not exactly the same as it unblocks the second location completely. Filter maintainers usually resort to using negative lookahead when completely unblocking something is a problem.
If we had something like excludedUrlFilter
, this would completely eliminate the need for negative lookahead.
{n,m}
quantifiermemoryLimitExceeded
There're quite a lot of regex rules that rely on this quantifier and some of them don't work even in Chrome due to rather strict memory limit.
I wish Safari added support for this quantifier. To Chrome people: please consider increasing the limit so that the rules from above could compile.
|
Again, lots of regex rules that use this. Missing support in Safari is problematic.
WebKit bug tracking this issue: https://bugs.webkit.org/show_bug.cgi?id=260970
I'd just like to add a specific issue I'm running into with this not being specified properly.
As an extension developer using DNR, the majority of rules aren't dynamic. They're static, and need to be bundled together with the extension. Doing that bundling is typically done through some good old command line build tools, like Webpack.
However, at the point of doing that bundling, there isn't really a good way to check the regex format. It isn't well specified as this issue says. The DNR APIs to check if it's valid like isRegexSupported
can only be used from within an MV3 extension. This means that there isn't a good way to make sure that we're only bundling valid rules with our extension.
This similarly makes it difficult for any filter list authors to have, for example, CI jobs that ensure that regex filters are valid.
At eyeo, we literally maintain a Node package which uses the C++ native extension that imports RE2 and recreates the memory limits used to see if Chrome will call a regex valid (https://www.npmjs.com/package/@eyeo/abp2dnr). This is of course quite fragile because the limits aren't specified, they're just implementation details. Any changes to RE2 or Chrome could bring our regex validation out of sync with Chrome's actual validation.
I noticed only negative lookahead was mentioned in the issue so just in case want to note that we actively use positive lookahead too.
The DNR API currently has three primary ways to match requests by their URL:
urlFilter
- match URL by a literal substring with optional wildcards; Chrome's docs for urlFilter, and also Firefox's source code docs for some undocumented cases.requestDomains
- match domain or superdomain by a literal string; Chrome's docs for requestDomains.regexFilter
- match URL by a regular expression; Chrome's docs for regexFilter.The format of
regexFilter
is poorly specified. This issue is about coming up with whatregexFilter
should support, potentially beyond what the current implementations offer, following the 2023-01-19 WECG meeting (meeting notes will be available once #343 is merged).In Chrome,
regexpFilter
is basically the syntax of the underlying RE2 library plus the additional implementation-dependent constraint that the memory usage of an individual regex may not exceed 2kb (source). Chrome heavily relied on the RE2 library for its implementation, not just for matching, but also for extra optimizations (source code comment in regex_rules_matcher.h). Chrome limits the number of regexFilter rules to 1000.In Safari, all DNR rules are internally converted to regular expressions and the maximum number of supported rules is 150k, without a specific smaller limit of regexFilter rules. Its supported regexp syntax is documented at https://developer.apple.com/documentation/safariservices/creating_a_content_blocker#3030754 . This linked documentation is not Safari's DNR docs, but the underlying Content blocker API that Safari uses. There is a human-readable high-level description of this content blocker API at https://webkit.org/blog/4062/targeting-domains-with-content-blockers/ . Internally, Safari compiles the regular expressions to a set of deterministic finite automatons (DFA). These DFAs are optimized and compiled to bytecode. An interpreter uses this bytecode to find the desired actions for a given URL. This implementation is not a generic regexp engine, so any regexp feature needs to be carefully examined before it can be supported.
In Firefox,
regexFilter
is not implemented yet. Ideally we can reach a resolution here so that extensions don't have to encounter breaking changes.In this issue, the goal is to determine the desired syntax of
regexFilter
, in a way that is useful to extension developers and feasible to implement optimally in web browsers.