w3c / webextensions

Charter and administrivia for the WebExtensions Community Group (WECG)
Other
576 stars 50 forks source link

Proposal: Decode `regexFilter` matches in Declarative Net Request as query parameters #636

Open oliverdunk opened 2 weeks ago

oliverdunk commented 2 weeks ago

As mentioned in https://github.com/w3c/webextensions/issues/302, many sites add an additional redirect to external links. These interstitials often include the destination URL as a query parameter with URI component encoding:

Examples

https://www.google.com/url?q=https%3A%2F%2Fexample.com%2F%3Ftest https://l.facebook.com/l.php?u=https%3A%2F%2Fexample.com%2F%3Ftest https://steamcommunity.com/linkfilter/?u=https%3A%2F%2Fexample.com%2F%3Ftest https://forum.donanimhaber.com/externallinkredirect?url=https%3A%2F%2Fexample.com%2F%3Ftest

Current Situation

A simple browser.declarativeNetRequest rule to skip these redirects would be as follows:

{
  "id": 1,
  "action": {
    "type": "redirect",
    "redirect": {
      "regexSubstitution": "\\1"
    }
  },
  "condition": {
    "regexFilter": "https://steamcommunity.com/linkfilter/\\?u=([^&]*).*",
    "resourceTypes": ["main_frame"]
  }
}

However, this does not decode the query parameter and therefore fails to properly redirect. For example, visiting https://steamcommunity.com/linkfilter/?u=https://example.com?test%26hello does not properly decode the %26 in the destination URL.

Proposal

Add a decodeRegexMatches property in the Redirect or RuleCondition dictionaries. This would process each match through decodeURIComponent before performing the regexSubstitution.

Limitations

This does not address all possible types of encoding. However, I wasn't able to find many exceptions - only one in fact - and this covers the sites which extensions like Privacy Badger were handling in Manifest V2.

Ideally we would have a more complete API, supporting things like only decoding certain matches or applying different types of processing, perhaps in a given order. However, this proposal was written with the hope of keeping the implementation complexity down so this can move forward sooner.

ghostwords commented 2 weeks ago

Thanks for the proposal! This is a solution to the problem in #302.

Add a decodeRegexMatches property in the Redirect or RuleCondition dictionaries.

decodeRegexMatches should probably live together with regexSubstitution, because that's the one place it matters. So, add the property to the Redirect dictionary only?

Also, for some future-proofing, we should have decodeRegexMatches take an array of strings, where at the moment there is one possible value:

    "redirect": {
      "regexSubstitution": "\\1",
      "decodeRegexMatches": ["decodeURIComponent"]
    }

This looks good to me otherwise.

ghostwords commented 2 weeks ago

This would process each match through decodeURIComponent before performing the regexSubstitution.

This would process each captured group referenced by regexSubstitution?

oliverdunk commented 2 weeks ago

So, add the property to the Redirect dictionary only?

No strong feelings on this, I think that's a reasonable justification.

Also, for some future-proofing, we should have decodeRegexMatches take an array of strings

I also don't feel strongly here. If we don't do it to begin with, I think we could quite easily add that later without causing too many issues.

This would process each captured group referenced by regexSubstitution?

Right, so \\1 would now be the equivalent of decodeURIComponent(\\1). I think that is sufficient but URL encoding is hard so if anyone notices certain decoding we would want to do differently, definitely keen to hear it. Ultimately I think we just want to match the encoding these sites are using.

tophf commented 2 weeks ago

It might make sense to indicate the specific captured groups that need to be decoded e.g. "decodeURL": [1].

P.S. Suggesting to replace Declarative Net Request with declarativeNetRequest in the description to make this issue discoverable via search. Might also make sense to add a DNR label to all issues that contain Declarative, DNR, declarativeNetRequest.

oliverdunk commented 2 weeks ago

Thanks for the thoughts @tophf!

It might make sense to indicate the specific captured groups that need to be decoded e.g. "decodeURL": [1].

I briefly mention this under "Limitations". I'm supportive of the ability to do that, but I think we might want to leave it out for the MVP. Otherwise it opens other questions (can you apply different decodings to different matches etc.) and I think this quickly becomes a larger proposal.

Updated the description and added an existing DNR label.

Rob--W commented 2 weeks ago

Add a decodeRegexMatches property in the Redirect or RuleCondition dictionaries.

This should be part of rule.action.redirect, not of rule.condition. The only way that it could influence the condition is if the resulting transformation would become an invalid URL, in which case you may want to fall back. But the possibility of 30000 "try to transform the URL and fall back to the next" seems very heavy, whereas I would be more comfortable with implementation complexity (and associated memory/CPU) when it is part of the (final) (redirect) action.

To enable future extensions, I would recommend a syntax that allows specifying separate substitutions for each individual match. The bare minimum is to support this through numbered capturing groups (\\1 in your example). Another option could be to support this through named capturing groups ((?<name>pattern)). Named capturing groups is supported by RE2 (source: re2 wiki: Syntax, used by Chromium) and RegExp in JavaScript (source: mdn: Named capturing group, used by Firefox (source) and Safari (source)). Note: the named capturing group feature specifies the syntax used for the matching part, not for the substitution. For substitution, you could draw inspiration from the named backreference syntax, \k<name> (mdn). Although documented as not supported by RE2, this does not have to be an implementation-limiting factor, because the actual backreference substitution can be implemented independently of the regular expression engine (which Firefox and Safari do, see the sources linked before). Chrome does currently use RE2's replace implementation (source), but does not have to. In fact, if it wanted to implement the capability requested here, it most likely has to re-implement substitution independently of the RE2 engine.

Here is an example that can be extended in the future. I'm suggesting an array syntax here instead of a dictionary in case (future) operations want to be dependent on the result of previous transformations.

    "redirect": {
      "regexSubstitution": "\\1",
      "regexSubstitutionTransforms": [
        { "group": 1, "operation": "decodeURIComponent" }
      ]
    }