As far as I can tell the only differences in forgiving-base64 are that it allows you to omit padding and to have non-zero bits in the 2 or 4 "overflow" bits. I think we likely want to have a mode which supports such non-canonical input, but I'm very reluctant to make it the default: it is very common for security issues to arise from users assuming that parsers are strict about checking padding when in fact they are forgiving. I prefer to be strict by default and to be looser only as an opt-in.
Edit: handling of whitespace is also different; see discussion below.
Hmm. I'd be against shipping anything as the default which doesn't match the web platform default, i.e. which can't decode base64 data: URLs.
All other things being equal, I do think securer defaults are good.
My feeling here is that whether to align with forgiving base64 as used on the web is an empirical question: how often would a user encounter a base64 payload that requires a forgiving decoder? Is there any history around why such a forgiving decoder was needed in the first place? Was it a historical artifact of the web at the time that we've largely moved on from, or are such sloppy payloads still widespread?
I don't know; that'd be an interesting use counter to add. If we did so and changed the web defaults across browsers then yeah, that'd be one path.
I'd be against shipping anything as the default which doesn't match the web platform default
Why? Is there a specific confusion you are worried about which would be caused by various web platform APIs being slightly more forgiving in what they accept than the default for this API?
how often would a user encounter a base64 payload that requires a forgiving decoder?
It's pretty rare. As far as I am aware every major programming language includes the padding by default, as do, for example, the canvas `toDataURL` methods in Chrome, Safari, and Firefox. It is after all required by the RFC. And I am not aware of any way that you can end up with the extra bits being non-zero other than a.) an actively malicious producer or b.) leaking memory.
Why? Is there a specific confusion you are worried about which would be caused by various web platform APIs being slightly more forgiving in what they accept than the default for this API?
Yes. I think it's bad if web developers who want to implement the same semantics as browsers (pretty common, e.g. parsing data: URLs) need to reimplement the base64 decoder from scratch. Or worse, they try the language built-in one, and only after some painful experience realize that it doesn't match what they want.
I also think it's bad if `atob` and `ArrayBuffer.fromBase64()` give different results.
As a concrete example, I think it'd be bad if packages such as https://github.com/jsdom/data-urls or similar were not able to use this language feature.
I think it's bad if web developers who want to implement the same semantics as browsers (pretty common, e.g. parsing data: URLs) need to reimplement the base64 decoder from scratch.
That's a strong argument for supporting the same semantics as used by the web platform, which I definitely agree with, but not (that I can tell) an argument for making it the default. If you try to use the strict default and it doesn't work for you, the error message could easily say "if you want to skip validation of padding, pass `{ padding: false }`" or something to that effect, so I don't expect it to be very painful if encountered - and as mentioned above, I don't expect it to be encountered very often except with malicious input, which is precisely the input you want to reject.
I also think it's bad if `atob` and `ArrayBuffer.fromBase64()` give different results.
Well, `atob` gives a string, not a buffer, so they definitely are going to give different results. Other than that, the only difference would be that `atob` would accept some things that `fromBase64` would reject by default. That seems like it is not so severe a problem to me, unless there is some specific confusion you are worried about.
Chiming in with a real world example: AMP has a weird protobuf which we encode into some URLs, and for some reason, the Java printer for this protobuf omits the padding characters. I have investigated, but have no idea why this happens.
Incidentally, here are two cryptographers at Google who were surprised by the non-strictness of many implementations of base64 decoding and speculated that it had probably led to vulnerabilities: @SchmiegSophie and @FiloSottile.
I am pretty well convinced that we should default to the stricter thing, and have opt-in looseness for use cases such as jsdom's and AMP's which need it.
Can we make it an option for web browsers to switch the default (like Annex B or something)? I think web platform consistency is important and I'd rather not expose a second base64 decoder on the web, but if at least we kept the default the same in this API as on the rest of the web platform (relegating the non-web variant to a non-default option) I would be able to refrain from objecting to the proposal.
I would like to understand your reasoning better before commenting on that option. I didn't understand your above answer; as I said, it seemed like an argument for making the looser, less secure behavior available, but not making it the default, and I still don't understand why you want that enough to override the security concerns pointed out by cryptographers at your own employer.
Web platform consistency is important. Having different ways of performing base64 decoding exposed on the web platform is suboptimal, and making developers have to read the docs before they get consistency is IMO a nonstarter.
/cc @annevk as he has worked quite hard to get the web platform on the same page here in its various base64 APIs, and I imagine Mozilla might have similar concerns about exposing (much less making the default) one which does not match the rest of the web platform.
I agree that consistency is valuable, but this would hardly be the first time that a modern version of a legacy API took the opportunity to simultaneously choose stricter defaults. Most developers won't see a difference, and the ones who do - that is, those who actually do end up being given incorrectly padded input - should be thinking about whether they actually want to accept such inputs.
Chiming in with a real world example: AMP has a weird protobuf which we encode into some URLs, and for some reason, the Java printer for this protobuf omits the padding characters. I have investigated, but have no idea why this happens.
@jridgewell Has this been fixed since then (to use a different Java printer or something), or is there still ongoing dependence on loose decoders? How does the rest of the stack deal with the looseness?
@domenic I disagree that lack of consistency here is a hard blocker. It's a goal, other things being equal, we should shoot for. In this proposal, the other thing here is security, which is not equal, so the crux to me remains the claim from @bakkot that "most developers won't see a difference". If that turns out to be true, the stricter default seems strictly better.
I find the tweets pretty convincing evidence of the badness of the footgun. It seems totally reasonable to assume that byte-equality of the base64 input is equivalent to byte-equality of the output, and its not being so is pretty bad?
That said, having a host-configurable default is a compromise I can live with, but for a proposed language facility, it seems like a good opportunity to nudge future uses away from the non-bijective footgun. Ideally, if we can get away with it the right thing here seems to be to change the default for HTML to be the bijective one as well. While we can retrofit options bags or something to `atob`, how to do that for `data:` URLs and the like isn't clear to me. Where in the HTML spec is forgiving-base64 decode invoked for data URLs? The only direct invocation I see in the HTML spec is in `atob`, and the `data:` URL RFC doesn't talk about which decoding algorithm to use.
Finally, there's been one anecdote from AMP that there's a weird protobuf they've run into. I would like to hear some more.
I ran some tests to see what the non-browser ecosystem is doing. Given the input string `"SQ"` (non-padded):

- Node `Buffer.from("SQ", "base64")`: accepts it without complaint, giving back the right bytes
- Deno standard library: returns the right bytes
- js-base64: `Base64.isValid()` returns true and its `decode()`, `atob()`, and `toUint8Array()` functions return the right strings/bytes.
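For reference, a minimal sketch of the Node check described above, using Node's `Buffer` API:

```js
// Node's Buffer decoder accepts the unpadded input without complaint.
const bytes = Buffer.from("SQ", "base64");
console.log(bytes); // <Buffer 49> — the single byte for "I"
```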
I think the JavaScript ecosystem is pretty uniform here as to what base64 means to them, and TC39 should not try to use its position to change what's accepted practice and defaults.
If that turns out to be true, the stricter default seems strictly better.
I don't agree with that. I don't think stricter is better here. Accepting input that the rest of the JS ecosystem accepts is a good thing and we should require it. It's a tradeoff, IMO, between needless compliance with standards that people are not following anyway and alignment with the ecosystem. I think theoretical purity should be lower on the priority of constituencies here, versus developers getting a uniform environment.
It seems totally reasonable to assume that byte-equality of base64 input implies byte-equality of the output, and its not being so is pretty bad?
I don't think that's a reasonable assumption, since in so many implementations (including the web platform itself) it doesn't hold.
Where in the HTML spec is forgiving-base64 decode invoked for data URLs?

Those use https://infra.spec.whatwg.org/#forgiving-base64 . The decode there is different from the RFC in that it defines error handling. Note that forgiving base64 decode is used by browsers for all their base64 needs, e.g. data: URLs, Blobs.

IMO it would be bad to mismatch browsers.
I think theoretical purity should be lower on the priority of constituencies here, versus developers getting a uniform environment.
It's not my reading that anyone is arguing for theoretical purity here. My reading of @bakkot's argument on the priority of constituencies is that it serves the end user by making the new API less prone to insecure misuse (though you disagree here).
I don't think that's a reasonable assumption, since in so many implementations (including the web platform itself) it doesn't hold.
My takeaway is different. Because expecting a binary encoding to be bijective is reasonable, and yet the reality is that it doesn't hold in JS implementations, I'm inclined to think that the sloppiness is rare in practice.
Edit: Real web devs please chime in about the reasonableness claim I make here. Seems reasonable to me, but maybe it's a gotcha everyone knows?
I would expect the byte string returned from `atob` to exactly match a byte string created from the result of `ArrayBuffer.fromBase64()` for all input values (so forgiving base64 decode, with an option to enable it).
Also, `ArrayBuffer.fromBase64("SG", { padding: true })` == `ArrayBuffer.fromBase64("SG")` being true, with `ArrayBuffer.fromBase64("SG", { padding: false })` being the opt-out, is in violation of web platform design principles, so I would strongly oppose this particular API. Omitting a value in an options bag should result in it being considered falsy. The argument should be switched to `disablePadding` (with the boolean value inverted), if the default continues to be `padding: true`.
I think the JavaScript ecosystem is pretty uniform here as to what base64 means to them
The JS ecosystem is not uniform. The base64-js package, which is substantially more commonly used than the js-base64 package you link, does check padding length.
I think theoretical purity should be lower on the priority of constituencies here
As Shu says, this argument has nothing to do with theoretical purity, only about security of applications, which affects users directly.
Answering @syg's question:
Has this been fixed since then (to use a different Java printer or something), or is there still ongoing dependence on loose decoders?

Nope, still present. We're using Google's standard Java base64 encoder, but we're not calling `omitPadding()`, so I don't know why it's happening. Could be present everywhere in google3 for all I know.
How does the rest of the stack deal with the looseness?
The parser we use works just fine without padding. absl's unescape has special handling for padding chars, falling into a slow path. Omitting the padding actually makes the parser faster. 🤷‍♂️
A number of comments have alleged that one of the options is better for security than the other, but without presenting any good evidence so far. An observation was made that the relaxed decoding is not bijective. That's true but I'm not sure what the point is because the strict decoding is not bijective either.
Is there any evidence? The only one I've seen so far is that security problems tend to arise when there are two or more subtly different ways of parsing adopted by an ecosystem, so it's usually best to stick to the dominant one.
the strict decoding is not bijective either
Isn't it? If you consider the domain to be the space of all strings which do not error when passed to the decoder and the codomain to be the space of all byte sequences, I'm pretty sure strict decoding is one-to-one (injective), and it's obviously onto (surjective). (By "strict" I mean decoding which validates padding both for the `=` bytes and for the possible extra bits in the last characters, and which errors when passed a string containing characters outside the alphabet.)
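By contrast, a quick illustration of the non-injectivity of the forgiving decoder, assuming `atob`'s behavior (the web's forgiving decoder):

```js
// Three distinct inputs, one output: padding may be omitted and non-zero
// "overflow" bits are silently discarded.
atob("SQ==") === atob("SQ");   // true — padding omitted
atob("SQ==") === atob("SR=="); // true — "SR==" has a non-zero overflow bit
```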
The RFC talks about "canonical encoding", which certainly sounds like "the decoder is one-to-one" to me.
Is there any evidence?
Well, I linked a couple of comments from cryptographers; I'm not sure what other evidence you're looking for.
In my experience it is very common for people to write code which assumes that base64 is canonical: for example, they compare base64 strings. And it's quite easy to run into a scenario where this assumption, if it does not hold, breaks your security. (As a concrete example of how this could happen, imagine an application which keeps a list of the base64 encodings of revoked keys and checks membership of user-supplied keys in this list by string equality.) I have a much harder time coming up with an example of a plausible program where rejecting incorrectly-padded inputs leads to a security vulnerability.
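A hypothetical sketch of that revocation-list scenario (the names and data here are illustrative, not from any real codebase):

```js
// The list stores the canonical base64 encoding of each revoked key.
const revokedKeys = new Set(["SQ=="]);

function isRevoked(userSuppliedBase64) {
  // String comparison silently assumes base64 is canonical...
  return revokedKeys.has(userSuppliedBase64);
}

// ...but under a forgiving decoder, "SQ" and "SR==" decode to the same bytes
// as "SQ==", so a trivially re-encoded revoked key slips past the check.
isRevoked("SQ");   // false — revoked key accepted anyway
isRevoked("SR=="); // false — revoked key accepted anyway
```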
@bakkot: I'm basing my point on what you wrote earlier in this thread:
"As far as I can tell the only differences in forgiving-base64 are that it allows you to omit padding and to have non-zero bits in 2 or 4 "overflow" bits."
There are two possible sources of differences:

- omitting the padding
- non-zero bits in the "overflow" bits

Either one can result in non-bijective mappings.
Apologies for the confusion. In addition to ignoring incorrect padding, forgiving-base64 also strips all whitespace by default. I would prefer to reject any characters outside of the alphabet by default, and have a way to opt in to ignoring whitespace.
As I understand it browser code bases have a lot of base64[url] libraries in them and I do not think anyone has done the work of auditing all of them. (I mainly noticed when working on `data:` URLs that they shared logic with `atob`/`btoa`.) E.g., https://w3c.github.io/push-api/ uses base64url and I'm not sure to what extent that has been tested with regards to padding. It would be interesting to see a more thorough analysis.
https://w3c.github.io/push-api/ uses base64url and I'm not sure to what extent that has been tested with regards to padding
Thanks for the example. The Push spec defers to RFC7515, which specifies encoding is to be done "using the URL- and filename-safe character set [...] with all trailing '=' characters omitted (as permitted by Section 3.2) and without the inclusion of any line breaks, whitespace, or other additional characters."
And indeed I observe that Chrome and Firefox both enforce the absence of any padding or whitespace in the `applicationServerKey`. (Blink also has a test to that effect.)
Although, somewhat horrifyingly, both Chrome and Firefox accept strings which have non-zero bits in the last character, possibly because RFC7515 makes no mention of how to handle that case (it may simply not have occurred to them - they were presumably aiming for a canonical encoding, since they require the absence of whitespace, so I imagine they would have required the extra bits be zero if they'd been aware of that possibility).
I also observe that Crypto accepts JWK-formatted keys, which are defined in the above-mentioned RFC7515, and which therefore also enforce the absence of whitespace or padding. And WebAuthn (like CSP) is phrased as comparing a user-provided value against the output of running a specific encoder, which means it is strict about everything.
To summarize, among the examples discussed so far:

- `atob`: base64, permits both presence and absence of `=`, extra bits, and whitespace
- `data:` uri: as above
- Push's `applicationServerKey`: base64url, requires absence of `=`, agnostic on extra bits, requires absence of whitespace
- Crypto's JWK keys: base64url, requires absence of `=`, requires absence of extra bits, requires absence of whitespace
- WebAuthn: base64url, requires absence of `=`, requires absence of extra bits, requires absence of whitespace

I'd definitely be interested in any other examples. There does not seem to be any kind of consistency at the moment, though; the claim that "forgiving base64 decode is used by browsers for all their base64 needs" seems false.
Is it false? What you have written above makes me think that for CSP they could use the same decoder, as the equality check only involves the encoder. And everything else seems to be base64url, which can indeed be different. It would be nice if they could all be the same base64url though.
I suppose it's false only if you regard base64-with-urlsafe-alphabet to be a kind of base64, but I think most people would. I do, at any rate.
What you have written above makes me think that for CSP they could use the same decoder as the equality check only involves the encoder.
It's phrased in terms of the encoder, but it could just as easily have been phrased in terms of performing a strict decoding - that is, where currently it says roughly "take the input, hash it, and apply base64 encoding to the result; take the provided string and replace `-` with `+` and `_` with `/`; accept iff the two strings are identical", it would be equivalent to say (and is probably best understood to mean) "take the input and hash it; take the provided string, replace `-` with `+` and `_` with `/`, and apply strict base64-decoding to the result (rejecting on failure); accept iff the two byte sequences are identical".
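To make the equivalence concrete, here is a sketch of both phrasings using Node APIs (note that `Buffer.from` is itself forgiving, so a real strict decoder would need extra validation - the point is only the shape of the two checks):

```js
import { createHash } from "node:crypto";

// Phrasing 1: encode the hash, compare strings.
function acceptViaEncoder(provided, scriptText) {
  const expected = createHash("sha256").update(scriptText).digest("base64");
  const normalized = provided.replaceAll("-", "+").replaceAll("_", "/");
  return normalized === expected;
}

// Phrasing 2: strict-decode the provided string, compare byte sequences.
function acceptViaStrictDecode(provided, scriptText) {
  const expectedBytes = createHash("sha256").update(scriptText).digest();
  const normalized = provided.replaceAll("-", "+").replaceAll("_", "/");
  const decoded = Buffer.from(normalized, "base64"); // stand-in for a strict decoder
  return decoded.equals(expectedBytes);
}
```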
I'm looking at this from the perspective of what the minimal abstraction would be that satisfies all these callers.
It seems to me that if removal of whitespace and `=` code points was a pre-processing step done by forgiving base64, it and base64url needs could share the remaining algorithm. (Assuming we continue to use the encoder for WebAuthn and CSP and I don't really see why we wouldn't.) I guess I'd lean towards exposing that through an API (and requiring those that want to match `data:` URLs and atob/btoa to implement the pre-processing step).
It seems to me that if removal of whitespace and `=` code points was a pre-processing step done by forgiving base64, it and base64url needs could share the remaining algorithm.
I'd prefer to phrase the "removal of valid `=` padding, if present" part as "addition of `=` as necessary to have valid padding", at which point it is the decoding this proposal currently defaults to (i.e., the canonical base64 decoding).

(Except for the question of whether to reject excess non-zero bits in the last characters, though again that could be understood as forgiving-base64 performing normalization of those bits before invoking this API.)
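A small sketch of that rephrasing (the helper name is illustrative): pad the input up to a multiple of 4, then hand it to a canonical (strict) decoder.

```js
// Add '=' as necessary to produce valid padding; a length of 4n+1 can never
// be valid base64, so reject it outright.
function normalizePadding(unpadded) {
  const remainder = unpadded.length % 4;
  if (remainder === 1) throw new SyntaxError("invalid base64 length");
  return remainder === 0 ? unpadded : unpadded + "=".repeat(4 - remainder);
}

normalizePadding("SQ"); // "SQ==" — ready for a strict/canonical decoder
```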
Well, the fact that ignoring bits is also done for base64url makes me think it's across all callers, no?
It's not done for CSP or WebAuthn.
(It's also just a perverse thing to do; I'm confident it could be changed across the web platform without breaking anything. There is no reasonable way to end up with extra bits without either leaking memory or malicious input. So that is a difference which no one would notice except if it caused them to reject malicious input which they'd previously been accepting.)
Given the way CSP and WebAuthn are defined that's not observable, right?
As a user, what I see is that if I have a script containing just `console.log('hi')`, and I have a CSP with `script-src 'sha256-1ohZFo3B9w3UOFBbfx6JSomkpkME90iPs1r/qXzvX7Y='`, then my script will load, but if I change the trailing `Y` to a `Z` (thus setting an extra non-zero bit in the last character), then my script will not load.
(Per spec and Firefox; Chrome does not correctly implement the spec.)
So as a user it certainly seems observable.
If that's done by encoding `console.log('hi')` and then comparing, how would you notice, as it only affects the decoder? (If Chrome in fact decodes the `script-src` line and then compares, that would be even more evidence that the decoders all have that behavior.)
As a user, I neither know nor care whether the specification happens to be phrased as "base64-encode the script's sha and compare to the provided string" or "strict base64-decode the provided string and compare to the script's sha".
What matters is that the behavior I actually observe is that this is a place where I am providing a base64-encoded string, and if I provide a string which has extra non-zero bits then it will not work. As such, it would be consistent with that behavior for the decoding API in this proposal to likewise reject base64-encoded strings which have extra non-zero bits.
I looked into this further; WebKit essentially has these decoding modes in terms of code paths being used:
Notably, none of these enforce the "Canonical Encoding" of the RFC, and for base64url it doesn't enforce padding either.
Two thoughts so far:
(For encoding there are exactly two modes, Default and URL, those are hopefully less controversial.)
Ugh. Well, this is certainly a messy state of affairs.
I am, I guess, ok with having the default behavior for the decoder be to not enforce anything about padding and to strip whitespace. In principle this has weird and somewhat subtle security implications in some applications, so I'd really like to default to only allowing canonical encodings, but the fact that nothing else enforces canonicity does have weight.
@annevk
Default, enforce padding and ignore whitespace (forgiving base64)
forgiving base64 does not enforce padding (either the excess bits or the presence of `=`), so I'm confused by this.
I guess it does enforce that padding `=` characters, if present, are consistent - i.e. there's exactly 1 or 2 and they bring the length of the string up to a multiple of 4. Is that what you meant?
Yeah, that was sloppy wording on my part.
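For concreteness, a sketch of what "consistent padding, if present" means, assuming `atob`'s forgiving behavior:

```js
atob("SQ==") // ok — correctly padded
atob("SQ")   // ok — padding omitted entirely
atob("SQ=")  // throws InvalidCharacterError — padding present but inconsistent
```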
This conversation has mostly focused on padding, but there's also a possible whitespace consideration: https://github.com/tc39/proposal-arraybuffer-base64/pull/27#discussion_r1375304050
I find it strange that the WHATWG definition of ASCII whitespace excludes U+000B VERTICAL TABULATION, but it does, and that would now affect ECMAScript. Accepting `AA\u000C==` while rejecting `AA\u000B==` presumably doesn't matter much, but is still worth explicit recognition.
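A concrete illustration, assuming the forgiving behavior `atob` uses today:

```js
// WHATWG "ASCII whitespace" is TAB, LF, FF, CR, and SPACE — but not U+000B.
atob("AA\u000C==") // ok — the form feed is stripped before decoding
atob("AA\u000B==") // throws — vertical tab is not ASCII whitespace
```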
Right now my plan is for the default to exactly match `atob`, with a `strict: true` option which would enforce canonical encodings.

I don't think there's much reason to accept the weirder kinds of whitespace. I don't think anyone will literally ever run into the lack of U+000B support, so I'd prefer to stick with matching `atob`.
Ended up slightly different from above, so to summarize: with `lastChunkHandling: 'strict'`, padding must be included, and the padding bits must be 0, but whitespace is still ignored (so to enforce canonical encodings you must enforce absence of whitespace yourself).
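A usage sketch of the behavior summarized above, written against the `fromBase64` API as discussed in this thread (exact method placement and option defaults are whatever the proposal ultimately specifies):

```js
// Default: padding may be omitted.
ArrayBuffer.fromBase64("SQ");

// With lastChunkHandling: 'strict', padding is required and padding bits must be 0...
ArrayBuffer.fromBase64("SQ==", { lastChunkHandling: "strict" }); // ok
ArrayBuffer.fromBase64("SQ",   { lastChunkHandling: "strict" }); // throws — missing padding
ArrayBuffer.fromBase64("SR==", { lastChunkHandling: "strict" }); // throws — non-zero padding bits

// ...but whitespace is still ignored, so checking for a canonical encoding
// requires a separate whitespace check.
ArrayBuffer.fromBase64(" SQ== ", { lastChunkHandling: "strict" }); // ok
```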