How does the base64 chosen here compare to atob/btoa?

domenic commented 3 years ago

Those use https://infra.spec.whatwg.org/#forgiving-base64 . The decode there is different than RFC in that it defines error handling.

Note that forgiving base64 decode is used by browsers for all their base64 needs, e.g. data: URLs, Blobs.

IMO it would be bad to mismatch browsers.

bakkot commented 3 years ago

As far as I can tell the only differences in forgiving-base64 are that it allows you to omit padding and to have non-zero bits in 2 or 4 "overflow" bits. I think we likely want to have a mode which supports such non-canonical input, but I'm very reluctant to make it the default: it is very common for security issues to arise from users assuming that parsers are strict about checking padding when in fact they are forgiving. I prefer to be strict by default and to be looser only as an opt-in.

Edit: handling of whitespace is also different; see discussion below.

domenic commented 3 years ago

Hmm. I'd be against shipping anything as the default which doesn't match the web platform default, i.e. which can't decode base64 data: URLs.

syg commented 3 years ago

All other things being equal, I do think securer defaults are good.

My feelings here are that the alignment with forgiving base64 as used on the web is an empirical question: how often would a user encounter a base64 payload that requires a forgiving decoder? Any history around why such a forgiving decoder was needed in the first place? Was it a historical artifact of the web at the time, and we've largely moved on, or are such sloppy payloads still widespread?

domenic commented 3 years ago

I don't know; that'd be an interesting use counter to add. If we did so and changed the web defaults across browsers then yeah, that'd be one path.

bakkot commented 3 years ago

I'd be against shipping anything as the default which doesn't match the web platform default

Why? Is there a specific confusion you are worried about which would be caused by various web platform APIs being slightly more forgiving in what they accepts than the default for this API?

how often would a user encounter a base64 payload that requires a forgiving decoder?

It's pretty rare. As far as I am aware every major programming language includes the padding by default, as do, for example, the canvas toDataURL methods in Chrome, Safari, and Firefox. It is after all required by the RFC. And I am not aware of any way that you can end up with the extra bytes being nonzero other than a.) an actively malicious producer or b.) leaking memory.

domenic commented 3 years ago

Why? Is there a specific confusion you are worried about which would be caused by various web platform APIs being slightly more forgiving in what they accepts than the default for this API?

Yes. I think it's bad if web developers who want to implement the same semantics as browsers (pretty common, e.g. parsing data: URLs) need to reimplement the base64 decoder from scratch. Or worse, they try the language built-in one, and only after some painful experience realize that it doesn't match what they want.

I also think its bad if atob and ArrayBuffer.fromBase64() give different results.

domenic commented 3 years ago

As a concrete example, I think it'd be bad if packages such as https://github.com/jsdom/data-urls or similar were not able to use this language feature.

bakkot commented 3 years ago

I think it's bad if web developers who want to implement the same semantics as browsers (pretty common, e.g. parsing data: URLs) need to reimplement the base64 decoder from scratch.

That's a strong argument for supporting the same semantics as used by the web platform, which I definitely agree with, but not (that I can tell) an argument for making it the default. If you try to use the strict default and it doesn't work for you, the error message could easily say "if you want to skip validation of padding, pass { padding: false }" or something to that effect, so I don't expect it to be very painful if encountered - and as mentioned above, I don't expect it to be encountered very often except with malicious input, which is precisely the input you want to reject.

I also think its bad if atob and ArrayBuffer.fromBase64() give different results.

Well, atob gives a string, not a buffer, so they definitely are going to give different results. Other than that, the only difference would be that atob would accept some things that fromBase64 would reject by default. That seems like it is not so severe a problem to me, unless there is some specific confusion you are worried about.

jridgewell commented 3 years ago

Chiming in with a real world example: AMP has a weird protobuf which we encode into some URLs, and for some reason, the Java printer for this protobuf omits the padding characters. I have investigated, but have no idea why this happens.

bakkot commented 3 years ago

Incidentally, here's two cryptographers at Google who were surprised by the non-strictness of many implementations of base64 decoding and speculated that it had probably lead to vulnerabilities: @SchmiegSophie and @FiloSottile.

I am pretty well convinced that we should default to the stricter thing, and have opt-in looseness for use cases such as jsdom's and AMP's which need it.

domenic commented 3 years ago

Can we have it an option for web browsers to switch the default (like in Annex B or something)? I think web platform consistency is important and I'd rather not expose a second base64 decoder on the web, but if at least we kept the default the same in this API as on the rest of the web platform (relegating the non-web variant to a non-default option) I would be able to stop from objecting to the proposal.

bakkot commented 3 years ago

I would like to understand your reasoning better before commenting on that option. I didn't understand your above answer; as I said, it seemed like an argument for making the looser, less secure behavior available, but not making it the default, and I still don't understand why you want that enough to override the security concerns pointed out by cryptographers at your own employer.

domenic commented 3 years ago

Web platform consistency is important. Having different ways of performing base64 decoding exposed on the web platform is suboptimal, and making developers have to read the docs before they get consistency is IMO a nonstarter.

/cc @annevk as he has worked quite hard to get the web platform on the same page here in its various base64 APIs, and I imagine Mozilla might have similar concerns about exposing (much less making the default) one which does not match the rest of the web platform.

bakkot commented 3 years ago

I agree that consistency is valuable, but this would hardly be the first time that a modern version of a legacy API took the opportunity to simultaneously choose stricter defaults. Most developers won't see a difference, and the ones who do - that is, those who actually do end up being given incorrectly padded input - should be thinking about whether they actually want to accept such inputs.

syg commented 3 years ago

Chiming in with a real world example: AMP has a weird protobuf which we encode into some URLs, and for some reason, the Java printer for this protobuf omits the padding characters. I have investigated, but have no idea why this happens.

@jridgewell Has this been fixed since then (to use a different Java printer or something), or there's still ongoing dependence on loose decoders? How does the rest of the stack deal with the looseness?

syg commented 3 years ago

@domenic I disagree that lack of consistency here is a hard blocker. It's a goal, other things being equal, we should shoot for. In this proposal, the other thing here is security, which is not equal, so the crux to me remains the claim from @bakkot that "most developers won't see a difference". If that turns out to be true, the stricter default seems strictly better.

I find the tweets pretty convincing evidence of the badness of the footgun. It seems totally reasonable to assume that byte-equality of base64 input implies byte-equality of the output, and its not being so is pretty bad?

That said, having a host-configurable default is a compromise I can live with, but for a proposed language facility, it seems like a good opportunity to nudge future uses away from the non-bijective footgun. Ideally, if we can get away with it the right thing here seems to be to change the default for HTML to be the bijective one as well. While we can retrofit options bags or something to atob, how to do that for data: URLs and the like isn't clear to me. Where in the HTML spec is forgiving-base64 decode invoked for data URLs? The only direct invocation I see in the HTML spec is in atob, and the data: URL RFC doesn't talk about which decoding algorithm to use.

Finally, there's been one anecdote from AMP that there's a weird protobuf they've run into. I would like to hear some more.

domenic commented 3 years ago

I ran some tests to see what the non-browser ecosystem is doing. Given the input string "SQ" (non-padded):

Node Buffer.from("SQ", "base64"): accepts it without complaint, giving back the right bytes
Deno standard library: returns the right bytes
js-base64: Base64.isValid() returns true and its decode(), atob(), and toUint8Array() functions return the right strings/bytes.

I think the JavaScript ecosystem is pretty uniform here as to what base64 means to them, and TC39 should not try to use its position to change what's accepted practice and defaults.

If that turns out to be true, the stricter default seems strictly better.

I don't agree with that. I don't think stricter is better here. Accepting input that the rest of the JS ecosystem accepts is a good thing and we should require it. It's a tradeoff between, IMO, needless compliance to standards that people are not following anyway, versus alignment with the ecosystem. I think theoretical purity should be lower on the priority of constituencies here, versus developers getting a uniform environment.

It seems totally reasonable to assume that byte-equality of base64 input implies byte-equality of the output, and its not being so is pretty bad?

I don't think that's a reasonable assumption, since so many implementations (including the web platform itself) it doesn't hold.

Where in the HTML spec is forgiving-base64 decode invoked for data URLs?

See https://fetch.spec.whatwg.org/#data-urls

syg commented 3 years ago

I think theoretical purity should be lower on the priority of constituencies here, versus developers getting a uniform environment.

It's not my reading that anyone is arguing for theoretical purity here. My reading of @bakkot's argument on the priority of constituencies is serving the end user by having a new API that's less prone to misuse insecurely (though you disagree here).

I don't think that's a reasonable assumption, since so many implementations (including the web platform itself) it doesn't hold.

My takeaway is different. Because both expecting a binary encoding to be bijective is reasonable and the reality is that it doesn't hold in JS implementations, I'm inclined to think that the sloppiness is rare in practice.

Edit: Real web devs please chime in about the reasonableness claim I make here. Seems reasonable to me, but maybe it's a gotcha everyone knows?

lucacasonato commented 3 years ago

I would expect the byte string returned from btoa to exactly match a byte string created from the result of ArrayBuffer.fromBase64() for all input values. (so forgiving base64 decode, with an option to enable it).

Also ArrayBuffer.fromBase64("SG", { padding: true }) == ArrayBuffer.fromBase64("SG") being the true, with ArrayBuffer.fromBase64("SG", { padding: false }) being the opt out is in violation of web platform design prinicples, so I would strongly oppose against this particular API. Omitting a value in an options bag should result in it being considered falsy. The argument should be switched to disablePadding (with the boolean value inverted), if the default continues to be padding: true.

bakkot commented 3 years ago

I think the JavaScript ecosystem is pretty uniform here as to what base64 means to them

The JS ecosystem is not uniform. The base64-js package, which is substantially more commonly used than js-base64 package you link, does check padding length.

I think theoretical purity should be lower on the priority of constituencies here

As Shu says, this argument has nothing to do with theoretical purity, only about security of applications, which affects users directly.

jridgewell commented 3 years ago

Answering @syg's question:

Has this been fixed since then (to use a different Java printer or something), or there's still ongoing dependence on loose decoders?

Nope, still present. We're using Google's standard Java base64 encoder, but we're not calling omitPadding(), so I don't know why it's happening. Could be present everywhere in google3 for all I know.

How does the rest of the stack deal with the looseness?

The parser we use works just fine without padding. absl's unescape has special handling for padding chars, falling into a slow path. Omitting the padding actually makes the parser faster. 🤷‍♂️

waldemarhorwat commented 3 years ago

A number of comments have alleged that one of the options is better for security than the other, but without presenting any good evidence so far. An observation was made that the relaxed decoding is not bijective. That's true but I'm not sure what the point is because the strict decoding is not bijective either.

Is there any evidence? The only one I've seen so far is that security problems tend to arise when there are two or more subtly different ways of parsing adopted by an ecosystem, so it's usually best to stick to the dominant one.

bakkot commented 3 years ago

the strict decoding is not bijective either

Isn't it? If you considering the domain to be the space of all strings which do not error when passed to the decoder and the codomain to be the space of all byte sequences, I'm pretty sure strict decoding is one-to-one (injective), and it's obviously onto (surjective). (By "strict" I mean decoding which validates padding both for the = bytes and for the possible extra bits in the last characters, and which errors when passed a string containing characters outside the alphabet.)

The RFC talks about "canonical encoding", which certainly sounds like "the decoder is one-to-one" to me.

Is there any evidence?

Well, I linked a couple of comments from cryptographers; I'm not sure what other evidence you're looking for.

In my experience it is very common for people to write code which assumes that base64 is canonical: for example, they compare base64 strings. And it's quite easy to run into a scenario where this assumption, if it does not hold, breaks your security. (As a concrete example of how this could happen, imagine an application which keeps a list of the base64 encodings of revoked keys and checks membership of user-supplied keys in this list by string equality.) I have a much harder time coming up with an example of a plausible program where rejecting incorrectly-padded inputs leads to a security vulnerability.

waldemarhorwat commented 3 years ago

@bakkot: I'm basing my point on what you wrote earlier in this thread:

"As far as I can tell the only differences in forgiving-base64 are that it allows you to omit padding and to have non-zero bits in 2 or 4 "overflow" bits."

There are two possible sources of differences:

padding and overflow bits
whitespace

Either one can result in non-bijective mappings.

bakkot commented 3 years ago

Apologies for the confusion. In addition to ignoring incorrect padding, forgiving-base64 also strips all whitespace by default. I would prefer to reject any characters outside of the alphabet by default, and have a way to opt in to ignoring whitespace.

annevk commented 3 years ago

As I understand it browser code bases have a lot of base64[url] libraries in them and I do not think anyone has done the work of auditing all of them. (I mainly noticed when working on data: URLs that they shared logic with atob/btoa.) E.g., https://w3c.github.io/push-api/ uses base64url and I'm not sure to what extent that has been tested with regards to padding. It would be interesting to see a more thorough analysis.

bakkot commented 3 years ago

https://w3c.github.io/push-api/ uses base64url and I'm not sure to what extent that has been tested with regards to padding

Thanks for the example. The Push spec defers to RFC7515, which specifies encoding is to be done "using the URL- and filename-safe character set [...] with all trailing '=' characters omitted (as permitted by Section 3.2) and without the inclusion of any line breaks, whitespace, or other additional characters."

And indeed I observe that Chrome and Firefox both enforce the absence of any padding or whitespace in the applicationServerKey. (Blink also has a test to that effect.)

Although, somewhat horrifyingly, both Chrome and Firefox accept strings which have non-zero bits in the last character, possibly because RFC7515 makes no mention of how to handle that case (it may simply not have occurred to them - they were presumably aiming for a canonical encoding, since they require the absence of whitespace, so I imagine they would have required the extra bits be zero if they'd been aware of that possibility).

I also observe that Crypto accepts JWK-formatted keys, which are defined in the above-mentioned RFC7515, and which therefore also enforce the absence of whitespace or padding. And WebAuthn (like CSP) is phrased as comparing a user-provided value against the output of running a specific encoder, which means it is strict about everything.

To summarize, among the examples discussed so far:

atob: base64, permits both presence and absence of =, extra bits, and whitespace
data: uri: as above
Push: base64url, requires absence of =, agnostic on extra bits, requires absence of whitespace
Crypto: as above
CSP: mixed base64/base64url, requires presence of =, requires absence of extra bits, requires absence of whitespace
WebAuthn: base64url, requires presence of =, requires absence of extra bits, requires absence of whitespace

I'd definitely be interested in any other examples. There does not seem to be any kind of consistency at the moment, though; the claim that "forgiving base64 decode is used by browsers for all their base64 needs" seems false.

annevk commented 3 years ago

Is it false? What you have written above makes me think that for CSP they could use the same decoder as the equality check only involves the encoder. And everything else seems to be base64url, which can indeed be different. It would be nice if they could all be the same base64url though.

bakkot commented 3 years ago

I suppose it's false only if you regard base64-with-urlsafe-alphabet to be a kind of base64, but I think most people would. I do, at any rate.

What you have written above makes me think that for CSP they could use the same decoder as the equality check only involves the encoder.

It's phrased in terms of the encoder, but it could just has easily have been phrased in terms of performing a strict decoding - that is, where currently it says roughly "take the input, hash it, and apply base64 encoding to the result; take the provided string and replace - with + and _ with /; accept iff the two strings are identical", it would be equivalent to say (and is probably best understood to mean) "take the input and hash it; take the provided string, replace - with + and _ with /, and apply strict base64-decoding to the result (rejecting on failure); accept iff the two byte sequences are identical".

annevk commented 3 years ago

I'm looking at this from the perspective of what the minimal abstraction would be that satisfies all these callers.

It seems to me that if removal of whitespace and = code points was a pre-processing step done by forgiving base64, it and base64url needs could share the remaining algorithm. (Assuming we continue to use the encoder for WebAuthn and CSP and I don't really see why we wouldn't.) I guess I'd lean towards exposing that through an API (and requiring those that want to match data: URLs and atob/btoa to implement the pre-processing step).

bakkot commented 3 years ago

It seems to me that if removal of whitespace and = code points was a pre-processing step done by forgiving base64, it and base64url needs could share the remaining algorithm.

I'd prefer to phrase the "removal of valid = padding, if present" part as "addition of = as necessary to have valid padding", at which point it is the decoding this proposal currently defaults to (i.e., the canonical base64 decoding).

(Except for the question of whether to reject excess non-zero bits in the last characters, though again that could be understood as forgiving-base64 performing normalization of those bits before invoking this API.)

annevk commented 3 years ago

Well, the fact that ignoring bits is also done for base64url makes me think it's across all callers, no?

bakkot commented 3 years ago

It's not done for CSP or WebAuthn.

(It's also just a perverse thing to do; I'm confident it could be changed across the web platform without breaking anything. There is no reasonable way to end up with extra bits without either leaking memory or malicious input. So that is a difference which no one would notice except if it caused them to reject malicious input which they'd previously been accepting.)

annevk commented 3 years ago

Given the way CSP and WebAuthn are defined that's not observable, right?

bakkot commented 3 years ago

As a user, what I see is that if I have a script containing just console.log('hi'), and I have a CSP with script-src 'sha256-1ohZFo3B9w3UOFBbfx6JSomkpkME90iPs1r/qXzvX7Y=', then my script will load, but if I change the trailing Y to a Z (thus setting an extra non-zero bit in the last character), then my script will not load.

(Per spec and Firefox; Chrome does not correctly implement the spec.)

So as a user it certainly seems observable.

annevk commented 3 years ago

If that's done by encoding console.log('hi') and then comparing, how would you notice as it only affects the decoder? (If Chrome in fact decodes the script-src line and then compares, that would be even more evidence that the decoders all have that behavior.)

bakkot commented 3 years ago

As a user, I neither know nor care whether the specification happens to be phrased as "base64-encode the script's sha and compare to the provided string" or "strict base64-decode the provided string and compare to the script's sha".

What matters is that the behavior I actually observe is that this is a place where I am providing a base64-encoded string, and if I provide a string which has extra non-zero bits then it will not work. As such, it would be consistent with that behavior for the decoding API in this proposal to likewise reject base64-encoded strings which have extra non-zero bits.

annevk commented 1 year ago

I looked into this further, WebKit essentially has these decoding modes in terms of code paths being used:

Default, ignore padding
Default, enforce padding
Default, enforce padding and ignore whitespace (forgiving base64)
URL (ignores padding)

Notably none of these enforce "Canonical Encoding" of the RFC and for base64url it doesn't enforce padding either.

Two thoughts so far:

I'm not convinced it's worth adding novel code paths just for JavaScript.
I would be open to investigating whether more of these can be strict (though forgiving base64 has to remain as-is of course).

(For encoding there are exactly two modes, Default and URL, those are hopefully less controversial.)

bakkot commented 1 year ago

Ugh. Well, this is certainly a messy state of affairs.

I am, I guess, ok with having the default behavior for the decoder be to not enforce anything about padding and to strip whitespace. In principle this has weird and somewhat subtle security implications in some applications, so I'd really like to default to only allowing canonical encodings, but the fact that nothing else enforces canonicity does have weight.

@annevk

Default, enforce padding and ignore whitespace (forgiving base64)

forgiving base64 does not enforce padding (either the excess bits or the presence of =), so I'm confused by this.

I guess it does enforce that padding = characters, if present, are consistent - i.e. there's exactly 1 or 2 and they bring the length of the string up to a multiple of 4. Is that what you meant?

annevk commented 1 year ago

Yeah, that was sloppy wording on my part.

gibson042 commented 1 year ago

This conversation has mostly focused on padding, but there's also a possible whitespace consideration: https://github.com/tc39/proposal-arraybuffer-base64/pull/27#discussion_r1375304050

I find it strange that the WHATWG definition of ASCII whitespace excludes U+000B VERTICAL TABULATION, but it does, and that would now affect ECMAScript. Accepting AA\u000C== while rejecting AA\u000B== presumably doesn't matter much, but is still worth explicit recognition.

bakkot commented 1 year ago

Right now my plan is for the default to exactly match atob, with a strict: true option which would enforce canonical encodings.

I don't think there's much reason to accept the weirder kinds of whitespace. I don't think anyone will literally ever run into the lack of U+000B support, so I'd prefer to stick with matching atob.

bakkot commented 9 months ago

Ended up slightly different from above, so to summarize:

the default behavior exactly matches forgiving-base64 (though the algorithm is written differently so it can be shared)
if you specify lastChunkHandling: 'strict', then padding must be included, and the padding bits must be 0, but whitespace is still ignored (so to enforce canonical encodings you must enforce absence of whitespace yourself)

tc39 / proposal-arraybuffer-base64

How does the base64 chosen here compare to atob/btoa? #5