Subresource Integrity integration

yoavweiss commented 6 years ago

How would a binary AST encoded resource be delivered when SRI is involved? How should the hash be calculated on both the encoder and the decoder sides?

Yoric commented 6 years ago

If I understand correctly, the hash is independent from the content encoding, right? If so, that's going to be complicated.

kannanvijayan-zz commented 6 years ago

There are two options here: one is to hash the (normalized) source text. Another is to hash the "simple" encoding of BinAST, prior to any compression steps.

The latter seems more appropriate in this case.

yoavweiss commented 6 years ago

I don't think it's appropriate for a transfer-encoding to change the semantics of SRI (e.g. gzip and brotli don't)

kannanvijayan-zz commented 6 years ago

@yoavweiss I'm not sure I understand where there would be a need for semantic changes. Could you elaborate?

yoavweiss commented 6 years ago

Currently SRI hashes are hashes of the content before gzip/brotli are applied. If AST encoding is just a content encoding, the same principles should apply, and SRI hashes should be calculated before AST encoding and after AST decoding is applied.
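A minimal sketch of that principle, assuming gzip as the content encoding and a made-up script body (`sri_hash` and `source` are hypothetical names, not part of any spec or browser API): the published hash covers the decoded bytes, so it matches only after the client has undone the encoding.

```python
import base64
import gzip
import hashlib


def sri_hash(body: bytes, algorithm: str = "sha384") -> str:
    """Return an SRI-style token: '<algorithm>-<base64 digest>'."""
    digest = hashlib.new(algorithm, body).digest()
    return f"{algorithm}-{base64.b64encode(digest).decode('ascii')}"


# Hypothetical script body; any bytes would do.
source = b"console.log('hello');\n"

# What the page author publishes in integrity="...": a hash of the decoded content.
published = sri_hash(source)

# What travels on the wire when the server applies Content-Encoding: gzip.
on_the_wire = gzip.compress(source)

# The client decodes first and hashes afterwards, so the encoding is invisible to SRI.
received = gzip.decompress(on_the_wire)
matches_after_decoding = sri_hash(received) == published
matches_wire_bytes = sri_hash(on_the_wire) == published
```

Here `matches_after_decoding` is true and `matches_wire_bytes` is false, which is the behaviour a lossless encoding gets for free and a lossy one cannot reproduce.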

kannanvijayan-zz commented 6 years ago

But can't we express this simply as a hash variant, which is already a supported concept in SRI?

More formally, a hash value H(F(x)), where F is some normalization function under some equivalence class we care about, can simply be restated as G(x) where G is treated as a slightly modified (but trivially of equivalent strength) hash function.

This really feels more like a nit than an actual issue of semantics.
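The H(F(x)) formulation can be sketched roughly as follows. This is an illustration only: Python's own `ast` module stands in for a JavaScript parser (an assumption made purely so the sketch is self-contained), and `normalize` and `normalized_hash` are hypothetical names.

```python
import ast
import hashlib


def normalize(source: str) -> bytes:
    """F: map source text to a canonical form that ignores comments and layout.

    Assumption: Python's parser is used here as a stand-in for a JS parser.
    """
    return ast.dump(ast.parse(source)).encode("utf-8")


def normalized_hash(source: str) -> str:
    """G(x) = H(F(x)), with H = SHA-384."""
    return hashlib.sha384(normalize(source)).hexdigest()


original = "x = 1  # set x\nprint(x)\n"
reformatted = "x=1\n\n\nprint( x )\n"
different = "x = 2\nprint(x)\n"

same_under_g = normalized_hash(original) == normalized_hash(reformatted)
differs_under_g = normalized_hash(original) != normalized_hash(different)
```

Note that computing G requires a full parse of the input, since F is defined on the syntax tree rather than on the bytes.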

kannanvijayan-zz commented 6 years ago

@yoavweiss Hold on, I think I understand the problem a bit better now. I see where the issue is.

The problem is we're multiplexing the URL to serve both BinAST and plain JS files, but we only have one SRI hash to serve up.

kannanvijayan-zz commented 6 years ago

It seems there isn't a way to slice this salami without introducing a hash specifically for the BinAST code. This would require the referrer page to include two hashes. From a standards perspective this is not a major issue: it is an extra hint attribute that will be ignored by other browsers. Firefox, when requesting SRI-checked resources, would add the BinAST MIME type to the Accept header when it detected the presence of the second hash, and verify using that.

The problem here is that it requires changes on the content provider end - the referrer page must be modified.

However - I'd assume that SRI hashes are generated by toolchains these days anyway (as you'd want to recompute them on changes to source). Is that the case? If so, whatever process does that should be modifiable to also produce a BinAST hash and include it as well.

@yoavweiss What do you think?

yoavweiss commented 6 years ago

> The problem here is that it requires changes on the content provider end - the referrer page must be modified.

Yeah, that adds a lot of complexity to the developer's flow and forces the page to know if some of its scripts will be binAST encoded, and if so, add two hashes instead of one.

> However - I'd assume that SRI hashes are generated by toolchains these days anyway (as you'd want to recompute them on changes to source)

If you have a script that blindly adds SRI hashes, that won't help you if/when the origin gets hacked (which is a major use-case for SRI).

Overall, this seems like a discussion that should happen with the SRI folks.

/cc @mikewest

mikewest commented 6 years ago

@mozfreddyb, @fmarier, @metromoxie, and @devd are the "SRI folks". :)

@otherdaniel might also have thoughts.

Also, https://tools.ietf.org/html/draft-thomson-http-mice-03 is relevant.

otherdaniel commented 6 years ago

My context on 'binary AST' is a bit outdated, but my understanding is:

There is not (and cannot be) a 1:1 mapping between source and "binary AST". E.g., the binary AST drops source code comments, non-relevant whitespace (that is, whitespace outside of string/template literals), and maybe (some?) variable names. If so, you cannot reconstruct the original. If so, you also cannot compute the original's hash. (The FAQ makes a similar point.)
I'd know in theory how to build a hash that can survive this transformation (by normalizing the expendable parts in either representation, and then hashing), but that would effectively force a nearly-complete parsing step during hash calculation. I'm going to suggest that isn't happening.
I think that considering 'binary AST' to be a transfer encoding of .js is just not tenable. (Also for other reasons, like the "Early Error Semantics" chapter on the page.) I'd think the .js file and its 'binary AST' are for all intents and purposes separate resources, each with its own hash sums (over their respective byte sequence representations).
It's up to whoever includes that separate resource to also supply the appropriate SRI attributes. If both resources are served under the same URL, then all hash-sums should be in the integrity=... attribute.

I suspect this answer won't make Yoav very happy, but I'm having a really hard time imagining a solution where 'binary AST' could be served transparently and with integrity. 'Binary AST' just does a lot more than a mere content encoding could be expected to.
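A rough sketch of what "all hash-sums in the integrity attribute" could look like at build time, assuming SHA-384 for both representations. The helper names and the placeholder bytes are hypothetical; a real toolchain would read the two files exactly as they will be served.

```python
import base64
import hashlib


def sri_token(body: bytes, algorithm: str = "sha384") -> str:
    """Return '<algorithm>-<base64 digest>' for one byte representation."""
    digest = hashlib.new(algorithm, body).digest()
    return f"{algorithm}-{base64.b64encode(digest).decode('ascii')}"


def integrity_value(*representations: bytes) -> str:
    """Join one token per representation with spaces, as integrity="..." expects."""
    return " ".join(sri_token(body) for body in representations)


# Placeholder bytes (assumptions): stand-ins for the plain JS and BinAST files.
plain_js = b"console.log('hello');\n"
binast = b"\x00placeholder-binast-bytes"

value = integrity_value(plain_js, binast)
```

Each hash is taken over its own byte sequence, so no normalization or parsing step is involved.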

Yoric commented 6 years ago

> There is not (and cannot be) a 1:1 mapping between source and "binary AST". E.g., the binary AST drops source code comments, non-relevant whitespace (that is, whitespace outside of string/template literals), and maybe (some?) variable names. If so, you cannot reconstruct the original. If so, you also cannot compute the original's hash. (The FAQ makes a similar point.)

Variable names are maintained, and we have ideas for making source code comments stripping optional, but yes, that's the general idea.

> I'd know in theory how to build a hash that can survive this transformation (by normalizing the expendable parts in either representation, and then hashing), but that would effectively force a nearly-complete parsing step during hash calculation. I'm going to suggest that isn't happening.

Ah, well, I was about to suggest that.

Out of curiosity, when (and how often) is hash calculated?

kannanvijayan-zz commented 6 years ago

Talking around the office, a colleague observed that what we are trying to do here is in effect comparable to srcset tags on images. Conceptually: we want to identify an abstract resource which can be supplied by one or more different (but equivalent, under some criteria) representations of it.

In general I agree with @otherdaniel's assessments. I'm not sure I agree on the "it's not a content encoding" bit. We're running into this issue because we're using hashes to check resource integrity, and hashes are inherently tied to the representation of a particular piece of content.

They're convenient because representational equality subsumes all other equivalence class models axiomatically. As you noted, theoretically we could store delta(normalizedJS, originalJS) along with the BinAST representation and use that to satisfy the single SRI requirements. The reason we don't want to do this is purely performance and unnecessary complexity.

Philosophical waxing aside, though, I agree it seems we can't slide this through purely transparently on a mime-type basis and still keep SRI support.

mozfreddyb commented 6 years ago

Hold on. You can easily support binary AST with SRI as it is!

Example:

<script src="https://example.com/example-framework.js"
        integrity="sha384-hash-of-normal-JS-file
                   sha384-hash-of-binary-ast-file"
        crossorigin="anonymous"></script>

The user agent will notice that there are multiple hashes with the same strength (i.e., sha384), so only one of them has to match. User agents supporting binary AST will receive a file that matches the second hash. User agents without support will receive the JS file, which then matches the first hash.

(This is a rephrasing of Example 7 in the SRI specification. I've quoted it for this example and rephrased for clarity, but feel free to read the original source!)
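The matching rule described above can be sketched as follows. This is a simplified illustration, not a browser implementation: `verify` and `sri_token` are hypothetical names, option suffixes on tokens are not handled, and the sample bytes are placeholders.

```python
import base64
import hashlib

# Algorithms defined for SRI, ordered from weakest to strongest.
STRENGTH = ["sha256", "sha384", "sha512"]


def sri_token(body: bytes, algorithm: str) -> str:
    digest = hashlib.new(algorithm, body).digest()
    return f"{algorithm}-{base64.b64encode(digest).decode('ascii')}"


def verify(body: bytes, integrity: str) -> bool:
    """Return True if body matches any token of the strongest algorithm listed."""
    tokens = [t for t in integrity.split() if t.split("-", 1)[0] in STRENGTH]
    if not tokens:
        return True  # no usable metadata, so nothing to enforce
    strongest = max((t.split("-", 1)[0] for t in tokens), key=STRENGTH.index)
    candidates = [t for t in tokens if t.startswith(strongest + "-")]
    return sri_token(body, strongest) in candidates


# Placeholder bytes (assumptions): stand-ins for the two representations.
plain_js = b"console.log('hello');\n"
binast = b"\x00placeholder-binast-bytes"
integrity = f"{sri_token(plain_js, 'sha384')} {sri_token(binast, 'sha384')}"

js_ok = verify(plain_js, integrity)
binast_ok = verify(binast, integrity)
tampered_ok = verify(b"alert('pwned');\n", integrity)
```

Either representation passes, while content matching neither hash is rejected.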

otherdaniel commented 6 years ago

@Yoric Currently, in Chrome/Chromium, the hashes are checked once, after the network has delivered the last byte to the renderer, just before the resource is being used. There is a very annoying but hard to fix bug where sometimes that doesn't work and we reload and recheck the resource. The intent is to move this 'lower' into the browser process or network service, although I'm not sure if or when this is happening.

@kannanvijayan Granted, one can see the "content encoding" thing either way.

One additional thought: Hashes apply universally to all resource types, have well-understood security properties, and are hard to misuse. I bet that once a js-equals-binast-hash is created, some clown will create a pair of .css files (or other resources) that are equivalent under that hash but have otherwise quite different properties. And while obviously a js-specific hash shouldn't be applied to non-js resources, similar things have happened elsewhere (e.g. MIME-type confusion attacks) and this might lead to similar problems.

@mozfreddyb Yes, that works as of today. I think the use case implied here is that 'binary AST' can be applied transparently by the web server or a CDN, just like those instances could decide to apply gzip without requiring the page author to change the page. I find that a super valid use case, and without a capability like that deployment will be a good bit harder. But so far I'm not seeing a good mechanism that would facilitate that.

--

Generally speaking, I expect a custom hash with any appreciable complexity is going to be a very hard sell, to both implementor and security communities.

kannanvijayan-zz commented 6 years ago

@mozfreddyb I did not realize the integrity attribute supported multiple hashes! Thanks for bringing that to our attention.

As @otherdaniel noted, it doesn't get us to full mimetype-only level transparency, but it's still a far step above another hint attribute on script tags. Good to know!

tc39 / proposal-binary-ast

Subresource Integrity integration #48