rails / sprockets

Rack-based asset packaging system
MIT License
932 stars 792 forks source link

Prefer MD5 as default hash instead of SHA256 #124

Closed Fjan closed 8 years ago

Fjan commented 8 years ago

The default digest recently went from MD5 to SHA256 in Sprockets 3. Can we please change that back to MD5 because I think it is wasteful and provides no benefit.

Note that there are good arguments to prefer SHA256 in favour of MD5 for other reasons, such as resilience against cryptographic attacks, but those reasons do not apply here. I know that Sprockets offers the ability to override the default digest but this can't be easily changed from Rails (yet), which is where most people use it from. (Considering the number of assets included on the average rails page, with billions of pages served by rails every day, this one change could save an entire rain forrest :smile: )

schneems commented 8 years ago

Thanks for the interest in the project. The original maintainers who made that decision aren't around anymore.

Can you find the commit or commits where this default was changed? I'm sure there's a backstory.

Do you have any benchmarks showing significant speed increase in a sprockets project by changing the hash method?

The easiest thing to make progress on this would be to submit a PR to sprockets-rails to add a easy to change config option. Ping me on a PR there and I'll help review.

Fjan commented 8 years ago

Thanks for the quick attention! I haven't been able to discover the reasoning, and I'm as curious as you are. I can find a blame in the configuration files pointing to @josh and the change from MD5 to SHA is mentioned in the changelog but I haven't found any related issue explaining the rationale behind the change.

There are benchmarks that show MD5 is 40-70% faster than SHA256 (example), but I think the primary reason to prefer it over SHA is to reduce page size, by having a smaller digest.

Also, digests by their nature have no redundancy and so reduce the ability of apache to zip / deflate the web page properly.

We'll start working on a PR.

schneems commented 8 years ago

Josh wrote most of this project, please don't @-mention him, he's off this project and needs some peace.

There are benchmarks that show MD5 is 40-70% faster than SHA256

I hear you, but we need to test overall time. If the time spent calculating digests may be trivial when compared to reading and writing files. If that's the case then 70% faster of trivial isn't going to save us much time. If it's a hotspot, maybe it will save us a ton of time.

If the proposed change was a micro optimization that is 100% backwards compatible I would consider merging even if speed savings isn't that much, however this is a breaking change. I treat performance patches like bug fix patches. The net benefit needs to outweigh whatever costs there are to making the change. If it speeds asset compilation up 50% then you bet your butt i'm going to spend a bunch more time trying to get a patch in. If it saves us less than 1% it's probably not worth pursuing.

Even writing the benchmarks to see if this is a performance hotspot are worthwhile in and of themselves.

but I think the primary reason to prefer it over SHA is to reduce page size, by having a smaller digest.

Do you have any examples we can benchmark here? Most assets are bundled together like css and js so the main asset only has one url. Which if you have one of each, this proposed changed would save you exactly 64 characters which would not come close to being detectable over a http request. Now those files may have multiple images, If you're using the same image multiple times than gzipping would take care of those extra characters (btw I want to bring sprockets produced gzipped assets back so apache doesn't have to do it on the fly). Even if all those images are different, the end user only has to pay the extra character penalty for that asset once as the browser will cache the file.

I agree this would save characters, i'm not 100% positive it would actually affect end user experience.

Without benchmarks we won't be able to measure progress.

Fjan commented 8 years ago

My intuition says the speed benchmarks will not show any drastic improvements for individual sites, although if digests are added to each image it can add up for some sites. Perhaps it sounds old fashioned (i'm old), but this is more about elegant engineering: By saving 64 bytes and a few thousand processor cycles on each page multiplied by the billions of pages served by Rails each day will probably do more good for the environment than, I don't know, a lifetime of biking instead of owning a car.

I noticed that all the tests have hard-coded output that assume the new default SHA256, so what I assumed would be a one-line change actually results in 44 failed tests. Giving up on the PR, sorry. If you ever find out the rationale behind the change I'd be interested though.

schneems commented 8 years ago

Thanks for the ping and the conversation.

My intuition says

You never know. I've been a downer in this issue but we might see some serious savings. I always push benchmarking because i've been wrong so many times. Even when i'm right I can still use the benchmarks to show improvement.

If you make that PR to sprockets-rails it would help me test out this proposed change pretty easily. I'm also wondering if there's any non-cryptographic hashes that are extremely fast and minimize hash collisions (which is what i'm guessing the original logic was to prevent).

Feel free to re-open this issue any time if you want to dig in some more.

Fjan commented 8 years ago

Thanks. A quick monkey patch shows that adding an option to the sprockets-rails gem to override the digest that Sprockets uses is indeed a one-line change so that may be easier, I'll look into doing a pull request there.

There is a comment in lib/sprockets/configuration.rb that says #digest_class= is deprecated though, and that's the one we need to use. Any idea why that would be deprecated?

rafaelfranca commented 8 years ago

The reason for using SHA256 by default is that some browser has a canary security feature to check assets integrity and this feature uses SHA256 as digest. See https://github.com/sstephenson/sprockets/pull/647

schneems commented 8 years ago

Blame shows https://github.com/sstephenson/sprockets/pull/647

schneems commented 8 years ago

Thanks @rafaelfranca here's the reason we can't use MD5 https://github.com/sstephenson/sprockets/pull/645#issuecomment-58115477

Fjan commented 8 years ago

Ah, thank you very much for digging that up and clarifying what happened. I think the participants in that discussion may be confused between hash collision and hash integrity. MD5 is indeed not safe anymore for integrity but the digest in Sprockets is actually only used to prevent collision, for which MD5 is perfectly OK. (Perhaps there is a use case where Sprockets would actually be used for integrity checking? I think that would be rare).

I don't think a browser would ever be able to say anything useful about the digest being used based on the asset name, this is probably a misunderstanding. There are hundreds of Rails apps in production using Sprockets 2 and I've never heard a browser complain about those. (And, if so, the browser is at fault).

Anyway, it would seem like trying to get this added as a user selectable option in sprockets-rails is the way to go. Do you think that the comment that suggests other digests will be deprecated also stems from the same (mis)understanding of hashes?

rafaelfranca commented 8 years ago

Yes, both sprockets have integrity check and browser are able to say something useful about the digest http://www.w3.org/TR/SRI/#htmlscriptelement-1

Fjan commented 8 years ago

Hi @rafaelfranca: That document is actually about resource integrity and browser can perhaps say something about that, but Sprockets uses the digest purely for resource collision, so that document does not apply here. Or are you saying there is use case where sprockets is actually involved in integrity checking of assets? Can you elaborate what that would be? Serving PDF files for a bank maybe, but I doubt you would use sprockets for that.

rafaelfranca commented 8 years ago

lol, wrong button.

Yeah, one of the reasons that sprockets changed everything to SHA256 is that now it can do resource integrity https://github.com/rails/sprockets/blob/2fd1a89b3e7eff241b87f1f2e51a813c3f172cb3/lib/sprockets/digest_utils.rb#L136-L141.

We could have changed only the resource integrity but for consistence we choose to change the assets digest to use the same algorithm, it still seems a fine decision to take. I don't think there is any strong reason to keep using MD5 or having it configurable.

Fjan commented 8 years ago

Well, I think needlessly adding 100+ bytes (at least one css and one js file) to pretty much every page served by the most popular web framework in the world for a use case that is exceedingly rare is extremely wasteful. 0.01% page growth for every internet page is probably the energy consumption of a small country! Think of the children :smile:!

Anyway, kidding aside, leaving in the option for the 99.9% who are not financial institutions serving PDFs would be much appreciated. I don't think actively removing the option is going to help anyone?

rafaelfranca commented 8 years ago

Yeah. We can undeprecate that option :+1:

cc @schneems

Fjan commented 8 years ago

Thanks! :+1:

rafaelfranca commented 8 years ago

I'll keep the option deprecated for 4.x and remove in 5.x if we don't have a strong reason to keep it. Thank you.

Fjan commented 8 years ago

What made you change your mind? I would say saving that many bytes per page is a great reason to keep the option, especially since it's already in there. Is there any particular reason to remove this useful option?

schneems commented 8 years ago

It's yet another code path to support. I'm not convinced that a few bytes are really what is causing your page to be so slow that a user will abandon it. If you're using something like https://robots.thoughtbot.com/content-compression-with-rack-deflater then those bytes will be even more insignificant. I think this is a performance red-herring.

I haven't looked into overall complexity. If un-deprecating this is trivial, I guess I don't mind keeping it in. The only down side I can see is that since you own your custom hashing scheme we can't support when something goes wrong, and we can also make less assumptions moving forwards. For instance if we wanted to bake in the referential integrity checks into Rails, if we don't control the algorithm used to generate the SHA then we would have to bake in multiple code paths. It's not always that a single feature is complex to implement, but that complexity spawns complexity elsewhere or makes making assumptions or other optimizations difficult.

Fjan commented 8 years ago

Perhaps a more principled point can convince you: MD5 is a superior hashing scheme for collision hashes, both in terms of time and space. A programmer who takes their craft seriously should aim for perfection. It is definitely true that you can use a deflater to counter the performance effects of a poor hash choice, but is that really an argument you are proud of? (by the way: a SHA256 does not deflate well, it has high entropy)

A less optimal hash (SHA256) was chosen mainly because the implementer didn't fully understand the difference between integrity hashes and collision hashes, although it does have the side effect that sprockets can now be used to generate integrity hashes, which is nice, I guess, although I would challenge you to find anyone actually using Sprockets for that purpose. But actively removing a more optimal solution just so there is less code to maintain does not seem like the right way forward.

rafaelfranca commented 8 years ago

I guess, although I would challenge you to find anyone actually using Sprockets for that purpose.

Both Shopify and GitHub are using Sprockets for that purpose. http://githubengineering.com/subresource-integrity/

I really don't want to bikeshed in the hash algorithm that is being used. I don't see any practical reason besides "use what I think is the the most perfect algorithm in the world" to keep this option. I believe that 99% of Sprockets users don't care which algorithm is being used and if it is for decrease the complexity of the library I'd not hesitate in dropping an configuration that is not used.

Fjan commented 8 years ago

Bike shedding is a term used to describe an argument that hinges on personal preference, it's really not a valid argument against using an objectively more optimal choice. I think that 99% of Sprockets users would prefer the more optimal algorithm given the choice.

Github uses Sprockets for subresource-intergrity, but if you do a "view source" on this very page you will see that most included assets (images) do not use it. Perhaps because their engineers also did not like adding 60 useless characters to every asset on the page. If you're a company the size of github with a dozen assets on every page these things add up.

rafaelfranca commented 8 years ago

I think that 99% of Sprockets users would prefer the more optimal algorithm given the choice.

Yeah, I want to see that this is why I'm not removing it from Sprockets 4 and waiting for Sprockets 5. If users want to use a different algorithm, they will let us know when we removed it. If they don't care, like I'm supposing, we will not have any complaint and it will be removed.

Github uses Sprockets for subresource-intergrity, but if you do a "view source" on this very page you will see that most included assets (images) do not use it. Perhaps because their engineers also did not like adding 60 useless characters to every asset on the page.

No, they are not using because subresource-integrity doesn't matter for images only for JavaScript and CSS assets that can lead to security attacks.

Also GitHub was the main motivation to this algorithm change since who introduced the change works there. So no, they don't care about 60 characters in every assets, if you look at the source code of this page you will see it is using SHA256 in the hash.

Fjan commented 8 years ago

But that's exactly my point: the collision hash and the integrity hash are being conflated. Github is using an integrity hash where it matter: on CSS and JS, but they are forced by sprockets to use it as a collision hash in the URL as well. That might be why they are not serving the majority of their assets through Sprockets (could be other reasons, of course). Even simply doing a chop(55) on every generated URL would be an improvement (but don't do that, not all hashing algos have a distribution of digits where it is safe to do that).

rafaelfranca commented 8 years ago

They chose to use SHA256 and I still don't see any reason to change it back to MD5 sorry.

Fjan commented 8 years ago

That github chose SHA256 for two of the assets on this page (and not the other ones) is a compelling reason to include it in Sprockets. It's not an especially good reason to make it the default option. But I completely fail to see how you then arrive at the conclusion that removing MD5 as an option is a good idea.

I'm not sure I can convince you. Please discuss this with your friend at github, or few other people trained in cryptography before making a decision on this. Crypto is hard.

schneems commented 8 years ago

the difference between integrity hashes and collision hashes

Can you ELI5 here? I typed in "integrity versus collision hashes" into google and it gave me nothing useful.

should aim for perfection

I do not think we should be shooting for perfection. We should be shooting for pragmatic, maintainable, sustainable. Sprockets is the titanic right now, we need to patch holes, not polish brass.

I appreciate your contribution of experiences, please lend me those.

In addition to this issue there are many other things that are being talked about to speed up production page service. We've talked about using zopfli or brotli to decrease gzip file size, this could save way more than a few bytes per request. The zopfli gem is slow, if you've got some C perf experience it would be a good fit.

We also need to focus on usability which has been ignored mostly. Better errors, better docs. See https://github.com/rails/sprockets/tree/master/guides there's not a lot there right now. We need faster asset compilation, right now it takes too freaking long. We need source map support for sprockets 4. We need a sane deprecation strategy for the way we currently handle memoizing internally so that we can refactor the environment variable from needing to be passed everywhere.

The maintainer of the repo quit, removed gem push access. We're playing catchup. There's a big pile of issues, and bugs coming in while we don't even know how the internals fully work. I'm not guaranteeing anything, but if the reason for removing MD5 hash support is maintainability then if you help make sprockets more maintainable then maybe it won't have to go. You can triage issues http://www.codetriage.com/rails/sprockets you can reproduce bug reports, or help out with any of the things i've already mentioned. You obviously care about this repo and the issue at hand, maybe we can work together.

Fjan commented 8 years ago

Hashes can have many different properties (look at wikipedia 'hash function' for a primer), but two are relevant for Sprockets:

For the first one you want a hash that is long and difficult to calculate. For the second one you want one that is short (so it does not hamper deflate) and has a low chance of collision (that is: the chance of two different resources having the same name). SHA256 is the better choice to detect tampering, as there are some (theoretical) attacks again MD5, but MD5 is more efficient when it comes to ensuring uniqueness.

Because a URL is already globally unique (the U in URL) Sprockets only needs to ensure an asset does not collide with a future version of itself, and you could argue that even MD5 is overkill for that (in Rails 1 the assets stamp was only 6 digits IIRC). In an ideal world Sprockets would be able to use SHA256 for the subresource integrity hash and use a more efficient hash to generate the filename.

I appreciate the work you guys are doing immensely, but as the CEO of a fast growing software company my time is severely limited, so I have little of substance to offer myself. However, I do encourage my employees to contribute, I’ll see if I can get one of them interested in this project (they can work on it in company time, but I want them to be intrinsically motivated).

woot4moo commented 8 years ago

in the event anyone reads this later, FIPS enforcement would cause a default of MD5 to break.

Fjan commented 8 years ago

As far as I know FIPS enforcement only applies to encryption, I don't think there is anything in the standard about asset versioning. Would the sequential numbering that Github uses for asset versioning of images automatically mean Github fails FIPS enforcement, for example?

woot4moo commented 8 years ago

@Fjan MD5 is a cryptographic hash which is banned under FIPS, you could swap to SHA256 or something else that is allowed.

Fjan commented 8 years ago

@woot4moo my point is that Sprockets uses MD5 for asset versioning, it does not use it for cryptography in any way so FIPS does not apply. I know there are scanners that simply look for something that looks like an MD5 hash and flag that because they cannot tell that it's not used for cryptography. You could chop off a few digits if you want to circumvent those false positives.

woot4moo commented 8 years ago

@fjan sure but that doesn't matter. Running even a skeleton app with sprockets on a fips enabled system results in an error at startup because it invokes MD5.

Fjan commented 8 years ago

Ok, in that case it would take a bit more work to prevent false positives, like vendoring a copy of MD5 in Sprockets, but that may not be worth the effort.

woot4moo commented 8 years ago

I would just make it a properties file change that can be set.

silva96 commented 7 years ago

Hi, changing the hash from md5 to sha256 will expire all assets right? I'm upgrading a production app to rails 5 and I see all assets have different fingerprints. This is a hassle because all our users that have received emails from the rails 4 app will see broken images in their email clients when we switch to rails 5, since old images (inserted in the email html) won't be available.

I understand that browsers have this ability to check integrity and that sha256 is needed to generate that hash that will be compared with. But using this also for the filename generates the problem I'm describing above.

Any thoughts?

schneems commented 7 years ago

Yes changing the hash would cause all assets to be re-generated. However you should be storing more than just the most recently generate assets, otherwise your email users will have the same problem when you rev your assets.

By default rake assets:clean keeps 3 older copies. If you need an asset to live longer than that, you might consider embedding styles and images directly into emails.

Fjan commented 7 years ago

@silva96 The typical way to solve this is by not adding a hash to assets linked in email. You can configure rails/sprockets to generate a second set of assets without the hash. (Since assets linked in email cannot be CSS or JS it's rarely an issue anyway).

For existing emails you have no other option that to let them break over time since the maintainers have made it clear they do not want to support MD5 as an option anymore (which I still think is a pity since I believe it's a better hash than sha256 for >99% of use cases).

schneems commented 7 years ago

@silva96 The typical way to solve this is by not adding a hash to assets linked in email. You can configure rails/sprockets to generate a second set of assets without the hash. (Since assets linked in email cannot be CSS or JS it's rarely an issue anyway).

I recommend against this, if the asset changes drastically you won't know how the new one will look. Also this prevents you from doing things like setting far futures expires headers on your entire assets folder. The default behavior should be fine for most people (preserving 3 copies of assets), it's the behavior you get on Heroku by default and in the last 5 years of doing advanced ruby support I don't remember the last issue it's caused.

Embedding your assets is one way to guarantee that they'll work and that they'll look exactly how you want. If you link to a non-extension asset, then you're not given the same guarantee. Unfortunately there's no good way to embed these things from vanilla rails/sprockets. Maybe there's a gem out there, or if not, there probably should be. Alternatively you could put assets on S3 manually as they're needed.

So my advice in this order:

1) Don't worry about it until it's a problem, make sure you're doing the right thing with expiring and removing assets from your machine. You can also increase the number of older assets you need to keep. 2) If it is an issue embed assets. 3) If neither of those options work for you, manually generate and store assets for emails in a separately managed location such as an S3 bucket. 4) I still wouldn't recommend adding non-hash assets. It's probably the easiest thing to do, but also will likely give you the most unexpected results.

Fjan commented 7 years ago

There several factors that can prompt the hash to change, such as images being recompressed, someone changing the asset version string, a change of asset pipeline, or perhaps another change of hashing algorithm. And although Sprockets does not erase the last three versions of assets, that does not mean you can rely on them still being there forever because rebuilding the server or VM will not regenerate the outdated assets. Given enough time one of these will happen and your old emails will appear broken.

So I would say manually versioning images in email is better than using the asset pipeline. Keep in mind that emails already inline CSS so the only way the email will change is if someone overwrites an image with a different one while keeping the same file name. The risk of that happening seems to me to be more acceptable than the brittleness introduced by adding a hash to an email that will probably change at some future date.

silva96 commented 7 years ago

For some reason I haven't dig into, the app is keeping only one version of images, while 3 versions of js/css

I cloned the hard drive of the production server into a brand new instance server, so OS software will be updated but I get all the releases folders (we are using capistrano), which should include the rails4-generated assets. After deploying to this new instances, images are replaced with the new (sha256-filenames) ones. Can be something related with the old manifest (rails 4.0.2 manifest) not being taken in account when deploying rails 5, though wiping everything and recreating them again?

What I'm afraid is this situation:

User receives an email generated by any of the rails 4 instances, minutes later we deploy the rails 5 release into the new instances, put them in the loadbalancer and remove the rails 4 instances from loadbalancer.

in the rails 5 instances capistrano will precompile all assets and for the (unknown) behaviour explained above, old images are deleted. That would result in a "very recently sent" email be broken for the user.

The approach of using image_url('image.png', digest: false) in the emails, at least for a while before the upgrade and after the upgrade doesn't sound bad for me.

aside from my specific problem/solution, I don't see the point of changing the filename to sha256, I clearly understand and agree on using sha256 on the integrity hash though.

schneems commented 7 years ago

emails already inline CSS

Where is this documented?

Fjan commented 7 years ago

Email has never support externally linked style sheets (so, no docs)

silva96 commented 7 years ago

It's not documented, is recommended. gmail for example ignores