parcel-bundler / parcel

The zero configuration build tool for the web. 📦🚀
https://parceljs.org
MIT License

🙋 Use MD5 of File Content As Name To Bust Caches #717

Closed benhutton closed 6 years ago

benhutton commented 6 years ago

This is something my team is interested in implementing, but we want to make sure we're not heading in the wrong direction...

🤔 Expected Behavior

In order to serve assets through a CDN, we need to use a hash of the file contents as part of the filename. This is the only way to ensure that the cache busts at the right time and only at the right time.

This issue has been discussed extensively here: https://github.com/parcel-bundler/parcel/issues/188. We have also discussed it a bit on Slack, starting here: https://parcel-bundler.slack.com/archives/C8QC15PNY/p1515609163000373

To be specific, I am proposing files be named ${contentHash}.${ext}
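A minimal sketch of the proposed naming scheme, assuming Node's built-in `crypto` module and MD5 as discussed in this issue. The function name `contentHashedName` is hypothetical and is not part of Parcel:

```typescript
import { createHash } from "node:crypto";

// Hypothetical helper: derive an output filename from the file's content.
// Identical content always yields the same name; any change yields a new one.
function contentHashedName(content: string | Uint8Array, ext: string): string {
  const contentHash = createHash("md5").update(content).digest("hex");
  return `${contentHash}.${ext}`;
}

const a = contentHashedName("body { color: red }", "css");
const b = contentHashedName("body { color: blue }", "css");
console.log(a !== b); // true: different content, different filename
```

Because the name depends only on the bytes, a CDN can cache such a file indefinitely: a new deployment either reuses the same URL (content unchanged) or produces a new one.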

😯 Current Behavior

Currently, the filename is a hash based on the name of the file (and some options like whether or not HMR is enabled, I believe). Something like ${fileNameHash}.${ext}

💁 Possible Solution

@shawwn proposes parts of the solution here: https://github.com/parcel-bundler/parcel/issues/188#issuecomment-353141569

@Munter describes how tree traversal needs to work to make the cache bust in all the right places here: https://github.com/parcel-bundler/parcel/issues/188#issuecomment-357072153

One key piece of the puzzle will be to only enable this behavior on build (ie, when building for production). Leave it off in the development modes, so you don't have to worry about

🔦 Context

We want to serve our assets (CSS, JS, images) through a CDN like CloudFlare or CloudFront. If nothing about the URL changes, then new versions of those assets will never make it to the end user. If too much changes, then we bust the cache too aggressively.

DeMoorJasper commented 6 years ago

Isn't this just a duplicate of #188?

benhutton commented 6 years ago

@DeMoorJasper this is #188 turned into a specific feature proposal, instead of a bug report.

benhutton commented 6 years ago

Well, I guess #188 is a feature request, not a bug report. What's different is that this is something I'm proposing to build, and looking for one last sanity check before starting.

devongovett commented 6 years ago

Can you describe in more detail how you plan to implement this? I'm specifically interested in how you would update references to the output files after they are renamed.

benhutton commented 6 years ago

@devongovett this is where I'm still a bit hazy and hoping y'all can offer me some pointers. I see it all as depending on how tree traversal happens at the packaging level, and I'm not really clear on how that happens.

Assuming tree traversal happens the right way (namely, like this: https://github.com/parcel-bundler/parcel/issues/188#issuecomment-357072153, hitting child nodes before parent nodes),

mqudsi commented 6 years ago

Tangential correction: don’t use md5, it’s too insecure for use as a cryptographic hash function and too slow for use as a general-purpose hash function.

XXHash is the most optimal choice for across-the-board performance, MetroHash for pure performance on modern CPUs with 64-bit support, and SipHash for a high-performance, cryptographically secure primitive.

TimNZ commented 6 years ago

Please implement this.
Otherwise cache busting management is a bit painful.

I think anyone questioning the need for this doesn't spend a lot of time making browser apps get served quickly and scalably, so that page load time is minimised after an asset update/deployment.

Much less of an issue with SPA, but not ready to be discarded as a concern yet.

@mqudsi why does the hashing have to be secure for a file name generator? Main consideration is collisions, and that's pretty unlikely with any hashing algo.
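For a rough sense of scale, the chance of at least one accidental collision among n files with b-bit hashes can be estimated with the birthday approximation p ≈ 1 − exp(−n(n−1)/2^(b+1)). The sketch below is an illustration only: the function name is hypothetical, and it assumes uniformly distributed hash output, so it says nothing about deliberately crafted collisions:

```typescript
// Birthday-bound estimate of the probability that at least two of
// `fileCount` files share the same `hashBits`-bit hash value.
function collisionProbability(fileCount: number, hashBits: number): number {
  const pairs = (fileCount * (fileCount - 1)) / 2;
  // -expm1(-x) computes 1 - exp(-x) without losing precision for tiny x.
  return -Math.expm1(-pairs / Math.pow(2, hashBits));
}

console.log(collisionProbability(10_000, 128)); // full MD5 digest: vanishingly small
console.log(collisionProbability(10_000, 32)); // hash truncated to 8 hex characters: about 1%
```

The practical takeaway is that collisions become a concern mainly when the hash is truncated to keep filenames short.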

benhutton commented 6 years ago

@mqudsi and others...

Parcel already uses md5 via https://github.com/parcel-bundler/parcel/blob/master/src/utils/md5.js. I would just continue to do the same, maintaining the status quo. Switching from md5 to another algorithm is certainly interesting, but probably out of scope for the discussion here; maybe propose that as a separate GitHub issue?

shunia commented 6 years ago

This is a pretty urgent feature which needs to be implemented ASAP. Or at least we need an official plugin to meet the same needs.

All the info needed has been discussed in the issue mentioned above, and in the detailed description by @benhutton.

As more and more people step into the Parcel world, there will be more and more caching needs out there. For a general-purpose bundler in the JavaScript ecosystem, content hashing is a must-have feature.

Please @benhutton and the parcel guys, fill this hole in!!

Because we are going for production, too. :)

benhutton commented 6 years ago

@devongovett any feedback on https://github.com/parcel-bundler/parcel/issues/717#issuecomment-362325836 ?

benhutton commented 6 years ago

@devongovett @shawwn here are some first naive attempts at stumbling towards a solution for this:

  1. We need to get packaging to happen in the right order. So move https://github.com/parcel-bundler/parcel/blob/master/src/Bundle.js#L102-L104 as far down as possible. Perhaps to either right before or right after the mapping packaging that happens in lines 115–117. Is this okay to do?

  2. Right after this._package() is called, IF we are building for production, do the following:

    1. Change the name of the bundle (so, this.name) to be based on the current value of hash, which I believe is calculated from the contents of all of the bundle's assets.
    2. Move the file that was just written to a new location based on the new name.
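The two sub-steps above might look roughly like the following. This is only a sketch under stated assumptions: `renameToContentHash` is a hypothetical helper, it hashes the bytes of the written file rather than Parcel's internal `hash` value, and it uses Node's built-in `fs`, `path`, `os`, and `crypto` modules rather than any Parcel API:

```typescript
import { createHash } from "node:crypto";
import { existsSync, mkdtempSync, readFileSync, renameSync, writeFileSync } from "node:fs";
import { tmpdir } from "node:os";
import { dirname, extname, join } from "node:path";

// Hypothetical post-packaging step: rename a written bundle after its content.
function renameToContentHash(bundlePath: string, production: boolean): string {
  if (!production) return bundlePath; // leave development builds untouched
  const hash = createHash("md5").update(readFileSync(bundlePath)).digest("hex");
  const newPath = join(dirname(bundlePath), hash + extname(bundlePath));
  renameSync(bundlePath, newPath); // move the file to its content-based name
  return newPath;
}

// Usage: write a throwaway bundle into a temp directory and rename it.
const dir = mkdtempSync(join(tmpdir(), "bundle-"));
const original = join(dir, "app.js");
writeFileSync(original, "console.log('hi')");
const renamed = renameToContentHash(original, true);
console.log(existsSync(original), existsSync(renamed)); // false true
```

Note that this covers only the rename itself; any file that references the bundle still needs to be updated with the new name.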

TimNZ commented 6 years ago

Guess I'll have to use Webpack for production builds.

DeMoorJasper commented 6 years ago

@benhutton sourcemap packaging should always come last; it needs the position of the code inside the other bundles to function.


Munter commented 6 years ago

@TimNZ When you read up on why webpack's contenthash is actually not deterministic and can't be used reliably for content addressability, you might reconsider that position. I've had multiple serious JavaScript deployment problems caused by wrong hashing in webpack.

shanebo commented 6 years ago

@DeMoorJasper @Munter, we all want to use Parcel. That's why we're here. But if Parcel doesn't have a solution for cache busting only the files that have changed, how can we use it in production? To ask a different way, how are the rest of you who are using it in production cache busting only the files that have changed?

TimNZ commented 6 years ago

Someone will say I'm entitled (I'm definitely not), but I am perplexed by the Parcel maintainers' apparent low interest in this, and by the limited community engagement.

Along with @shanebo and others I genuinely wonder what deployment strategies people are using for Parcel builds for cache busting/optimisation.

Maybe I'm the one overcomplicating things?

We could just deploy to subdirs for each build e.g. /v1, /v2 etc.

devongovett commented 6 years ago

It's not that we're not interested, we are! I just want to figure out exactly how to do this.

Currently, in my production apps, we just use the package.json version as part of the path, e.g. http://mycdn.com/v1.2.3/index.js. Yeah, it will end up invalidating more than needed, but it works for now until we figure out how to get better hashing.

Here are the questions I'd like to get answered:

  1. Do we hash all files or just some? For example, would you want index.html (the entrypoint) to become index.af137g.html? If not, then let's define the conditions where we hash and when we don't.
  2. Renaming of the files needs to happen at the end, after they are written since the hash is based on file contents. So the question is how to update references to files. For example, index.html might reference a JS file, but the hash for the JS file won't be known at the time the HTML file is generated. So we'll have to update those references somehow once all files have been hashed. How should we go about implementing that?

devongovett commented 6 years ago

(2) might involve re-parsing. Or maybe we could store some location information about the references somewhere, e.g. (offset, length) tuples. Then at the end some process could replace those locations in the string with the updated references.
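The (offset, length) idea can be sketched as follows. The `Reference` shape and the `applyReferences` function are hypothetical names for illustration, not Parcel internals, and the sketch assumes the recorded ranges do not overlap:

```typescript
// Hypothetical record of where a reference to another file sits in generated code.
interface Reference {
  offset: number; // index of the first character of the reference
  length: number; // number of characters the reference occupies
  replacement: string; // final (hashed) name to write in its place
}

// Apply replacements from the end of the string backwards, so that earlier
// offsets stay valid even when a replacement changes the string's length.
function applyReferences(code: string, refs: Reference[]): string {
  const ordered = [...refs].sort((x, y) => y.offset - x.offset);
  let result = code;
  for (const ref of ordered) {
    result =
      result.slice(0, ref.offset) +
      ref.replacement +
      result.slice(ref.offset + ref.length);
  }
  return result;
}

const html = '<script src="app.js"></script><link href="style.css">';
const updated = applyReferences(html, [
  { offset: html.indexOf("app.js"), length: "app.js".length, replacement: "app.1a2b3c.js" },
  { offset: html.indexOf("style.css"), length: "style.css".length, replacement: "style.4d5e6f.css" },
]);
console.log(updated);
// <script src="app.1a2b3c.js"></script><link href="style.4d5e6f.css">
```

Working backwards avoids having to re-parse or to adjust every later offset after each substitution.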

Munter commented 6 years ago

@devongovett Doesn't Parcel keep around the live ASTs in which it first found the original reference? That would give you a hook to know which AST/DOM/CSSOM node the reference to the file exists in.

About what files to rename and which not to, I wrote this comment in another issue, which outlines how Assetgraph goes about hashing files and determining which ones to avoid hashing: https://github.com/parcel-bundler/parcel/issues/280#issuecomment-353894742

In assetgraph-builder we also hash filenames, but exclude any files matching the following from renaming:

  • Any graph entry point (usually html)
  • Any asset linked to with an internal <a href>
  • Any asset linked to with a <meta http-equiv="refresh">
  • Service workers (must keep a consistent file name across builds)
  • humans.txt, robots.txt, .htaccess
  • Cache manifests, rss and atom feeds
  • favicon.ico

More rules might need to be added in the future, of course, but these have covered us well for a couple of years.
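As a rough illustration, the exclusion list above could be expressed as a single predicate. The `AssetInfo` shape and the function name are hypothetical and do not reflect assetgraph-builder's or Parcel's actual data model:

```typescript
// Hypothetical description of an asset, for illustration only.
interface AssetInfo {
  fileName: string; // base name, e.g. "robots.txt"
  isEntryPoint: boolean; // graph entry point (usually HTML)
  linkedFromAnchor: boolean; // target of an internal <a href>
  linkedFromMetaRefresh: boolean; // target of a <meta http-equiv="refresh">
  isServiceWorker: boolean;
  isFeedOrCacheManifest: boolean; // cache manifests, RSS and Atom feeds
}

// Files whose names are fixed by convention and must never change.
const FIXED_NAMES = new Set(["humans.txt", "robots.txt", ".htaccess", "favicon.ico"]);

// Returns true when the asset's filename may safely be replaced by a hash.
function shouldHashFileName(asset: AssetInfo): boolean {
  if (asset.isEntryPoint) return false;
  if (asset.linkedFromAnchor || asset.linkedFromMetaRefresh) return false;
  if (asset.isServiceWorker || asset.isFeedOrCacheManifest) return false;
  return !FIXED_NAMES.has(asset.fileName);
}

const base = {
  isEntryPoint: false,
  linkedFromAnchor: false,
  linkedFromMetaRefresh: false,
  isServiceWorker: false,
  isFeedOrCacheManifest: false,
};
console.log(shouldHashFileName({ ...base, fileName: "app.js" })); // true
console.log(shouldHashFileName({ ...base, fileName: "robots.txt" })); // false
console.log(shouldHashFileName({ ...base, fileName: "index.html", isEntryPoint: true })); // false
```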

TimNZ commented 6 years ago

Thanks for the quick reply @devongovett.

As @Munter just posted whilst I was writing this, for (2), the entry point file should never be renamed; we can easily control the naming of that.

At the end of the build, don't you have a tree of files and file references? Could you just tack on, after this step, a content comparison and generate a different filename if the content differs from the last hash, updating linked references? If it's not already being done, do you need to keep a global record of all files and references so you can update them post-build?

Does it need to be more complex than that? I don't mean that as a throwaway comment, as that change may have serious impact on current Parcel architecture.

devongovett commented 6 years ago

The ASTs don't stay around unfortunately, because the processing for each asset occurs in workers and it would be too slow to send the whole AST across the process boundary. Instead, we send the generated code for each asset (e.g. JS, HTML, etc) as a string, and a Packager object on the main process combines the code from each asset together into the final bundle.

If we sent a list of references with location info along with the generated code, we could replace them in the string as part of the packager. Perhaps we could use sourcemaps to get the mapped location in the code, however it's unclear how we'd do this for asset types that don't have sourcemaps like HTML/CSS.

Munter commented 6 years ago

CSS has sourcemaps, but html, svg, rss/atom feeds (generally all xml-ish files) are a real problem if you don't have the references alive.

Maybe the idea of relying on the original unique hashes and then doing simple string replacements on them to get the final file names is the way to go after all. Of course, it would still have to be done in the right order.

Is the dependency graph between files known at this time, or would it have to be re-created from the known unique hashes that are assigned at the beginning of the build?
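A sketch of the placeholder-replacement idea, done children-first. Everything here is an assumption made for illustration: the `Bundle` shape, the `__name__` placeholder format, and the choice to rediscover dependencies by scanning each bundle's content for known placeholders. It also assumes no placeholder is a substring of another and that the references contain no cycles:

```typescript
import { createHash } from "node:crypto";

// Hypothetical bundle: `placeholder` is the unique id assigned at the start of the build.
interface Bundle {
  placeholder: string;
  content: string;
  ext: string;
}

// Replace placeholders with content-hashed names, visiting children before parents
// so that a parent's hash already reflects the final names of what it references.
function finalizeNames(bundles: Bundle[]): Map<string, string> {
  const finalNames = new Map<string, string>(); // placeholder -> final name
  const visit = (bundle: Bundle, stack: Set<string>): string => {
    const done = finalNames.get(bundle.placeholder);
    if (done !== undefined) return done;
    if (stack.has(bundle.placeholder)) throw new Error("circular reference");
    stack.add(bundle.placeholder);
    for (const other of bundles) {
      if (other !== bundle && bundle.content.includes(other.placeholder)) {
        const name = visit(other, stack);
        bundle.content = bundle.content.split(other.placeholder).join(name);
      }
    }
    stack.delete(bundle.placeholder);
    const hash = createHash("md5").update(bundle.content).digest("hex");
    const finalName = `${hash}.${bundle.ext}`;
    finalNames.set(bundle.placeholder, finalName);
    return finalName;
  };
  for (const bundle of bundles) visit(bundle, new Set());
  return finalNames;
}

const bundles: Bundle[] = [
  { placeholder: "__html__", ext: "html", content: '<script src="__js__"></script>' },
  { placeholder: "__js__", ext: "js", content: 'const logo = "__png__";' },
  { placeholder: "__png__", ext: "png", content: "fake image bytes" },
];
const names = finalizeNames(bundles);
console.log(bundles[0].content); // the script tag now points at the hashed JS name
```

In this sketch a change to the image changes the JS bundle's name and, through it, the HTML content. A real implementation would likely skip renaming entry points and only use their rewritten content.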

benhutton commented 6 years ago

I've been experimenting with different strategies for doing this inside of Parcel's codebase. Here are a few that I'm playing around with but don't have completely figured out yet:

  1. At the packaging stage, change the order of packaging so that we essentially get the tree walk that we need. (Is this even possible?) Package up a file, hash it, rename it per that hash, then store what you did to rename it so you can edit other packages you come across that reference it.

  2. At the transition between the asset processing stage and the bundling stage. This idea came from @fathyb. Assets have been processed but not bundled together. You have a working asset tree. So instead of hashing the output file, hash the input file and all of its child dependencies. Rename the asset, and rename references within parent assets.
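The second strategy, hashing an input asset together with all of its child dependencies, could be sketched like this. `AssetNode` and `treeHash` are hypothetical names, the sketch assumes an acyclic asset tree, and it uses Node's built-in `crypto` module rather than Parcel's own utilities:

```typescript
import { createHash } from "node:crypto";

// Hypothetical processed asset with links to the assets it depends on.
interface AssetNode {
  path: string;
  source: string;
  dependencies: AssetNode[];
}

// Hash an asset's own source together with the hashes of everything it
// depends on, so a change anywhere below the asset changes its hash.
function treeHash(asset: AssetNode, cache = new Map<AssetNode, string>()): string {
  const cached = cache.get(asset);
  if (cached !== undefined) return cached;
  const hash = createHash("md5").update(asset.source);
  for (const dep of asset.dependencies) {
    hash.update(treeHash(dep, cache));
  }
  const digest = hash.digest("hex");
  cache.set(asset, digest);
  return digest;
}

const css: AssetNode = { path: "style.css", source: "body {}", dependencies: [] };
const js: AssetNode = { path: "app.js", source: "import './style.css'", dependencies: [css] };
console.log(treeHash(js)); // changes whenever app.js or style.css changes
```

The appeal of this variant is that no output file needs to be re-read or rewritten to compute the hash; the trade-off is that the name reflects the inputs rather than the final bundled bytes.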

I'm beginning to feel some of the difficulty that @devongovett is having figuring out what's the best way to do this. There isn't a nice clean way to do this readily jumping out. They're all messy.

benhutton commented 6 years ago

Here's another idea:

Right now, the packager returns a map of bundle hashes. What if, instead, it returned (or also returned) a tree of which files reference which other files? I feel like that should be relatively straightforward to do.

Then we could add in another stage that leverages that information to do the hashing, renaming, and rewriting. Let the packager determine the tree, and then build a new object that encapsulates everything we're talking about here that acts on what the packager returns.

DeMoorJasper commented 6 years ago

@benhutton for inspiration on how to implement a file tree, you could have a look at the treeshaking experiment PR #731. It keeps a list of all parent references.

benhutton commented 6 years ago

I think I'm getting really close here. Hope to have some code to show y'all Friday or Monday. I realized that what https://github.com/parcel-bundler/parcel/blob/master/src/utils/bundleReport.js is doing might be the key here.

benhutton commented 6 years ago

See #829 for a sneak peek at what I'm trying to do. It seems to be working and passes a few simple tests I throw at it.

pselden commented 6 years ago

Do people have a workaround right now until this lands? Just do a global replace post-build with your own cache busting?

Munter commented 6 years ago

@pselden If you "just" do that, you will very likely end up accidentally not busting the cache of a file that needs to be updated. There is a different issue dedicated to discussing and implementing the correct file renaming, which guarantees correct file naming for content-addressable files.

DeMoorJasper commented 6 years ago

@pselden you could create subfolders per version you push. This will, however, bust a lot of cache, especially if you push a lot of updates daily. We are discussing the final solution for this and other related naming issues in #872; feel free to append whatever you feel is missing there. This feature will get added once it reaches implementation stage.

fregante commented 6 years ago

HTML files should not have distant expiration dates, so they don't need cache busting and actually can't have it because they are the public URL: they can't change. (related: https://github.com/parcel-bundler/parcel/issues/280#issuecomment-359379338)

Kagami commented 6 years ago

Correct, index pages are always Cache-Control: no-cache. I recommend that everyone read this document; the caching rules are really simple: https://developers.google.com/web/fundamentals/performance/optimizing-content-efficiency/http-caching
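The caching split described here can be sketched as a small helper. This is an illustration only: `cacheControlFor` is a hypothetical name, the filename pattern assumes the ${contentHash}.${ext} MD5 naming proposed in this issue, and the one-year `max-age` is a common convention rather than anything prescribed in the thread:

```typescript
// Hypothetical helper choosing a Cache-Control header value per output file.
function cacheControlFor(fileName: string): string {
  // Assumption: content-hashed files are named <32 hex chars>.<ext>.
  const isContentHashed = /^[0-9a-f]{32}\.[a-z0-9]+$/i.test(fileName);
  return isContentHashed
    ? "public, max-age=31536000, immutable" // safe: the URL changes with the content
    : "no-cache"; // always revalidate entry points such as index.html
}

console.log(cacheControlFor("index.html")); // no-cache
console.log(cacheControlFor("5d41402abc4b2a76b9719d911017c592.js")); // public, max-age=31536000, immutable
```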

devongovett commented 6 years ago

Should be solved by #1025 which generates content-hashed filenames for static assets. Please help test using the master branch - a release will hopefully come next week!