File hash does not change after its content updates

h2jorm commented 6 years ago

This is a 🙋 feature request.

🤔 Expected Behavior

An output file hash should change when its content updates.

😯 Current Behavior

File hash does not change.

🔦 Context

<!-- index.html -->
<html>
<body>
  <script src="./main.js"></script>
</body>
</html>

// main.js
console.log(1);

When build, 'b695675d84099f097ec37d68c8c83fce.js' generates.

parcel build --no-cache --no-minify index.html

And then, change main.js

// main.js
console.log(2);

Build again.

parcel build --no-cache --no-minify index.html

The javascript file name is still 'b695675d84099f097ec37d68c8c83fce.js'. I am not sure it is the expected behavior or not. However, when I using webpack, the output file hash will change every time its content updates.

🌍 Your Environment

Software	Version(s)
Parcel	v1.1.0
Node	v8.9.1
npm/Yarn	yarn v1.3.2
Operating System	macOS High Sierra 10.13.1

DeMoorJasper commented 6 years ago

As far as i know this is expected behaviour. What does change on file change is the hash contained in the fragment that is used for bundling.

{
  "dependencies":[],
  "generated": {
    "js":"\"use strict\";\n\nalert('helo');"
  },
  "hash": "d99518b2c556df9c6c4d8a2e9bd72423" // <-- This changes on change
}

Being generated here in Asset.js

generateHash() {
  return objectHash(this.generated);
}

The filename hash is however based on following parameters (so build, minify and serve files should have different cache names):

const OPTION_KEYS = ['publicURL', 'minify', 'hmr'];

sudhakar commented 6 years ago

@davidnagli Could you reopen this issue please. IMHO, its okie to not change hash during development build. But for production build, if hash doesnt change even when the content changes, then Cache-Control & ETag headers can not be effectively used.

In my case, I add react.js, react-dom.js etc on to a separate bundle vendor.js, which rarely changes. So I set it to cache for 1yr. If I happen to added one or two more libs, I wouldnt be able to bust the cahce as hash never changes as browser thinks that "I already has this file & no need to ask the server again" :(

devongovett commented 6 years ago

the filenames are not currently generated based on the hash of contents. you could do something like versioning, e.g. http://mycdn.com/v1.0.0/somelib.js. when you publish a new version, the url will change.

sudhakar commented 6 years ago

yes @devongovett thats a good idea. But it would be extra configuration to achieve cache busting. I am happy with current setup for now. But IMO, its nice to have it in the core, so users get cache busting for free!

jouni-kantola commented 6 years ago

Why close an issue that clearly adds value if it would be fixed, @davidnagli?

From my point of view, hashes should always be based on content. Long-term caching should be content based not release based.

devongovett commented 6 years ago

one other reason this will be hard to change is that the filenames need to be generated before the entire contents of the file is available, since assets are processed in parallel.

davidnagli commented 6 years ago

Sorry for closing the issue incorrectly. It was my understanding that this was Parcels expected behavior.

@devongovett So are we going to overhaul the caching system?

eXon commented 6 years ago

Maybe it would be easier to add the hash in the url at the query level instead of the filename. That way you can easily cache my-app.js?hash-here without having to change them for real. It's the best world of both.

chee commented 6 years ago

What about creating a random string (build time, perhaps) and using it in the filename?

like

filename = `${hash}.${buildstamp}.${ext}`

With a buildstamp of Date().now().toString(36) you'd get a filename like:

d710beaad39d4ee3906c24983931b45b.jb47tk6c.js

You would get a cachebust file with every build, and the contents of the file would not need to be known before the filename.

(the buildstamp would be the same for every file in that build) (entry would not get stamped)

DeMoorJasper commented 6 years ago

If this is about releasing code/cache issues with users, why not just append the version number found in package.json to the hash or before the extension? This would need to get implemented in parcel, not in an after-building script or whathever

jouni-kantola commented 6 years ago

This is a typical scenario I think should be supported. I'm addressing this from a web performance/UX perspective rather than DX.

I have 1/n vendor bundles with filenames including a hash
I have n code splitted bundles with application code, styles, etc
I fix a bug in an application module, and release
Client only needs to download that specifically code splitted bundle where I fixed the bug

A scenario like this could save the client from re-downloading hundreds of KiB's. If the whole release would be versioned everything would be cache busted.

chee commented 6 years ago

@DeMoorJasper i'm sure there are many people using npm for managing their codebase who aren't bumping their version number every time they make a change, because they aren't publishing it as a module.

think a continuous deployment setup where there are several people merging pull requests into a main branch that's being built on the server and sent down the tubes.

They'd, as i would, want the cache to be bust by build rather than by the version number (which may not have changed).

When code-splitting, it'd be great for a built file's name to be the same as last time unless a dependency changed, so the same thing never needs to be redownloaded if it isn't changing.

DeMoorJasper commented 6 years ago

@chee was just an idea totally forgot about browser cache and web performance, wondering how this would be implemented. Now i'm leaning more towards your timestamp approach

vforvalerio87 commented 6 years ago

This should be really fixed imho; I just implemented parcel in a project and everytime I make any change to js or css I have to manually add a progressive number and change the reference in the html, otherwise when I deploy to production (which has browser caching and a CDN) the server won't give me the updated version of those files. In my opinion the best approach would be the content checksum approach.

shawwn commented 6 years ago

Since this problem only impacts production builds, one solution would be to stick a random query parameter (or current timestamp) in the bundled HTML.

E.g.

  <script
src="/dist/4bf9825be5009102663282d9e776881e.js?192832984"></script>

We should try to avoid content hashing if possible. Although it seems like the correct solution, it's currently possible to generate the bundled name (/dist/${hash}.js) with no performance penalty at all, and without needing to access the contents of the original JS file. It's based solely on filename.

This is important, because Asset generation happens in a subprocess. In that child process, Assets don't have access to the contents of any other asset.

An example would be helpful. StylusAsset calls addURLDependency like so:

    // Setup a handler for the URL function so we add dependencies for
linked assets.
    style.define('url', node => {
      let filename = this.addURLDependency(node.val, node.filename);
      return new stylus.nodes.Literal(`url(${JSON.stringify(filename)})`);
    });

addURLDependency returns the bundled filename, e.g. /dist/${hash}.png.

If we were to change our hashes to require the contents of the other asset, not just its filename, then this would break our parallelism. Every asset would need to either get the contents of all its referenced assets, or query the parent process for the correct hash of a filename. Either way, that would require waiting on data. Right now we don't need to do that.

We should maintain the parallel nature of Parcel as much as possible. And part of that parallelism is the fact that each asset never needs to examine the contents of any other asset for any reason.

Therefore, if we could stick random query parameters into the bundled HTML, that seems like the best cache busting solution.

On Wed, Dec 20, 2017 at 11:44 AM, Valerio Versace notifications@github.com wrote:

This should be really fixed imho; I just implemented parcel in a project and everytime I make any change to js or css I have to manually add a progressive number and change the reference in the html, otherwise when I deploy to production (which has browser caching and a CDN) the server won't give me the updated version of those files. In my opinion the best approach would be the content checksum approach.

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/parcel-bundler/parcel/issues/188#issuecomment-353133048, or mute the thread https://github.com/notifications/unsubscribe-auth/AADo8FzEIAEah-ICYrgCf8wFC4m6H2Ksks5tCUdqgaJpZM4Q8aQh .

shawwn commented 6 years ago

When code-splitting, it'd be great for a built file's name to be the same as last time unless a dependency changed, so the same thing never needs to be redownloaded if it isn't changing.

Good point. The query parameter should be the UTC mtime of the source asset file, not random. That will preserve caching.

One way to do this without breaking asset-level parallelism is to modify HTMLPackager and CSSPackager to scan for bundled URLs (/dist/${hash}.${ext}) and substitute in the query parameter (/dist/${hash}.${ext}?${mtime}).

That can happen as a post-processing step, after bundling. It should be possible to do this efficiently.

That will preserve all the previous advantages, like the fact that assets can generate bundled filenames without needing the contents of the other assets or querying the parent process.

vforvalerio87 commented 6 years ago

The mtime approach is undesirable though because it busts cache for every asset every time. I'd rather stick to manually renaming assets in that case because I don't want every user to re-download everything again every time I deploy a website (possibly numerous times per day)

shawwn commented 6 years ago

Yes, mtime was a dumb idea. It wouldn't work for buildservers, for example, or if you re-cloned a repository.

It occurs to me that just before packaging, all of the assets' content hashes are known. (asset.hash)

So we can do the same thing that my previous comment outlined, but put ${asset.hash} into the query parameter rather than mtime.

In other words, the correct time to handle this is during packaging, rather than during asset generation. Packaging happens in the main process, so we have access to the content hashes.

The only problem is how to correctly re-link the dependencies in generated source code (e.g. how to change occurrences of "/dist/${hash}.js" to "/dist/${hash}.js?${asset.hash}" without causing problems or analyzing the generated code) but a simple string substitution might work in all cases.

On Wed, Dec 20, 2017 at 12:54 PM, Valerio Versace notifications@github.com wrote:

The mtime approach is undesirable though because it busts cache for every asset every time though. I'd rather stick to manually renaming assets in that case because I don't want every user to re-download everything again every time I deploy a website (possibly numerous times per day)

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/parcel-bundler/parcel/issues/188#issuecomment-353150863, or mute the thread https://github.com/notifications/unsubscribe-auth/AADo8H0P4jFCYU_9YFOs6Y8vlvY_rbR3ks5tCVfTgaJpZM4Q8aQh .

jouni-kantola commented 6 years ago

To be honest, I'd much rather have a slow build where hashing was content based, than having users re-downloading assets when they shouldn't need to. What about a flag to the CLI?

Munter commented 6 years ago

Here are some learnings from Assetgraph, where we solved the same issue.

You absolutely want to do content hashing to you can achieve deterministic content addressable file names that lend them selves well to far future cache expiry. Random build-specific hash busts the cache too often. Query parameters aren't always treated correctly by proxies in between the server and client.

You do however not need to do content hashing a lot of times. You can get away with doing them once, at the point where you know you are done making source code modifications and are ready to write out to disc.

The hash renaming must be done in a depth first post-order graph traversal to ensure content hashes updating all the way up to the entry points when deeply nested dependencies update. Any other traversal algorithm will result in caching errors

chee commented 6 years ago

Query parameters aren't always treated correctly by proxies in between the server and client.

Some proxy software classifies anything with a query string as dynamic content, and so does not cache it at all. This is, for instance, Squid's default behaviour.

benhutton commented 6 years ago

@devongovett what are you thinking about this one? Is asset fingerprinting something that you agree should be baked into Parcel's core? Is it something we should try to figure out how to add the correct hooks to write a plugin for? Is it something I should try to find some other way to write a post-processor to accomplish?

shanebo commented 6 years ago

@devongovett, me and @benhutton are willing to invest some time and energy into this fingerprinting issue but we don't want to head in an implementation direction you aren't a fan of. Would you be open to putting some thought into this with us so we can try and work on a PR solution, or plugin?

shawwn commented 6 years ago

Hey Shane, sure thing. Hop on our slack and ping me: https://slack.parceljs.org/ (I'm @shawwn)

On Tue, Jan 9, 2018 at 3:34 PM, Shane Thacker notifications@github.com wrote:

@devongovett https://github.com/devongovett, me and @benhutton https://github.com/benhutton are willing to invest some time and energy into this fingerprinting issue but we don't want to head in an implementation direction you aren't a fan of. Would you be open to putting some thought into this with us so we can try and work on a PR solution, or plugin?

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/parcel-bundler/parcel/issues/188#issuecomment-356421411, or mute the thread https://github.com/notifications/unsubscribe-auth/AADo8K5fHqo6nk3E3xvfPRO6eb5v8RjVks5tI9tqgaJpZM4Q8aQh .

devongovett commented 6 years ago

@shanebo @benhutton I think this will be very difficult to achieve in the current parallel architecture. We can't know the content hash until all assets have been processed, but we need to know the final bundled URL during asset processing so URLs can be placed in the right places (e.g. CSS files linking to images, HTML files linking to CSS, etc.).

If you have suggestions for ways around this, or alternative hashing/versioning strategies, let me know!

chee commented 6 years ago

@devongovett what if the files were generated as template files, with placeholders for the paths

<script src="{{main.js}}"></script>

then those are compiled afterwards in a separate operation?

(so the hash of the file would be the content of the file with the template markers in it)

the difficulty might be if a codesplit dependency changes, but the parent does not. that the parent would still need cachebust even though nothing is different.

Munter commented 6 years ago

@chee oh no, please don't invent conventions that take the source code away from working in a browser :'(

Munter commented 6 years ago

@devongovett

We can't know the content hash until all assets have been processed, but we need to know the final bundled URL during asset processing so URLs can be placed in the right places (e.g. CSS files linking to images, HTML files linking to CSS, etc.).

Why do you need to know the final URL before all assets are processed? What if you just used the current names as a temporary name. You could actually keep it like this in the development environment, since content addressable URL's are more of a production feature.

When you do a production build you could just rename the files once more and update the references to them. Or does the build pipeline somehow lose the references to an asset in the middle of the pipeline?

benhutton commented 6 years ago

Likely what needs to happen, either as a plugin or as part of the parcel core, is some sort of post-processor that does the depth first post order traversal mentioned at the bottom of https://github.com/parcel-bundler/parcel/issues/188#issuecomment-353896130

We assemble the full and final graph and do a quick walk through it, renaming files and then re-referencing them further up.

It should be relatively fast, but I agree that it only needs to happen in production. It could easily be hidden behind some sort of flag on the executable.

As to why we care about this particular strategy so much: it's the only thing that seems to work reliably with CDNs. (That we know of! If someone else has a better solution, I'm all ears.)

Query strings are non-ideal, as mentioned here: https://github.com/parcel-bundler/parcel/issues/188#issuecomment-353896130. They also don't play nice with image transformation SaaS like imgix. But even if we did go with a query string, it would have the same post-processing concerns that we have here.
Versioning the whole directory is both error-prone and leads to overly-aggressive invalidation
Etags would be the remaining option, but that takes the CDN approach off the table, removing our assets from the edge locations and potentially bogging down our servers with a ton of unnecessary asset queries.

@shawwn, @shanebo and I will try hitting you up on slack later today to talk through this more.

chee commented 6 years ago

@Munter i'm talking about doing this as part of the compile step, not as something the developer would have to do, and it would go away.

Munter commented 6 years ago

@benhutton I just jumped on your slack as well (@munter). Feel free to ping me if you need any feedback on how we implemented this in assetgraph. I don't know if the models are close enough to each other to be able to do the same, but it feels close from inspecting the sources here

DeMoorJasper commented 6 years ago

@benhutton what about just keeping the current naming-system (for initial and development naming) but renaming all references at the end of a production build to the content hash, it's sort of the same as what @chee and u suggested but i'm pretty sure it'll be way easier to implement

benhutton commented 6 years ago

@DeMoorJasper I think that maybe we're talking about the same thing? Only change things for production, and do it at the end.

I don't think there is any way around doing a tree traversal, though. That is, I think that this algorithm will NOT work:

Find the md5 hash of every file.
Rename those files to include the hash.
Go back and edit the files to include references to the new file names with the hashes.

Instead, we need to do the tree traversal that @Munter described.

Find your graph.
Find a leaf node.
Hash, rename.
Update references to that file.
Walk the tree back up, repeating for each file.

The idea is that when any given node changes, all the nodes above it will end up changing too as the references trickle up. And any nodes that are NOT affected will NOT change. So you are busting exactly the right caches at the right time.

Here's the big principle: A file doesn't get edited after it gets hashed. The hash is of the FINAL content of that file.

@DeMoorJasper does that make sense? @Munter am I describing the algorithm you had in mind accurately?

Munter commented 6 years ago

@benhutton That is the exact right algorithm and the correct reason you describe.

This image always helps me visualise it best: Traversal order: A C E D B H I G F

It's still important to start at your entry point(s) and just remember to put the hashing logic after child traversal. This is the what we do in AssetGraph: https://github.com/assetgraph/assetgraph/blob/master/lib/AssetGraph.js#L445-L462

When you extend Parcel with multiple entry points you probably want to keep track of seen assets to avoid double work as well

fritx commented 6 years ago

Is there any workaround for now?

augnustin commented 6 years ago

I completely :+1: the MD5 hash naming strategy and I'm glad this was the final pick. Parcel is a lot beautiful because it is plug and play, and it needs to remain this way!

Looking forward to seeing this available in production. Any idea if this would be within few months or much more than that?

Cheers

augnustin commented 6 years ago

I'd like to mention that IMHO this issue is top priority:

For now, my deploys are completely random. I try many things before I can serve the last version of my assets. Among those:

assets rebuild
rm -rf public/* && assets rebuild
service nginx restart

still I get unpredictable results...

This makes it unusable in real production context.

augnustin commented 6 years ago

Ok I fixed my issue by doing rm -rf .cache. This might be another issue but I'm reporting here in case someone faces the same situation. I'll create the other one when I have more predictable results to share.

devongovett commented 6 years ago

Should be solved by #1025 which generates content-hashed filenames for static assets. Please help test using the master branch - a release will hopefully come next week!

augnustin commented 6 years ago

Wonderful! Great reactivity.

Definitely willing to test it as soon as it is released. If you can here or in #1025 this would be perfect.

Cheers

parcel-bundler / parcel