sbt / zinc

Scala incremental compiler library, used by sbt and other build tools
Apache License 2.0

Make classpath hashing more lightweight #433

Closed fommil closed 6 years ago

fommil commented 6 years ago

It would be good to have the option to bypass the calculation of a file hash code if the size has not changed and the timestamp has not changed (just the timestamp should be ok). This has huge performance potential... avoiding doing work at all.

Possibly there could be some trigger that the file timestamp must be older than an hour or something... to avoid user interaction problems.

NIO provides a fast mechanism for collecting bulk metadata; it should be possible to traverse a directory and get all of this information very quickly before deciding what needs to be rehashed.
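
As a rough sketch of the kind of traversal meant here (a hypothetical helper, not zinc code), `java.nio.file` can collect size and modification time for a whole tree in one pass:

```scala
import java.nio.file.{FileVisitResult, Files, Path, SimpleFileVisitor}
import java.nio.file.attribute.BasicFileAttributes
import scala.collection.mutable

// Hypothetical helper: gather (size, lastModified) for every regular file
// under `root` in one traversal, so we can later decide which files
// actually need to be rehashed.
def collectMetadata(root: Path): Map[Path, (Long, Long)] = {
  val acc = mutable.Map.empty[Path, (Long, Long)]
  Files.walkFileTree(root, new SimpleFileVisitor[Path] {
    override def visitFile(file: Path, attrs: BasicFileAttributes): FileVisitResult = {
      acc(file) = (attrs.size(), attrs.lastModifiedTime().toMillis)
      FileVisitResult.CONTINUE
    }
  })
  acc.toMap
}
```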

fommil commented 6 years ago

See the linked proof of concept, which basically removes the bottleneck entirely. This should really be implemented in the zinc layer, but I have no way of doing that and testing it in a live sbt.

jvican commented 6 years ago

It would be good to have the option to bypass the calculation of a file hash code if the size has not changed and the timestamp has not changed (just the timestamp should be ok). This has huge performance potential... avoiding doing work at all.

We can use the checksums already present in the jars to detect whether something has changed (all zip files have checksums in their headers). In fact, it's not clear why we compute a hash of the jars instead of reusing the checksum in the header.
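
For illustration, a minimal sketch (not zinc's current code) of how the per-entry CRC-32 values already stored in a jar could be folded into a cheap fingerprint:

```scala
import java.util.zip.ZipFile
import scala.collection.JavaConverters._

// Sketch: combine the per-entry CRC-32 values that are already stored in
// the jar's central directory, so no entry contents need to be decompressed.
def jarFingerprint(path: String): Long = {
  val zip = new ZipFile(path)
  try zip.entries().asScala.foldLeft(17L)((acc, e) => acc * 31 + e.getCrc)
  finally zip.close()
}
```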

fommil commented 6 years ago

@jvican ha! That's a pretty good idea. Although I still think file size / timestamp is a pretty damn good low bar for everything. But having custom hashes sounds like an excellent idea, especially if the file already has a checksum!

fommil commented 6 years ago

For completeness, I raised PRs against launcher and sbt to enable monkey patching so I can experiment. Until they are working and merged, I'm afraid I can't help.

https://github.com/sbt/launcher/pull/43

https://github.com/sbt/sbt/pull/3643

jrudolph commented 6 years ago

I second the idea of using less safe but faster comparison options. I'm not sure about zip file signatures, though: AFAIK zip files don't contain a hash over all of the content, only a CRC-32 per entry, which means you would have to read (probably significantly) into the file to collect all of those checksums (though that would probably still be faster than reading all of the data). Getting just file system metadata should still be faster.

fommil commented 6 years ago

Clear reproduction in https://github.com/cakesolutions/sbt-cake/tree/sbt-perf-regression

Type `sbt cpl` to compile all test phases.

jvican commented 6 years ago

Getting just file system metadata should still be faster.

This is not viable. System metadata is really unreliable, and I would not like the incremental compiler to misbehave because of it.

I think that reading checksums from the zip file could be done in a fast way with some tweaks.

jrudolph commented 6 years ago

This is not viable. System metadata is really unreliable, and I would not like the incremental compiler to misbehave because of it.

Hmm, relying on modification times, sizes, and other stat information is also how git and rsync work in the default cases. If it is reliable enough for those kinds of more critical infrastructure, wouldn't it be for zinc as well?

The incremental compiler is an optimization that will often be used in an interactive system. In that case, the system metadata should just be assumed to be reliable enough to be used. If it is not (for whatever reason) it should be the user who decides to switch to an even more reliable mechanism.

fommil commented 6 years ago

system metadata like timestamps and file size? I have never seen that to be incorrect. Given that we're talking about external dependencies, it would be a fair assumption that they never change.

I know for a fact that scalac can't handle the case where a jar file changes... the whole classpath needs to be reloaded and I doubt zinc is even considering that.

The real hash can look inside the file, but the cache of the hash (which is what we're talking about here) is perfectly fine using file metadata.

jvican commented 6 years ago

Hmm, relying on modification times, sizes, and other stat information is also how git and rsync work in the default cases. If it is reliable enough for those kinds of more critical infrastructure, wouldn't it be for zinc as well?

I don't have any data to show you because I've rarely had problems with it, but I can point out two facts:

  1. The Zinc compiler hashes source files to detect changes because there were problems in the past with just checking the file metadata. If that happened with source files, it will also happen with jars. Mark Harrah documented it somewhere.

  2. Pants and Bazel exist in part because you cannot trust the filesystem's metadata -- I recall reading that they had some issues with it. As a result, they hash everything.

The incremental compiler is an optimization that will often be used in an interactive system.

The incremental compiler is the default way of using the Scala compiler; I doubt users would tell you they consider it an optimization. Zinc is just another compiler, and as such it's better if it behaves as reproducibly as possible.

I know for a fact that scalac can't handle the case where a jar file changes... the whole classpath needs to be reloaded and I doubt zinc is even considering that.

That has to do with the classpath handling of scalac; it's not related to this ticket. I also believe it has been improved in the latest release, IIRC.

The real hash can look inside the file, but the cache of the hash (which is what we're talking about here) is perfectly fine using file metadata.

Which cache of the hash? We're discussing what Zinc should do to check if the classpath has changed, and for that it currently uses hashing of the contents. There's no cache of the hash...

I believe that the CRCs in the jar file and plain hashing (like we're doing right now) are the only two options here. Remember that the reason it's so slow is that the hashing is repeated transitively for all projects, not that hashing itself is inherently slow.

If we go with the hashing approach, the effect should be minimal. For example, xxHash can do 5.4 GB/s, so even a gigantic 1 GB classpath works out to under 200 ms of pure hashing, and no more than roughly 250 ms with some margin for IO and latency.
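
As a back-of-the-envelope illustration (assuming the zero-allocation-hashing library for xxHash; this is not zinc's actual implementation), plain content hashing of a jar might look like:

```scala
import java.nio.file.{Files, Paths}
import net.openhft.hashing.LongHashFunction

// Illustrative only: hash the full contents of a jar with xxHash.
// At ~5.4 GB/s of raw hashing throughput, a 1 GB classpath is roughly
// 185 ms of hashing; IO and latency dominate, not the hash function.
def hashJar(path: String): Long = {
  val bytes = Files.readAllBytes(Paths.get(path))
  LongHashFunction.xx().hashBytes(bytes)
}
```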

fommil commented 6 years ago

the resolution to this ticket is to add a cache around the hash, using filesystem metadata to do so. See my proof of concept against sbt-io.

This is proposed as a 1.0.x binary compatible solution, so changing the actual hashing algorithm is out of scope and considered elsewhere.
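
A minimal sketch of the proposed cache shape, with hypothetical names (see the sbt-io proof of concept for the real thing): the content hash is only recomputed when a file's (lastModified, size) pair changes.

```scala
import java.io.File
import scala.collection.concurrent.TrieMap

// Hypothetical sketch: `hashContents` stands in for whatever content hash
// zinc already computes; it only runs when (lastModified, size) changes.
object HashCache {
  private final case class Stamp(lastModified: Long, size: Long)
  private val cache = TrieMap.empty[File, (Stamp, Long)]

  def cachedHash(file: File, hashContents: File => Long): Long = {
    val stamp = Stamp(file.lastModified(), file.length())
    cache.get(file) match {
      case Some((`stamp`, hash)) => hash   // metadata unchanged: reuse
      case _ =>
        val hash = hashContents(file)      // new or changed: rehash
        cache.put(file, (stamp, hash))
        hash
    }
  }
}
```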

jrudolph commented 6 years ago

the resolution to this ticket is to add a cache around the hash, using filesystem metadata to do so. See my proof of concept against sbt-io.

Exactly, so maybe this was just a misunderstanding? The idea was not to remove hashes completely but to avoid recalculation when files have not changed.

jvican commented 6 years ago

No, I really meant it: using filesystem metadata to figure out if a file has changed or not is a bad idea.

I'm all for removing the duplication in the transitive jar hashing that happens in multi-project builds, but I don't think that the caching should do more than just that.

fommil commented 6 years ago

@jvican I'd need to see some solid evidence of the metadata being unreliable before I could agree with you. Removing transitives would be an even bigger exposure... what if I went and swapped out scala-library.jar or (more likely, and the reason why I suspect Eugene added jar caching in the first place) some dependent jar that is part of my build? Your approach wouldn't notice this.

Timestamps can miss changes because their resolution can be as coarse as one second (and the OS could lie and retain the original timestamp). File sizes can also lie (a single-character change can leave the size identical). But the combination of both means the probability of missing a change is so low that it isn't worth considering, and a workaround is always to restart sbt or zinc. Given that scalac currently can't even handle changing jars, I personally think the whole concept of hashing jars is moot.

jvican commented 6 years ago

Related ticket that explains why source file hashing is used, even though there's not much evidence: https://github.com/sbt/sbt/issues/685. In essence, the precision of lastModified is poor.

I've had a private chat with @fommil and it looks like we've converged to use the filesystem metadata for jars, given that the jars mostly come from the ivy or coursier cache and not from the build (via packageBin).

leonardehrenfried commented 6 years ago

I would also like to say that I am in favour of accepting a very rare case of undercompilation (which can be resolved by a clean) rather than paying a constant overhead to catch a rare edge case.

jrudolph commented 6 years ago

I've had a private chat with @fommil and it looks like we've converged to use the filesystem metadata for jars, given that the jars mostly come from the ivy or coursier cache and not from the build (via packageBin).

Sounds good.

Timestamps can miss changes because their resolution can be as coarse as one second (and the OS could lie and retain the original timestamp). File sizes can also lie (a single-character change can leave the size identical). But the combination of both means the probability of missing a change is so low that it isn't worth considering, and a workaround is always to restart sbt or zinc.

Yes, that's a good point. The description of git's modification detection logic linked earlier also talks about that: git additionally uses the last modification time to decide whether a file might have changed in the same second the hash was computed.

stuhood commented 6 years ago

It is possible to have both correctness and performance for this case if file watching is used to invalidate cache entries. sbt already has facilities to do this, AFAIK (as do all other consumers of zinc), so exposing a callback-based invalidation API would allow callers to bring their own invalidation. I believe @romanowski opened a PR for such a thing...?

Assuming that invalidation API exists and is used by the relevant consumers, having a default/fallback implementation that uses filesystem metadata would be ok.
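
For illustration, one possible shape of such an API (the names here are hypothetical, not an existing zinc interface):

```scala
import java.io.File

// Hypothetical API shape: zinc keeps classpath entry hashes until a
// consumer explicitly invalidates them.
trait ClasspathHashCache {
  def cachedHash(jar: File): Long   // returns the cached hash, computing it on a miss
  def invalidate(jar: File): Unit   // called by the build tool's file watcher on change
}
```

A consumer with a file watcher would call `invalidate` from its change events, with a filesystem-metadata check as the default/fallback for callers that don't watch files.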