sbt / zinc

Scala incremental compiler library, used by sbt and other build tools
Apache License 2.0

New analysis format #1326

Closed: szeiger closed this 6 months ago

szeiger commented 8 months ago

A new implementation of Zinc's incremental state serialization.

Benchmark data based on scala-library + reflect + compiler:

| Format | Write time | Read time | File size |
|---|---|---|---|
| sbt Text | 1002 ms | 791 ms | ~7102 kB |
| sbt Binary | 654 ms | 277 ms | ~6182 kB |
| ConsistentBinary | 157 ms | 100 ms | 3097 kB |
| ConsistentBinary (unsorted) | 79 ms | | ~3796 kB |

This PR makes the new format available via the new ConsistentFileAnalysisStore. It does not replace the existing formats (but it should; it's a better choice for almost every use case).
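To make the opt-in concrete, here is a sketch of how a build tool might construct the new store. The factory-method shape mirrors the existing `FileAnalysisStore` API; the exact `ConsistentFileAnalysisStore` signatures are an assumption and may differ from what this PR merges.

```scala
// Hypothetical usage sketch: opting in to the new consistent binary format.
// The factory-method shape follows the existing FileAnalysisStore API;
// the exact ConsistentFileAnalysisStore signature is an assumption.
import java.io.File
import sbt.internal.inc.consistent.ConsistentFileAnalysisStore
import xsbti.compile.AnalysisStore
import xsbti.compile.analysis.ReadWriteMappers

val file = new File("target/inc_compile.zip")
val store: AnalysisStore =
  ConsistentFileAnalysisStore.binary(file, ReadWriteMappers.getEmptyMappers)

// Reading and writing then work like any other AnalysisStore:
// store.get() returns the stored analysis, if present
// store.set(contents) writes it in the new consistent binary format
```

Because the store implements the standard `AnalysisStore` interface, callers can switch formats without touching the rest of their incremental-compilation setup.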

We have been using iterations of this format internally over the last few months for the Bazel builds (with our own Zinc-based tooling) of our two main monorepos, totaling about 27,000 Scala (and Java/mixed) targets that range in size from a few LOC to almost 1 million LOC.

szeiger commented 8 months ago

The 3 large binary files clearly don't belong in this repo, but I don't know what the right place for them is. They are used by the integration tests and the benchmarks. In our internal repo we keep such files in S3 instead of checking them into git.

Friendseeker commented 8 months ago

> The 3 large binary files clearly don't belong into this repo but I don't know what's the right place for them. They are used by the integration tests and the benchmarks. In our internal repo we keep such files in S3 instead of checking them into git.

Maybe we can store them via Git LFS?

I have never used Git LFS myself, so ideally someone with prior experience can judge whether this is actually a good idea.

szeiger commented 8 months ago

I don't understand why the benchmarks are failing in CI. I added the new class to the existing benchmark project so it should be found. I was not able to reproduce the problem locally. The benchmark commands from CI run just fine on my machine.

szeiger commented 8 months ago

Oh, it's the "benchmark against develop branch" step that fails. This looks like the usual undercompilation bug in sbt-jmh: it needs an explicit clean when you remove a benchmark, and switching to the develop branch after building this PR triggers it. The problem should go away after merging, but explicitly cleaning the jmh project would be a better fix.

He-Pin commented 8 months ago

How about https://github.com/HebiRobotics/QuickBuffers or https://github.com/google/flatbuffers for the format? They seem fast too.

lihaoyi-databricks commented 6 months ago

bump @eed3si9n, are there any blockers to merging this? We've been using it internally with good success, and upstreaming it would provide a lot of value to the broader community, making it available to sbt, Mill, and other build tools.

eed3si9n commented 6 months ago

> bump @eed3si9n, are there any blockers from merging this? We've been using this internally with good success, and expect it would provide a lot of value to the broader community to have this upstreamed for SBT and Mill and other build tools to use

Overall I am on board with a more stable / hermetic analysis serializer. My main concern was where the speedup was coming from (`ExecutionContext.global`) and whether it would end up competing for CPU time in non-Bazel usage. Now that the `ExecutionContext` is a parameter, hopefully we can tune how many threads we assign to this versus other tasks.
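With the `ExecutionContext` exposed as a parameter, a caller can bound the serializer's parallelism explicitly instead of sharing `ExecutionContext.global`. This is a generic, self-contained sketch of that pattern, not zinc-specific code; the pool size and names are illustrative.

```scala
import java.util.concurrent.Executors
import scala.concurrent.ExecutionContext

// A bounded pool keeps serialization from competing with the rest of the
// build for every core. The size (2) is an arbitrary example value.
val serializerPool = Executors.newFixedThreadPool(2)
val serializerEc: ExecutionContext =
  ExecutionContext.fromExecutor(serializerPool)

// Pass serializerEc wherever the analysis store accepts an ExecutionContext,
// in place of the default ExecutionContext.global.

// Shut the pool down when the build tool exits:
// serializerPool.shutdown()
```

Isolating the pool this way also makes the serializer's CPU usage observable and tunable independently of the compiler's own task scheduling.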

lrytz commented 5 months ago

@eed3si9n would you have time to do a 1.10.0-M4 release that includes this PR?

eed3si9n commented 5 months ago

Yea. I'll try to get something out this weekend.

eed3si9n commented 5 months ago

https://github.com/sbt/zinc/releases/tag/v1.10.0-RC1 is on its way to Maven Central.