Closed: shuttie closed this pull request 8 months ago
@eed3si9n what's your opinion on this PR? It changes the default behavior a bit (in some cases it reads source streams twice), but the memory savings for large assemblies are quite big.
Worth trying out? If it causes issues for some people we can always back it out.
Context
Currently `MergeStrategy.deduplicate` works roughly as sketched below: every conflicting entry is buffered into heap before it is hashed and merged. I'm maintaining the metarank project, and with the current approach it requires ~6 GB of heap to build the assembly with `sbt -mem 6000 assembly`. With lower heap sizes the build OOMs.
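For reference, here is a minimal, self-contained sketch of what the buffered code path does today. The `Dependency` and `JarEntry` definitions below are simplified stand-ins for sbt-assembly's internal types, and the merge helper is an approximation reconstructed from the fragments quoted in the list that follows, not the actual plugin source.

```scala
import java.io.{ ByteArrayInputStream, InputStream }
import java.security.MessageDigest

// Simplified stand-ins for sbt-assembly's internal types (not the real definitions).
final case class Dependency(target: String, stream: () => InputStream)
final case class JarEntry(target: String, stream: () => InputStream)

object CurrentDeduplicate {
  // Roughly what the current code path does: every conflicting entry is fully
  // buffered into a byte[] before hashing, and the merged result is re-wrapped
  // as an in-memory ByteArrayInputStream.
  def merge(conflicts: Vector[Dependency]): Either[String, JarEntry] = {
    // Buffers *all* duplicate contents on the heap, even for huge entries.
    val conflictContents: Vector[Array[Byte]] =
      conflicts.map(_.stream()).map(in => try in.readAllBytes() finally in.close())
    val distinctHashes =
      conflictContents.map(bytes => MessageDigest.getInstance("SHA-1").digest(bytes).toSeq).distinct
    if (distinctHashes.size == 1)
      // Even though the dependency is already stream-backed (usually a file),
      // the content is served from a heap buffer from here on.
      Right(JarEntry(conflicts.head.target, () => new ByteArrayInputStream(conflictContents.head)))
    else
      Left(s"${conflicts.head.target} has ${distinctHashes.size} conflicting contents")
  }
}
```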
But there are some problems with the current implementation:

- `conflictContents = conflicts.map(_.stream()).map(Streamable.bytes(_))` buffers the content of every duplicate entry in a heap buffer. Some dependencies, like ONNXRuntime, bundle a huge ~500 MB binary inside the jar, and buffering it in RAM is not a great approach.
- `JarEntry(conflicts.head.target, () => new ByteArrayInputStream(conflictContents.head))` takes the byte buffer with the contents and wraps it as an `InputStream`. Given that a `Dependency` is technically already a stream (usually file-based), there is no need to re-wrap it into a `ByteArrayInputStream`; the dependency's `.stream` can be reused as-is.
- Even worse, the `ByteArrayInputStream` re-wrapping happens for all `JarEntry` items, effectively turning every `FileInputStream` into a `ByteArrayInputStream`, so you need enough heap to buffer the whole unzipped assembly with all of its dependencies.

Proposed solution
In this PR we suggest making deduplication work without caching all entry content in heap (see the sketch after this list):

- `sbtassembly.Assembly.sha1Content` operates on an `InputStream` rather than on a `byte[]`
- `JarEntry` items reuse the dependency's original stream (usually a `FileInputStream`) instead of re-wrapping it as a heap-buffered `ByteArrayInputStream`
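As an illustration of the streaming change, here is a minimal sketch of SHA-1 hashing over an `InputStream` in fixed-size chunks. The `StreamingSha1` object and the exact signature are illustrative assumptions, not the actual PR diff; the point is that the digest is updated as bytes flow through, so heap usage stays constant regardless of entry size.

```scala
import java.io.{ BufferedInputStream, InputStream }
import java.security.{ DigestInputStream, MessageDigest }

object StreamingSha1 {
  // Hashes a stream in fixed-size chunks instead of materializing a byte[].
  def sha1Content(makeStream: () => InputStream): String = {
    val digest = MessageDigest.getInstance("SHA-1")
    val in     = new DigestInputStream(new BufferedInputStream(makeStream()), digest)
    try {
      val buf = new Array[Byte](64 * 1024)
      while (in.read(buf) != -1) () // reading through the stream updates the digest
    } finally in.close()
    digest.digest().map("%02x".format(_)).mkString
  }
}
```

Deduplication can then compare these digests directly, and the surviving entry keeps its original file-backed stream, so the content is only read from disk again when the final assembly jar is written.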
Risks
The main risk of this approach is that we need to read all assembly content twice: the first time while deduplicating, and later again while writing the final assembly jar.
However, since the first read pulls the file into the OS file cache, the second read is mostly served from memory, so we assume this drawback won't noticeably affect overall sbt-assembly performance.
Benchmarks
Before:
After: