sbt / sbt-assembly

Deploy über-JARs. Restart processes. (port of codahale/assembly-sbt)
MIT License

do not buffer entry contents in memory while doing MergeStrategy.deduplicate #520

Closed shuttie closed 8 months ago

shuttie commented 8 months ago

Context

Currently, MergeStrategy.deduplicate reads the full content of every candidate entry into a byte array (via scala.reflect.io.Streamable.bytes, as visible in the stack trace below) and compares those in-memory copies.

I maintain the metarank project, and with the current approach it takes ~6 GB of heap to build the assembly with sbt -mem 6000 assembly. With a smaller heap the build OOMs:

java.lang.OutOfMemoryError: Java heap space
        at scala.collection.mutable.ResizableArray.ensureSize(ResizableArray.scala:106)
        at scala.collection.mutable.ResizableArray.ensureSize$(ResizableArray.scala:96)
        at scala.collection.mutable.ArrayBuffer.ensureSize(ArrayBuffer.scala:49)
        at scala.collection.mutable.ArrayBuffer.$plus$eq(ArrayBuffer.scala:85)
        at scala.collection.mutable.ArrayBuffer.$plus$eq(ArrayBuffer.scala:49)
        at scala.collection.generic.Growable.$anonfun$$plus$plus$eq$1(Growable.scala:62)
        at scala.collection.generic.Growable$$Lambda$118/0x000000010021e840.apply(Unknown Source)
        at scala.collection.Iterator.foreach(Iterator.scala:943)
        at scala.collection.Iterator.foreach$(Iterator.scala:943)
        at scala.collection.AbstractIterator.foreach(Iterator.scala:1431)
        at scala.collection.generic.Growable.$plus$plus$eq(Growable.scala:62)
        at scala.collection.generic.Growable.$plus$plus$eq$(Growable.scala:53)
        at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:105)
        at scala.reflect.io.Streamable$Bytes.toByteArray(Streamable.scala:59)
        at scala.reflect.io.Streamable$Bytes.toByteArray$(Streamable.scala:56)
        at scala.reflect.io.Streamable$$anon$1.toByteArray(Streamable.scala:137)
        at scala.reflect.io.Streamable$.bytes(Streamable.scala:137)
        at sbtassembly.MergeStrategy$.$anonfun$deduplicate$3(MergeStrategy.scala:125)
        at sbtassembly.MergeStrategy$$$Lambda$6520/0x0000000101922040.apply(Unknown Source)
        at scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:286)
        at scala.collection.TraversableLike$$Lambda$140/0x000000010025d040.apply(Unknown Source)
        at scala.collection.Iterator.foreach(Iterator.scala:943)
        at scala.collection.Iterator.foreach$(Iterator.scala:943)
        at scala.collection.AbstractIterator.foreach(Iterator.scala:1431)
        at scala.collection.IterableLike.foreach(IterableLike.scala:74)
        at scala.collection.IterableLike.foreach$(IterableLike.scala:73)
        at scala.collection.AbstractIterable.foreach(Iterable.scala:56)
        at scala.collection.TraversableLike.map(TraversableLike.scala:286)
        at scala.collection.TraversableLike.map$(TraversableLike.scala:279)
        at scala.collection.AbstractTraversable.map(Traversable.scala:108)
        at sbtassembly.MergeStrategy$.$anonfun$deduplicate$1(MergeStrategy.scala:125)
        at sbtassembly.MergeStrategy$$$Lambda$5917/0x0000000100db0840.apply(Unknown Source)

But there are problems with the current implementation. Most notably, the ByteArrayInputStream re-wrapping happens for every JarEntry, effectively turning each FileInputStream into an in-memory ByteArrayInputStream, so the heap must be large enough to hold the entire unzipped assembly, all dependencies included.
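The memory cost described above can be illustrated with a minimal sketch (the names here are illustrative, not the actual sbt-assembly internals): every entry is drained into a byte array before any comparison happens, so heap usage grows with the total size of all entries.

```scala
import java.io.InputStream

// Sketch of the pre-PR behavior: every entry's stream is drained into a
// byte array up front, and deduplication compares those buffered arrays.
// Illustrative only; not the actual sbt-assembly code.
object BufferingDedup {
  // Drain the whole stream into memory, like scala.reflect.io.Streamable.bytes.
  def toBytes(in: InputStream): Array[Byte] = {
    val out = new java.io.ByteArrayOutputStream()
    val chunk = new Array[Byte](8192)
    var n = in.read(chunk)
    while (n != -1) { out.write(chunk, 0, n); n = in.read(chunk) }
    out.toByteArray
  }

  // Compare fully buffered contents: heap usage is O(sum of all entry sizes).
  def distinctContents(entries: Seq[() => InputStream]): Seq[Array[Byte]] = {
    val all = entries.map(open => toBytes(open())) // every entry in heap at once
    all.map(_.toSeq).distinct.map(_.toArray)       // structural equality on bytes
  }
}
```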

Proposed solution

In this PR we propose making deduplication work without caching all entry contents on the heap.
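One way to avoid buffering, sketched below under the assumption that entries are compared by a content digest (the PR's actual implementation may differ in details), is to stream each entry through a fixed-size buffer and keep only the digest:

```scala
import java.io.InputStream
import java.security.MessageDigest

// Sketch of streaming deduplication: compare entries by a SHA-256 digest
// computed with a fixed 8 KB buffer, so per-entry heap usage is constant
// instead of proportional to the entry size. Illustrative only.
object StreamingDedup {
  def sha256(in: InputStream): String = {
    val md = MessageDigest.getInstance("SHA-256")
    val buf = new Array[Byte](8192)
    var n = in.read(buf)
    while (n != -1) { md.update(buf, 0, n); n = in.read(buf) }
    md.digest().map("%02x".format(_)).mkString
  }

  // Keep one representative per distinct content digest.
  def distinctByDigest[A](entries: Seq[A])(open: A => InputStream): Seq[A] = {
    val seen = scala.collection.mutable.HashSet.empty[String]
    entries.filter { e =>
      val in = open(e)
      try seen.add(sha256(in)) finally in.close()
    }
  }
}
```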

Risks

The main risk of this approach is that all assembly content must be read twice: first while deduplicating, and again while writing the final assembly JAR.

However, since the first read leaves a file in the OS page cache, re-reading it is cheap, so we expect this drawback not to noticeably affect overall sbt-assembly performance.
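Reading twice implies that entries must be re-openable. A common way to model this (an assumption for illustration, not necessarily the PR's actual API) is to carry a stream factory rather than a live stream, so each pass gets a fresh InputStream:

```scala
import java.io.InputStream

// Sketch: model each entry as a factory of fresh InputStreams, so it can be
// read once during deduplication and again when writing the final JAR.
// The Entry name is illustrative; the PR's representation may differ.
final case class Entry(name: String, open: () => InputStream)

object TwoPass {
  private def length(in: InputStream): Int = {
    var total = 0
    val buf = new Array[Byte](8192)
    var n = in.read(buf)
    while (n != -1) { total += n; n = in.read(buf) }
    total
  }

  // Pass 1 could be deduplication, pass 2 the copy into the output JAR;
  // both obtain an independent stream from the factory.
  def readTwice(e: Entry): (Int, Int) = {
    val first = { val in = e.open(); try length(in) finally in.close() }
    val second = { val in = e.open(); try length(in) finally in.close() }
    (first, second)
  }
}
```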

Benchmarks

Before:

$ sbt -mem 6000
[info] started sbt server
sbt:metarank> assembly
...
[info] 156 file(s) merged using strategy 'Rename' (Run the task at debug level to see the details)
[info] 632 file(s) merged using strategy 'Discard' (Run the task at debug level to see the details)
[info] 15 file(s) merged using strategy 'First' (Run the task at debug level to see the details)
[info] 2 file(s) merged using strategy 'Concat' (Run the task at debug level to see the details)
[info] 21 file(s) merged using strategy 'FilterDistinctLines' (Run the task at debug level to see the details)
[info] 10486 file(s) merged using strategy 'Deduplicate' (Run the task at debug level to see the details)
[info] Built: /home/shutty/private/code/metarank/target/scala-2.13/metarank.jar
[info] Jar hash: c64a43ce76f78e0734eda88de7daffe53cc51dc8
[success] Total time: 94 s (01:34), completed Mar 11, 2024, 2:58:46 PM

$ sha256sum /home/shutty/private/code/metarank/target/scala-2.13/metarank.jar
2fd43f4fcbecf3ef1c8dbe78392ad0c1cd3c6c505d65e70ff33f2277f6a64ee9  /home/shutty/private/code/metarank/target/scala-2.13/metarank.jar

After:

$ sbt -mem 1000
[info] started sbt server
sbt:metarank> assembly
...
[info] 156 file(s) merged using strategy 'Rename' (Run the task at debug level to see the details)
[info] 632 file(s) merged using strategy 'Discard' (Run the task at debug level to see the details)
[info] 15 file(s) merged using strategy 'First' (Run the task at debug level to see the details)
[info] 2 file(s) merged using strategy 'Concat' (Run the task at debug level to see the details)
[info] 21 file(s) merged using strategy 'FilterDistinctLines' (Run the task at debug level to see the details)
[info] 10486 file(s) merged using strategy 'Deduplicate' (Run the task at debug level to see the details)
[info] Built: /home/shutty/private/code/metarank/target/scala-2.13/metarank.jar
[info] Jar hash: c64a43ce76f78e0734eda88de7daffe53cc51dc8
[success] Total time: 70 s (01:10), completed Mar 11, 2024, 3:23:37 PM

$ sha256sum /home/shutty/private/code/metarank/target/scala-2.13/metarank.jar
2fd43f4fcbecf3ef1c8dbe78392ad0c1cd3c6c505d65e70ff33f2277f6a64ee9  /home/shutty/private/code/metarank/target/scala-2.13/metarank.jar

shuttie commented 8 months ago

@eed3si9n what's your opinion on this PR? It changes the default behavior a bit (in some cases it reads source streams twice), but the memory savings for large assemblies are quite big.

eed3si9n commented 8 months ago

Worth trying out? If it causes issues for some people we can always back out.