sbt / sbt-assembly

Deploy über-JARs. Restart processes. (port of codahale/assembly-sbt)
MIT License

do not buffer entry contents in memory while doing MergeStrategy.deduplicate #520

Closed shuttie closed 8 months ago

shuttie commented 8 months ago

Context

Currently, MergeStrategy.deduplicate reads the full content of every candidate entry into a byte array (via scala.reflect.io.Streamable.bytes, as visible in the stack trace below) and compares those in-memory copies.

I maintain the metarank project, and with the current approach it takes ~6 GB of heap to build the assembly with sbt -mem 6000 assembly. With a smaller heap the build OOMs:

java.lang.OutOfMemoryError: Java heap space
        at scala.collection.mutable.ResizableArray.ensureSize(ResizableArray.scala:106)
        at scala.collection.mutable.ResizableArray.ensureSize$(ResizableArray.scala:96)
        at scala.collection.mutable.ArrayBuffer.ensureSize(ArrayBuffer.scala:49)
        at scala.collection.mutable.ArrayBuffer.$plus$eq(ArrayBuffer.scala:85)
        at scala.collection.mutable.ArrayBuffer.$plus$eq(ArrayBuffer.scala:49)
        at scala.collection.generic.Growable.$anonfun$$plus$plus$eq$1(Growable.scala:62)
        at scala.collection.generic.Growable$$Lambda$118/0x000000010021e840.apply(Unknown Source)
        at scala.collection.Iterator.foreach(Iterator.scala:943)
        at scala.collection.Iterator.foreach$(Iterator.scala:943)
        at scala.collection.AbstractIterator.foreach(Iterator.scala:1431)
        at scala.collection.generic.Growable.$plus$plus$eq(Growable.scala:62)
        at scala.collection.generic.Growable.$plus$plus$eq$(Growable.scala:53)
        at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:105)
        at scala.reflect.io.Streamable$Bytes.toByteArray(Streamable.scala:59)
        at scala.reflect.io.Streamable$Bytes.toByteArray$(Streamable.scala:56)
        at scala.reflect.io.Streamable$$anon$1.toByteArray(Streamable.scala:137)
        at scala.reflect.io.Streamable$.bytes(Streamable.scala:137)
        at sbtassembly.MergeStrategy$.$anonfun$deduplicate$3(MergeStrategy.scala:125)
        at sbtassembly.MergeStrategy$$$Lambda$6520/0x0000000101922040.apply(Unknown Source)
        at scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:286)
        at scala.collection.TraversableLike$$Lambda$140/0x000000010025d040.apply(Unknown Source)
        at scala.collection.Iterator.foreach(Iterator.scala:943)
        at scala.collection.Iterator.foreach$(Iterator.scala:943)
        at scala.collection.AbstractIterator.foreach(Iterator.scala:1431)
        at scala.collection.IterableLike.foreach(IterableLike.scala:74)
        at scala.collection.IterableLike.foreach$(IterableLike.scala:73)
        at scala.collection.AbstractIterable.foreach(Iterable.scala:56)
        at scala.collection.TraversableLike.map(TraversableLike.scala:286)
        at scala.collection.TraversableLike.map$(TraversableLike.scala:279)
        at scala.collection.AbstractTraversable.map(Traversable.scala:108)
        at sbtassembly.MergeStrategy$.$anonfun$deduplicate$1(MergeStrategy.scala:125)
        at sbtassembly.MergeStrategy$$$Lambda$5917/0x0000000100db0840.apply(Unknown Source)

But there are problems with the current implementation. Most notably, the ByteArrayInputStream re-wrapping happens for every JarEntry, effectively turning each FileInputStream into an in-memory ByteArrayInputStream, so the heap must be large enough to hold the entire unzipped assembly, all dependencies included.
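The memory cost described above can be illustrated with a minimal sketch (the names here are illustrative, not the actual sbt-assembly internals): every entry is drained into a byte array before any comparison happens, so heap usage grows with the total size of all entries.

```scala
import java.io.InputStream

// Sketch of the pre-PR behavior: every entry's stream is drained into a
// byte array up front, and deduplication compares those buffered arrays.
// Illustrative only; not the actual sbt-assembly code.
object BufferingDedup {
  // Drain the whole stream into memory, like scala.reflect.io.Streamable.bytes.
  def toBytes(in: InputStream): Array[Byte] = {
    val out = new java.io.ByteArrayOutputStream()
    val chunk = new Array[Byte](8192)
    var n = in.read(chunk)
    while (n != -1) { out.write(chunk, 0, n); n = in.read(chunk) }
    out.toByteArray
  }

  // Compare fully buffered contents: heap usage is O(sum of all entry sizes).
  def distinctContents(entries: Seq[() => InputStream]): Seq[Array[Byte]] = {
    val all = entries.map(open => toBytes(open())) // every entry in heap at once
    all.map(_.toSeq).distinct.map(_.toArray)       // structural equality on bytes
  }
}
```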

Proposed solution

In this PR we propose making deduplication work without caching all entry contents on the heap.
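One way to avoid buffering, sketched below under the assumption that entries are compared by a content digest (the PR's actual implementation may differ in details), is to stream each entry through a fixed-size buffer and keep only the digest:

```scala
import java.io.InputStream
import java.security.MessageDigest

// Sketch of streaming deduplication: compare entries by a SHA-256 digest
// computed with a fixed 8 KB buffer, so per-entry heap usage is constant
// instead of proportional to the entry size. Illustrative only.
object StreamingDedup {
  def sha256(in: InputStream): String = {
    val md = MessageDigest.getInstance("SHA-256")
    val buf = new Array[Byte](8192)
    var n = in.read(buf)
    while (n != -1) { md.update(buf, 0, n); n = in.read(buf) }
    md.digest().map("%02x".format(_)).mkString
  }

  // Keep one representative per distinct content digest.
  def distinctByDigest[A](entries: Seq[A])(open: A => InputStream): Seq[A] = {
    val seen = scala.collection.mutable.HashSet.empty[String]
    entries.filter { e =>
      val in = open(e)
      try seen.add(sha256(in)) finally in.close()
    }
  }
}
```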

Risks

The main risk of this approach is that all assembly content must be read twice: first while deduplicating, and again while writing the final assembly JAR.

However, since the first read leaves a file in the OS page cache, re-reading it is cheap, so we expect this drawback not to noticeably affect overall sbt-assembly performance.
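Reading twice implies that entries must be re-openable. A common way to model this (an assumption for illustration, not necessarily the PR's actual API) is to carry a stream factory rather than a live stream, so each pass gets a fresh InputStream:

```scala
import java.io.InputStream

// Sketch: model each entry as a factory of fresh InputStreams, so it can be
// read once during deduplication and again when writing the final JAR.
// The Entry name is illustrative; the PR's representation may differ.
final case class Entry(name: String, open: () => InputStream)

object TwoPass {
  private def length(in: InputStream): Int = {
    var total = 0
    val buf = new Array[Byte](8192)
    var n = in.read(buf)
    while (n != -1) { total += n; n = in.read(buf) }
    total
  }

  // Pass 1 could be deduplication, pass 2 the copy into the output JAR;
  // both obtain an independent stream from the factory.
  def readTwice(e: Entry): (Int, Int) = {
    val first = { val in = e.open(); try length(in) finally in.close() }
    val second = { val in = e.open(); try length(in) finally in.close() }
    (first, second)
  }
}
```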

Benchmarks

Before:

$ sbt -mem 6000
[info] started sbt server
sbt:metarank> assembly
...
[info] 156 file(s) merged using strategy 'Rename' (Run the task at debug level to see the details)
[info] 632 file(s) merged using strategy 'Discard' (Run the task at debug level to see the details)
[info] 15 file(s) merged using strategy 'First' (Run the task at debug level to see the details)
[info] 2 file(s) merged using strategy 'Concat' (Run the task at debug level to see the details)
[info] 21 file(s) merged using strategy 'FilterDistinctLines' (Run the task at debug level to see the details)
[info] 10486 file(s) merged using strategy 'Deduplicate' (Run the task at debug level to see the details)
[info] Built: /home/shutty/private/code/metarank/target/scala-2.13/metarank.jar
[info] Jar hash: c64a43ce76f78e0734eda88de7daffe53cc51dc8
[success] Total time: 94 s (01:34), completed Mar 11, 2024, 2:58:46 PM

$ sha256sum /home/shutty/private/code/metarank/target/scala-2.13/metarank.jar
2fd43f4fcbecf3ef1c8dbe78392ad0c1cd3c6c505d65e70ff33f2277f6a64ee9  /home/shutty/private/code/metarank/target/scala-2.13/metarank.jar

After:

$ sbt -mem 1000
[info] started sbt server
sbt:metarank> assembly
...
[info] 156 file(s) merged using strategy 'Rename' (Run the task at debug level to see the details)
[info] 632 file(s) merged using strategy 'Discard' (Run the task at debug level to see the details)
[info] 15 file(s) merged using strategy 'First' (Run the task at debug level to see the details)
[info] 2 file(s) merged using strategy 'Concat' (Run the task at debug level to see the details)
[info] 21 file(s) merged using strategy 'FilterDistinctLines' (Run the task at debug level to see the details)
[info] 10486 file(s) merged using strategy 'Deduplicate' (Run the task at debug level to see the details)
[info] Built: /home/shutty/private/code/metarank/target/scala-2.13/metarank.jar
[info] Jar hash: c64a43ce76f78e0734eda88de7daffe53cc51dc8
[success] Total time: 70 s (01:10), completed Mar 11, 2024, 3:23:37 PM

$ sha256sum /home/shutty/private/code/metarank/target/scala-2.13/metarank.jar
2fd43f4fcbecf3ef1c8dbe78392ad0c1cd3c6c505d65e70ff33f2277f6a64ee9  /home/shutty/private/code/metarank/target/scala-2.13/metarank.jar

shuttie commented 8 months ago

@eed3si9n what's your opinion on this PR? It changes the default behavior a bit (in some cases it reads source streams twice), but the memory savings for large assemblies are quite big.

eed3si9n commented 8 months ago

Worth trying out? If it causes issues for some people we can always back out.