I'm on it!
ArrayBuffer.ensureSize deals with offsets and lengths while ArrayBuilder.ensureSize takes a size: Int parameter directly. Although the current version quietly ignores overflowed/negative size parameter values, as size approaches Int.MaxValue, ArrayBuilder's naive size-doubling policy can cause newsize to overflow and trap ensureSize in a long, possibly infinite loop.
See this scastie for a demonstration of the infinite loop (you might not want to run it): https://scastie.scala-lang.org/wKOS1p2mTlaCZE9MwdNg7w
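For intuition, here's a minimal standalone sketch of the doubling arithmetic, decoupled from the real ensureSize (the step cap is added here only so the sketch terminates):

    // Mirrors the Int doubling in ensureSize. With size = Int.MaxValue:
    // 16 doubles up to 1073741824 (2^30), then overflows to -2147483648,
    // then to 0, and 0 * 2 == 0 keeps the real loop spinning forever.
    def simulateDoubling(size: Int, capacity: Int): Int = {
      var newsize = if (capacity == 0) 16 else capacity * 2
      var steps = 0
      while (newsize < size && steps < 40) { // the cap is absent in the real code
        newsize *= 2
        steps += 1
      }
      newsize
    }

    simulateDoubling(Int.MaxValue, 0) // returns 0: stuck forever without the cap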
If all we want to do is protect newsize from overflow, it's a 3-line change from:
    protected[this] final def ensureSize(size: Int /* <- overflow impossible. negative sign? */ ): Unit = {
      if (capacity < size || capacity == 0) { /* ignore if negative */
        var newsize = if (capacity == 0) 16 else capacity * 2
        while (newsize < size) newsize *= 2 /* <- this can overflow continually in an infinite loop. */
        resize(newsize)
      }
    }
to:
    protected[this] final def ensureSize(size: Int): Unit = {
      if (capacity < size || capacity == 0) {
        var newsize: Long = if (capacity == 0) 16L else capacity * 2L
        while (newsize < size) newsize *= 2L
        resize(Math.min(newsize, Int.MaxValue).toInt)
      }
    }
See this scastie for a demonstration that the overflow, and the infinite loop, are averted: https://scastie.scala-lang.org/YdRvtUhdTIuAOhbDPtDEbA
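And a REPL-sized check of the patched arithmetic, isolated from capacity and resize:

    // The Long-based doubling from the patch above, in isolation:
    var newsize: Long = 16L
    while (newsize < Int.MaxValue) newsize *= 2L
    // newsize is now 2147483648L (2^31); no wraparound, the loop terminated.
    Math.min(newsize, Int.MaxValue).toInt // clamps to 2147483647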
As a consequence of this proposed revision, executions that attempt to add more elements to an already full ArrayBuilder will incur an ArrayIndexOutOfBoundsException.
Should this revision follow ArrayBuffer's example and throw new Exception(s"Collections can not have more than ${Int.MaxValue} elements") when a full ArrayBuilder receives calls to addOne or addAll, or should we defer to the more cryptic ArrayIndexOutOfBoundsException?
@scala/collections crew may have some insights
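For reference, one hedged sketch of what the ArrayBuffer-style guard could look like if bolted onto the patched ensureSize (placement and wording are hypothetical; only the message text is quoted from ArrayBuffer):

    protected[this] final def ensureSize(size: Int): Unit = {
      // A negative size means the caller's Int arithmetic already overflowed:
      if (size < 0)
        throw new Exception(s"Collections can not have more than ${Int.MaxValue} elements")
      if (capacity < size || capacity == 0) {
        var newsize: Long = if (capacity == 0) 16L else capacity * 2L
        while (newsize < size) newsize *= 2L
        resize(Math.min(newsize, Int.MaxValue).toInt)
      }
    }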
From a broader perspective, maybe the risk of newsize overflow causes less trouble than the way that, on every resize() invocation, ArrayBuilder allocates a fresh array of roughly double the capacity and copies every element already written into it.
What's more, when these builders get reused, they never relinquish memory allocated to them in prior uses. Rather, on each call to clear(), ArrayBuilder[AnyRef] writes null into all of the previously used Array slots, which costs about as much as allocating a new Array, negating some of the performance improvements implied by reuse. ArrayBuilder[T <: AnyVal] just leaves dirty data in the Array, presumably for speed.
However, reusing an ArrayBuilder in this way resembles a memory leak: although each reuse may increase its memory footprint, nothing ever decreases it. In the worst case, this means each reused ArrayBuilder[T] hoards roughly 9-18 GB of heap space even if it only reached maxSize once over the course of any number of reuses.
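A sketch of the reuse pattern in question (sizes are illustrative; the point is that clear() resets the write index but keeps the backing array):

    import scala.collection.mutable.ArrayBuilder

    // One builder reused across jobs of very different sizes:
    val b = ArrayBuilder.make[Int]

    def build(n: Int): Array[Int] = {
      b.clear()                        // resets size; the backing array is retained
      var i = 0
      while (i < n) { b += i; i += 1 }
      b.result()
    }

    build(1000000) // capacity doubles up to 1048576; result() returns a copy
    build(16)      // the 1048576-element backing array is still held inside b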
Instead of merely guarding against the newsize overflow risk, I'd like to propose an alternative implementation of ArrayBuilder which, on resize() invocations, allocates only a fresh chunk instead of copying everything into a larger array, and which also doesn't act like a memory leak when reused to build Arrays of diverse sizes.
This approach largely makes newsize overflow concerns moot, but necessitates a potentially more disruptive pull request.
What kind of resistance might such a revision face from the community? Should I just push the subtle change for now, or pursue the more aggressive revision directly?
ArrayBuilder's general resizing policy hasn't changed since 2009, but prior to that, did any experiments pursue more aggressive performance goals?
A less naive approach to this infrastructure-critical feature has the potential to boost application performance across the entire Scala ecosystem. Is there any reason why we shouldn't try for further optimization?
is there any reason we can't just directly call the ArrayBuffer resizing code? we ought to be able to tweak access if needed
I see a few reasons why we shouldn't share code between ArrayBuffer and ArrayBuilder, but can summarize most of them by saying: they do different things. Buffer supports edits, shrinking, and removals while Builder can only grow monotonically.
If you look into the details, though, you'll notice that ArrayBuffer.ensureSize() lives in a companion object and takes the underlying array as a parameter along with end: Int and length: Int, while ArrayBuilder.ensureSize() is an instance method that takes only a single size: Int parameter.
Though a measure of consolidation between Buffer and Builder is possible, it seems unnecessarily disruptive, especially compared to the 3-line patch proposed above. Do you see any advantage to consolidating them beyond the hope of easing maintenance burdens by shrinking the codebase?
I think that more significant consolidation could happen within ArrayBuilder itself, because ArrayBuilder.{ofByte, ofShort, ofChar, ofInt, ofLong, ofFloat, ofDouble, ofBoolean, ofRef} all have almost identical definitions: https://github.com/scala/scala/blob/8a2cf63ee5bad8c8c054f76464de0e10226516a0/src/library/scala/collection/mutable/ArrayBuilder.scala#L83
ofUnit deviates dramatically from the rest, but toward simplicity: its implementation uses no memory until result() gets called.
The builders are specialized because arrays are specialized on the JVM.
Modulo binary compatibility, I wouldn't get hung up on the signature of ensureSize. The important point is to test calculating a number, not to test if a huge allocation succeeds.
If there is one canonical way of figuring out if an array could be bigger, then it makes sense to share it. It's not about code size, but the investment in headaches to nail it down.
I don't know how ArrayBuilder is used on average across the ecosystem. (One can check.) It's doable to allocate chunks and keep track of total size and current endIndex. I think I proposed that if the user gives a sizeHint, that means allocate and copy now, with the expectation that no further allocation is necessary. For chunks, result always incurs allocating at least twice the final size.
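To make that chunk accounting concrete (illustrative arithmetic only, not library code):

    // Doubling chunks 16, 16, 32, 64, ... sum to the total capacity, so the
    // final flat array allocated by result() comes on top of roughly n
    // elements already live in chunks -- at least ~2n transient allocation.
    val chunks = List(16, 16, 32, 64, 128)
    val capacity = chunks.sum                          // 256
    val elementsWritten = 200                          // say the builder holds 200
    val allocatedAtResult = capacity + elementsWritten // 456 >= 2 * 200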
There are many such details if you choose to implement something interesting, plus maintenance.
Consider the investment at https://github.com/apache/commons-io/blob/master/src/main/java/org/apache/commons/io/output/AbstractByteArrayOutputStream.java, which reminds me that a friend maintained one; I've forgotten where he borrowed the original code. That version allowed you to write huge data, but then toArray would throw. I believe he said, "Why would I ever do that?"
My guess is that most people using arrays allocate a size they know they need, so that growing a builder isn't a problem (until they want a huge array). Somewhere I just noticed a buffer size. (https://github.com/lampepfl/dotty/pull/15622/commits/2085bfd3cc795f89897a75edc4ade51ede8e505f) You don't care until something gives you a reason to care, and then you know.
Also, it would be nice to lift all boats, but folks who need it can pay someone to wash it on the weekend and keep it filled with gas. The standard library has a threshold for gain/pain. Maybe that's what you meant by resistance.
I saw what you meant by reusability (because it implements ReusableBuilder). That seems like a doc bug.
Also, it would be nice to lift all boats, but folks who need it can pay someone to wash it on the weekend and keep it filled with gas.
I wonder: is this some kind of parable, or just a sentence that was intended for a different target?
@OndrejSpanel the saying is, "A rising tide lifts all boats." It would be nice if fixing the problem also helped many other people. I extended the metaphor with car maintenance, but it's useful to note that boats require even more maintenance. The gist is that introducing a maintenance cost can be a net negative. For example, I dread distracting Martin Odersky from his useful work: even if I "fix" something (a so-called fix), if it requires even a brief period of his attention, is it worth it?
I think I proposed that if the user gives a sizeHint, that means allocate and copy now, with the expectation that no further allocation is necessary.
If an application can determine the size in advance, why would it use a builder at all? To me, sizeHint provides a way to skip tedious allocate-and-copy steps when the general order of the size is known. However, sizeHint interpreted your way deflects responsibility for performance problems back onto users who provide bad estimates. :)
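For example, under the "general order of the size" reading (illustrative):

    import scala.collection.mutable.ArrayBuilder

    val b = ArrayBuilder.make[Int]
    b.sizeHint(1000)              // caller expects "about a thousand" elements
    for (i <- 0 until 997) b += i // no intermediate reallocation on the way
    val arr = b.result()          // one final copy from capacity 1000 down to 997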
My guess is that most people using arrays allocate a size they know they need, so that growing a builder isn't a problem (until they want a huge array). Somewhere I just noticed a buffer size. (lampepfl/dotty@2085bfd) You don't care until something gives you a reason to care, and then you know.
Exactly! However, if you look at ArrayOps (https://github.com/scala/scala/blob/2.13.x/src/library/scala/collection/ArrayOps.scala#L86), you'll find ArrayBuilder does all the heavy lifting for map, flatMap, filter, partition, partitionMap, etc. Maybe most people who use ArrayBuilder don't even know they're using it.
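For instance, neither of these calls mentions ArrayBuilder, yet in 2.13's ArrayOps at least filter and flatMap build their results through it because the output length isn't known up front (a sketch of the implicit dependency):

    val xs: Array[Int] = Array(1, 2, 3, 4)
    val evens = xs.filter(_ % 2 == 0)         // builds via ArrayBuilder internally
    val pairs = xs.flatMap(x => Array(x, -x)) // builds via ArrayBuilder internally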
The builders are specialized because arrays are specialized on JVM.
good point. this looks like a bunch of straightforward but extremely tedious busy-work for whoever wants to implement this
OK. My plan was to get this working in a less mainstream Array library, then, after validating it there with benchmarks and tests, try to move it into Scala.
Today, I implemented my brilliant idea for making ArrayBuilder much, much faster, and it performed 2X slower than the existing implementation. However, for any arrays larger than 1073741824 elements, the current implementation definitely does enter that infinite loop and mine didn't. I'll just eliminate the infinite loop issue and submit the PR.
@dragonfly-ai If your idea is not recorded above, you may wish to add it now for future reference. Even experience of "how I tested it" can be useful shared knowledge (or shared ignorance, as the case may be).
Arrays on JVM are tricky to reason about (like everything) because of initialization vs optimization; I can only guess at the subtleties. Also the storefront to open is, "Subtle Tees".
My idea:
Instead of copying the array values into a new array that's twice as large, I thought I could save compute and memory by using a jagged 2D array.
Basically, the resize step would initialize a new array that's the same size as the current capacity and store that in the next index of the 2D array; no need to copy anything!
It turns out that Java's

    @IntrinsicCandidate
    public static <T,U> T[] copyOf(U[] original, int newLength, Class<? extends T[]> newType) { ... }

method in java.util.Arrays does something faster than calling new Array[T](currentSize), even when newLength is twice as big as size.
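A crude way to compare the two (a naive timing sketch; JMH or similar would be more trustworthy, and results vary by JVM and warmup):

    import java.util.Arrays

    def timeNs(body: => Unit): Long = {
      val t0 = System.nanoTime(); body; System.nanoTime() - t0
    }

    val src = new Array[Int](1 << 20)
    // copyOf allocates AND copies at double the length, yet its intrinsic
    // fast path can rival or beat a plain same-size allocation:
    val tCopy  = timeNs(Arrays.copyOf(src, src.length * 2))
    val tAlloc = timeNs(new Array[Int](src.length))
    println(s"copyOf: ${tCopy}ns, new: ${tAlloc}ns")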
I tested this, though not rigorously, by repeatedly measuring the time it takes to create a builder for Array[Int] and append values from 0 until Int.MaxValue - 2, then did the same for the builtin ArrayBuilder. Under this test, ArrayBuilder gets into an infinite loop, so I decreased the maximum size to 1073741824.
So long as the size stays below 1073741824, trials with the existing ArrayBuilder completed twice as fast as my custom one.
Here's the most relevant excerpt from the source code of my failed implementation:
    import scala.reflect.ClassTag

    // NArray and Appends are types from the author's external array library.
    object AppendableNArray {
      val InitialCapacity: Int = 16
      val MaxBucketCount: Int = 27
      val MaxSize: Int = 2147483645 // Int.MaxValue - 2

      // Size the next bucket so the total capacity doubles until it nears MaxSize.
      def nextBucket[T](currentSize: Int)(using ct: ClassTag[T]): NArray[T] = new NArray[T](
        if (currentSize < 536870913) currentSize
        else if (currentSize < MaxSize) MaxSize - currentSize
        else throw java.lang.OutOfMemoryError("Requested array size exceeds VM limit")
      )
    }

    class AppendableNArray[T](using ct: ClassTag[T]) extends Appends[T] {
      import AppendableNArray.*

      override type A = AppendableNArray[T]

      private var filledBucketSize: Int = 0
      // index of the current bucket
      private var b: Int = 0
      private var bucket: NArray[T] = new NArray[T](InitialCapacity)
      private var buckets: NArray[NArray[T]] = null
      // index within the current bucket
      private var bi: Int = 0

      override def append[T1 >: T](element: T): AppendableNArray[T] = {
        if (bi < bucket.length) bucket(bi) = element
        else {
          // current bucket is full: stash it and start a new one
          filledBucketSize = filledBucketSize + bucket.length
          if (buckets == null) buckets = new NArray[NArray[T]](MaxBucketCount)
          if (b < buckets.length) {
            buckets(b) = bucket
            b += 1
            bi = 0
            bucket = AppendableNArray.nextBucket[T](filledBucketSize)
            bucket(bi) = element
          } else throw new Exception("AppendableNArray reached max capacity.")
        }
        bi += 1
        this
      }
    }
Although twice as slow, it does have a smaller memory footprint, except in the very specific case when the user calls toArray and the array underlying the builder is exactly filled. It also doesn't suffer from the overflow error, but since we can fix that in the existing implementation, overall I'd say its disadvantages outweigh its advantages.
Also the storefront to open is, "Subtle Tees".
http://subtletees.com Squatters beat us to it.
The test for this bug fix requires over 18 gigabytes of heap space, so thanks to @SethTisue and @som-snytt for suggesting the magic comment // java: -Xmx24g -Xms8g needed to fork a new JVM while running test/files/run/array-builder-test.scala. As I prepare the PR, should I hold back the testing code, or include it?
PR Submitted, but I omitted the tests because they're so memory intensive.
Added a unit test that exercises the resizing logic without allocating memory. Done?
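Something along these lines, presumably; a sketch where resizeUp is a hypothetical extraction of the size computation into a pure function, so tests exercise the arithmetic without any 18 GB allocation:

    // Pure mirror of the patched doubling logic, testable in isolation:
    def resizeUp(capacity: Int, size: Int): Int =
      if (capacity >= size && capacity != 0) capacity
      else {
        var newsize: Long = if (capacity == 0) 16L else capacity * 2L
        while (newsize < size) newsize *= 2L
        Math.min(newsize, Int.MaxValue).toInt
      }

    assert(resizeUp(0, 10) == 16)
    assert(resizeUp(16, 17) == 32)
    assert(resizeUp(1 << 30, Int.MaxValue) == Int.MaxValue) // no overflow, no loop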
References
Follow-up to https://github.com/scala/bug/issues/12464
Problem
ArrayBuilder doesn't attempt smart resizing as the size grows large:
- It still just tries naive doubling of size.
- When the user offers a sizeHint, that is probably a clue to allocate a result buffer (and copy if necessary).
- Otherwise, the implementer might choose to assemble partial buffers at result time, as suggested on chat.