I'm on it!
ArrayBuffer.ensureSize deals with offsets and lengths while ArrayBuilder.ensureSize takes a size: Int parameter directly. Although the current version quietly ignores overflowed/negative size parameter values, as size approaches Int.MaxValue, ArrayBuilder's naive size-doubling policy can cause newsize to overflow and trap ensureSize in a long, possibly infinite loop.
See this scastie for a demonstration of the infinite loop (you might not want to run it): https://scastie.scala-lang.org/wKOS1p2mTlaCZE9MwdNg7w
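For intuition, here's a minimal standalone sketch of the doubling arithmetic, decoupled from the real ensureSize (the step cap is added here only so the sketch terminates):

    // Mirrors the Int doubling in ensureSize. With size = Int.MaxValue:
    // 16 doubles up to 1073741824 (2^30), then overflows to -2147483648,
    // then to 0, and 0 * 2 == 0 keeps the real loop spinning forever.
    def simulateDoubling(size: Int, capacity: Int): Int = {
      var newsize = if (capacity == 0) 16 else capacity * 2
      var steps = 0
      while (newsize < size && steps < 40) { // the cap is absent in the real code
        newsize *= 2
        steps += 1
      }
      newsize
    }

    simulateDoubling(Int.MaxValue, 0) // returns 0: stuck forever without the cap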
If all we want to do is protect newsize from overflow, it's a 3-line change from:
    protected[this] final def ensureSize(size: Int /* <- overflow impossible. negative sign? */ ): Unit = {
      if (capacity < size || capacity == 0) { /* ignore if negative */
        var newsize = if (capacity == 0) 16 else capacity * 2
        while (newsize < size) newsize *= 2 /* <- this can overflow continually in an infinite loop. */
        resize(newsize)
      }
    }
to:
    protected[this] final def ensureSize(size: Int): Unit = {
      if (capacity < size || capacity == 0) {
        var newsize: Long = if (capacity == 0) 16L else capacity * 2L
        while (newsize < size) newsize *= 2L
        resize(Math.min(newsize, Int.MaxValue).toInt)
      }
    }
See this scastie for a demonstration that the overflow, and the infinite loop, are averted: https://scastie.scala-lang.org/YdRvtUhdTIuAOhbDPtDEbA
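And a REPL-sized check of the patched arithmetic, isolated from capacity and resize:

    // The Long-based doubling from the patch above, in isolation:
    var newsize: Long = 16L
    while (newsize < Int.MaxValue) newsize *= 2L
    // newsize is now 2147483648L (2^31); no wraparound, the loop terminated.
    Math.min(newsize, Int.MaxValue).toInt // clamps to 2147483647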
As a consequence of this proposed revision, executions that attempt to add more elements to an already full ArrayBuilder will incur an ArrayIndexOutOfBoundsException.
Should this revision follow ArrayBuffer's example and throw new Exception(s"Collections can not have more than ${Int.MaxValue} elements") when a full ArrayBuilder receives calls to addOne or addAll, or should we defer to the more cryptic ArrayIndexOutOfBoundsException?
@scala/collections crew may have some insights
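For reference, one hedged sketch of what the ArrayBuffer-style guard could look like if bolted onto the patched ensureSize (placement and wording are hypothetical; only the message text is quoted from ArrayBuffer):

    protected[this] final def ensureSize(size: Int): Unit = {
      // A negative size means the caller's Int arithmetic already overflowed:
      if (size < 0)
        throw new Exception(s"Collections can not have more than ${Int.MaxValue} elements")
      if (capacity < size || capacity == 0) {
        var newsize: Long = if (capacity == 0) 16L else capacity * 2L
        while (newsize < size) newsize *= 2L
        resize(Math.min(newsize, Int.MaxValue).toInt)
      }
    }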
From a broader perspective, maybe the risk of newsize overflow causes less trouble than the way that, on every resize() invocation, ArrayBuilder allocates a fresh array of roughly double the capacity and copies every element already written into it.
What's more, when these builders get reused, they never relinquish memory allocated to them in prior uses. Rather, on each call to clear(), ArrayBuilder[AnyRef] writes null into all of the previously used Array slots, which costs about as much as allocating a new Array, negating some of the performance improvements implied by reuse. ArrayBuilder[T <: AnyVal] just leaves dirty data in the Array, presumably for speed.
However, reusing an ArrayBuilder in this way resembles a memory leak: although each reuse may increase its memory footprint, nothing ever decreases it. In the worst case, this means each reused ArrayBuilder[T] hoards roughly 9-18 GB of heap space even if it only reached maxSize once over the course of any number of reuses.
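A sketch of the reuse pattern in question (sizes are illustrative; the point is that clear() resets the write index but keeps the backing array):

    import scala.collection.mutable.ArrayBuilder

    // One builder reused across jobs of very different sizes:
    val b = ArrayBuilder.make[Int]

    def build(n: Int): Array[Int] = {
      b.clear()                        // resets size; the backing array is retained
      var i = 0
      while (i < n) { b += i; i += 1 }
      b.result()
    }

    build(1000000) // capacity doubles up to 1048576; result() returns a copy
    build(16)      // the 1048576-element backing array is still held inside b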
Instead of merely guarding against the newsize overflow risk, I'd like to propose an alternative implementation of ArrayBuilder which, on resize() invocations, allocates only a fresh chunk instead of copying everything into a larger array, and which also doesn't act like a memory leak when reused to build Arrays of diverse sizes.
This approach largely makes newsize overflow concerns moot, but necessitates a potentially more disruptive pull request.
What kind of resistance might such a revision face from the community? Should I just push the subtle change for now, or pursue the more aggressive revision directly?
ArrayBuilder's general resizing policy hasn't changed since 2009, but prior to that, did any experiments pursue more aggressive performance goals?
A less naive approach to this infrastructure-critical feature has the potential to boost application performance across the entire Scala ecosystem. Is there any reason why we shouldn't try for further optimization?
is there any reason we can't just directly call the ArrayBuffer resizing code? we ought to be able to tweak access if needed
I see a few reasons why we shouldn't share code between ArrayBuffer and ArrayBuilder, but can summarize most of them by saying: they do different things. Buffer supports edits, shrinking, and removals while Builder can only grow monotonically.
If you look into the details, though, you'll notice that ArrayBuffer.ensureSize() lives in a companion object and takes the underlying array as a parameter along with end: Int and length: Int, while ArrayBuilder.ensureSize() is an instance method that takes only a single size: Int parameter.
Though a measure of consolidation between Buffer and Builder is possible, it seems unnecessarily disruptive, especially compared to the 3-line patch proposed above. Do you see any advantage to consolidating them beyond the hope of easing maintenance burdens by shrinking the codebase?
I think that more significant consolidation could happen within ArrayBuilder itself, because ArrayBuilder.{ofByte, ofShort, ofChar, ofInt, ofLong, ofFloat, ofDouble, ofBoolean, ofRef} all have almost identical definitions: https://github.com/scala/scala/blob/8a2cf63ee5bad8c8c054f76464de0e10226516a0/src/library/scala/collection/mutable/ArrayBuilder.scala#L83
ofUnit deviates dramatically from the rest, but toward simplicity: its implementation uses no memory until result() gets called.
The builders are specialized because arrays are specialized on the JVM.
Modulo binary compatibility, I wouldn't get hung up on the signature of ensureSize. The important point is to test calculating a number, not to test if a huge allocation succeeds.
If there is one canonical way of figuring out if an array could be bigger, then it makes sense to share it. It's not about code size, but the investment in headaches to nail it down.
I don't know how ArrayBuilder is used on average across the ecosystem. (One can check.) It's doable to allocate chunks and keep track of total size and current endIndex. I think I proposed that if the user gives a sizeHint, that means allocate and copy now, with the expectation that no further allocation is necessary. For chunks, result always incurs allocating at least twice the final size.
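To make that chunk accounting concrete (illustrative arithmetic only, not library code):

    // Doubling chunks 16, 16, 32, 64, ... sum to the total capacity, so the
    // final flat array allocated by result() comes on top of roughly n
    // elements already live in chunks -- at least ~2n transient allocation.
    val chunks = List(16, 16, 32, 64, 128)
    val capacity = chunks.sum                          // 256
    val elementsWritten = 200                          // say the builder holds 200
    val allocatedAtResult = capacity + elementsWritten // 456 >= 2 * 200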
There are many such details if you choose to implement something interesting, plus maintenance.
Consider the investment at https://github.com/apache/commons-io/blob/master/src/main/java/org/apache/commons/io/output/AbstractByteArrayOutputStream.java, which reminds me that a friend maintained one; I've forgotten where he borrowed the original code. That version allowed you to write huge data, but then toArray would throw. I believe he said, "Why would I ever do that?"
My guess is that most people using arrays allocate a size they know they need, so that growing a builder isn't a problem (until they want a huge array). Somewhere I just noticed a buffer size. (https://github.com/lampepfl/dotty/pull/15622/commits/2085bfd3cc795f89897a75edc4ade51ede8e505f) You don't care until something gives you a reason to care, and then you know.
Also, it would be nice to lift all boats, but folks who need it can pay someone to wash it on the weekend and keep it filled with gas. The standard library has a threshold for gain/pain. Maybe that's what you meant by resistance.
I saw what you meant by reusability (because it implements ReusableBuilder). That seems like a doc bug.
Also, it would be nice to lift all boats, but folks who need it can pay someone to wash it on the weekend and keep it filled with gas.
I wonder: is this some kind of parable, or just a sentence that was intended for a different target?
@OndrejSpanel the saying is, "A rising tide lifts all boats." It would be nice if fixing the problem also helped many other people. I extended the metaphor with car maintenance, but it's useful to note that boats require even more maintenance. The gist is that introducing a maintenance cost can be a net negative. For example, I dread distracting Martin Odersky from his useful work: even if I "fix" something (a so-called fix), if it requires even a brief period of his attention, is it worth it?
I think I proposed that if the user gives a sizeHint, that means allocate and copy now, with the expectation that no further allocation is necessary.
If an application can determine the size in advance, why would it use a builder at all? To me, sizeHint provides a way to skip tedious allocate-and-copy steps when the general order of the size is known. However, sizeHint interpreted your way deflects responsibility for performance problems back onto users who provide bad estimates. :)
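For example, under the "general order of the size" reading (illustrative):

    import scala.collection.mutable.ArrayBuilder

    val b = ArrayBuilder.make[Int]
    b.sizeHint(1000)              // caller expects "about a thousand" elements
    for (i <- 0 until 997) b += i // no intermediate reallocation on the way
    val arr = b.result()          // one final copy from capacity 1000 down to 997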
My guess is that most people using arrays allocate a size they know they need, so that growing a builder isn't a problem (until they want a huge array). Somewhere I just noticed a buffer size. (lampepfl/dotty@2085bfd) You don't care until something gives you a reason to care, and then you know.
Exactly! However, if you look at ArrayOps (https://github.com/scala/scala/blob/2.13.x/src/library/scala/collection/ArrayOps.scala#L86), you'll find ArrayBuilder does all the heavy lifting for map, flatMap, filter, partition, partitionMap, etc. Maybe most people who use ArrayBuilder don't even know they're using it.
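For instance, neither of these calls mentions ArrayBuilder, yet in 2.13's ArrayOps at least filter and flatMap build their results through it because the output length isn't known up front (a sketch of the implicit dependency):

    val xs: Array[Int] = Array(1, 2, 3, 4)
    val evens = xs.filter(_ % 2 == 0)         // builds via ArrayBuilder internally
    val pairs = xs.flatMap(x => Array(x, -x)) // builds via ArrayBuilder internally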
The builders are specialized because arrays are specialized on JVM.
good point. this looks like a bunch of straightforward but extremely tedious busy-work for whoever wants to implement this
OK. My plan was to get this working in a less mainstream Array library, then, after validating it there with benchmarks and tests, try to move it into Scala.
Today, I implemented my brilliant idea for making ArrayBuilder much, much faster, and it performed 2X slower than the existing implementation. However, for any arrays larger than 1073741824 elements, the current implementation definitely does enter that infinite loop and mine didn't. I'll just eliminate the infinite loop issue and submit the PR.
@dragonfly-ai If your idea is not recorded above, you may wish to add it now for future reference. Even experience of "how I tested it" can be useful shared knowledge (or shared ignorance, as the case may be).
Arrays on JVM are tricky to reason about (like everything) because of initialization vs optimization; I can only guess at the subtleties. Also the storefront to open is, "Subtle Tees".
My idea:
Instead of copying the array values into a new array that's twice as large, I thought I could save compute and memory by using a jagged 2D array.
Basically, the resize step would initialize a new array that's the same size as the current capacity and store that in the next index of the 2D array; no need to copy anything!
It turns out that Java's

    @IntrinsicCandidate
    public static <T,U> T[] copyOf(U[] original, int newLength, Class<? extends T[]> newType) { ... }

method in java.util.Arrays does something faster than calling new Array[T](currentSize), even when newLength is twice as big as size.
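A crude way to compare the two (a naive timing sketch; JMH or similar would be more trustworthy, and results vary by JVM and warmup):

    import java.util.Arrays

    def timeNs(body: => Unit): Long = {
      val t0 = System.nanoTime(); body; System.nanoTime() - t0
    }

    val src = new Array[Int](1 << 20)
    // copyOf allocates AND copies at double the length, yet its intrinsic
    // fast path can rival or beat a plain same-size allocation:
    val tCopy  = timeNs(Arrays.copyOf(src, src.length * 2))
    val tAlloc = timeNs(new Array[Int](src.length))
    println(s"copyOf: ${tCopy}ns, new: ${tAlloc}ns")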
I tested this, though not rigorously, by repeatedly measuring the time it takes to create a builder for Array[Int] and append values from 0 until Int.MaxValue - 2, then did the same for the builtin ArrayBuilder. Under this test, ArrayBuilder gets into an infinite loop, so I decreased the maximum size to 1073741824.
So long as the size stays below 1073741824, trials with the existing ArrayBuilder completed twice as fast as my custom one.
Here's the most relevant excerpt from the source code of my failed implementation:
    import scala.reflect.ClassTag

    // NArray and Appends are types from the author's external array library.
    object AppendableNArray {
      val InitialCapacity: Int = 16
      val MaxBucketCount: Int = 27
      val MaxSize: Int = 2147483645 // Int.MaxValue - 2

      // Size the next bucket so the total capacity doubles until it nears MaxSize.
      def nextBucket[T](currentSize: Int)(using ct: ClassTag[T]): NArray[T] = new NArray[T](
        if (currentSize < 536870913) currentSize
        else if (currentSize < MaxSize) MaxSize - currentSize
        else throw java.lang.OutOfMemoryError("Requested array size exceeds VM limit")
      )
    }

    class AppendableNArray[T](using ct: ClassTag[T]) extends Appends[T] {
      import AppendableNArray.*

      override type A = AppendableNArray[T]

      private var filledBucketSize: Int = 0
      // index of the current bucket
      private var b: Int = 0
      private var bucket: NArray[T] = new NArray[T](InitialCapacity)
      private var buckets: NArray[NArray[T]] = null
      // index within the current bucket
      private var bi: Int = 0

      override def append[T1 >: T](element: T): AppendableNArray[T] = {
        if (bi < bucket.length) bucket(bi) = element
        else {
          // current bucket is full: stash it and start a new one
          filledBucketSize = filledBucketSize + bucket.length
          if (buckets == null) buckets = new NArray[NArray[T]](MaxBucketCount)
          if (b < buckets.length) {
            buckets(b) = bucket
            b += 1
            bi = 0
            bucket = AppendableNArray.nextBucket[T](filledBucketSize)
            bucket(bi) = element
          } else throw new Exception("AppendableNArray reached max capacity.")
        }
        bi += 1
        this
      }
    }
Although twice as slow, it does have a smaller memory footprint, except in the very specific case when the user calls toArray and the array underlying the builder is exactly filled. It also doesn't suffer from the overflow error, but since we can fix that in the existing implementation, overall I'd say its disadvantages outweigh its advantages.
Also the storefront to open is, "Subtle Tees".
http://subtletees.com Squatters beat us to it.
The test for this bug fix requires over 18 gigabytes of heap space, so thanks to @SethTisue and @som-snytt for suggesting the magic comment // java: -Xmx24g -Xms8g needed to fork a new JVM while running test/files/run/array-builder-test.scala. As I prepare the PR, should I hold back the testing code, or include it?
PR Submitted, but I omitted the tests because they're so memory intensive.
Added a unit test that exercises the resizing logic without allocating memory. Done?
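Something along these lines, presumably; a sketch where resizeUp is a hypothetical extraction of the size computation into a pure function, so tests exercise the arithmetic without any 18 GB allocation:

    // Pure mirror of the patched doubling logic, testable in isolation:
    def resizeUp(capacity: Int, size: Int): Int =
      if (capacity >= size && capacity != 0) capacity
      else {
        var newsize: Long = if (capacity == 0) 16L else capacity * 2L
        while (newsize < size) newsize *= 2L
        Math.min(newsize, Int.MaxValue).toInt
      }

    assert(resizeUp(0, 10) == 16)
    assert(resizeUp(16, 17) == 32)
    assert(resizeUp(1 << 30, Int.MaxValue) == Int.MaxValue) // no overflow, no loop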
References
Follow-up to https://github.com/scala/bug/issues/12464
Problem
ArrayBuilder doesn't attempt smart resizing as the size grows large:
- It still just tries naive doubling of size.
- When the user offers a sizeHint, that is probably a clue to allocate a result buffer (and copy if necessary).
- Otherwise, the implementer might choose to assemble partial buffers at result time, as suggested on chat.