
cmd/compile,runtime: speed up heap allocations in loops #60879

Open CAFxX opened 1 year ago

CAFxX commented 1 year ago

There is currently a large performance gap between performing N individual heap allocations of type T and performing a single allocation of a slice of N elements of type T (even when T contains no pointers and the size of T exactly matches the size class used for T).

Just to give an idea, the following benchmark[^1] shows the performance gap between the two for varying numbers of elements N (old is N individual allocations, new is a single allocation of a slice of N elements):
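The benchmark source isn't included in the issue; the following is a plausible reconstruction (the element type T is an assumption, and the split into two benchmark functions is for self-containment: in the original, old and new were presumably two variants of the same benchmark compared with benchstat):

package allocs_test

import (
  "fmt"
  "testing"
)

// T stands in for a pointer-free element type whose size matches a size
// class (an assumption; the issue does not show the type benchmarked).
type T [4]int64

var sink []*T // global sink so the compiler cannot elide the allocations

// "old": N separate calls into the heap allocator.
func BenchmarkAllocsOld(b *testing.B) {
  for n := 1; n <= 65536; n *= 2 {
    b.Run(fmt.Sprint(n), func(b *testing.B) {
      for i := 0; i < b.N; i++ {
        s := make([]*T, n)
        for j := range s {
          s[j] = new(T)
        }
        sink = s
      }
    })
  }
}

// "new": a single allocation of a slice of N elements.
func BenchmarkAllocsNew(b *testing.B) {
  for n := 1; n <= 65536; n *= 2 {
    b.Run(fmt.Sprint(n), func(b *testing.B) {
      for i := 0; i < b.N; i++ {
        batch := make([]T, n)
        s := make([]*T, n)
        for j := range s {
          s[j] = &batch[j]
        }
        sink = s
      }
    })
  }
}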

name          old time/op  new time/op  delta
Allocs/1      65.4ns ± 0%  65.2ns ± 0%     ~     (p=0.079 n=4+5)
Allocs/2       100ns ± 3%    74ns ±11%  -25.60%  (p=0.008 n=5+5)
Allocs/4       184ns ± 1%    84ns ± 4%  -54.26%  (p=0.016 n=4+5)
Allocs/8       315ns ± 1%   103ns ±11%  -67.41%  (p=0.008 n=5+5)
Allocs/16      580ns ± 2%   132ns ±12%  -77.17%  (p=0.008 n=5+5)
Allocs/32     1.18µs ±16%  0.19µs ±24%  -83.93%  (p=0.008 n=5+5)
Allocs/64     2.21µs ± 0%  0.29µs ±25%  -86.93%  (p=0.008 n=5+5)
Allocs/128    4.75µs ±22%  0.51µs ± 9%  -89.32%  (p=0.008 n=5+5)
Allocs/256    12.6µs ±17%   0.9µs ± 4%  -92.50%  (p=0.008 n=5+5)
Allocs/512    22.4µs ±12%   1.8µs ± 5%  -91.94%  (p=0.008 n=5+5)
Allocs/1024   42.5µs ± 6%   3.5µs ± 5%  -91.81%  (p=0.008 n=5+5)
Allocs/2048   85.2µs ±12%   7.1µs ±13%  -91.72%  (p=0.008 n=5+5)
Allocs/4096    169µs ±10%    14µs ± 8%  -91.82%  (p=0.008 n=5+5)
Allocs/8192    298µs ±15%    25µs ± 1%  -91.63%  (p=0.008 n=5+5)
Allocs/16384   542µs ± 1%    49µs ± 3%  -91.02%  (p=0.008 n=5+5)
Allocs/32768  1.07ms ± 1%  0.09ms ± 2%  -91.11%  (p=0.008 n=5+5)
Allocs/65536  2.13ms ± 2%  0.19ms ± 3%  -91.20%  (p=0.008 n=5+5)

To help close the gap, it may be worthwhile to investigate performing batch allocations for heap allocations that occur inside loops. This pattern occurs frequently in unmarshaling code and, more generally, in data-processing code, as sketched below.
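For instance, a decoding loop typically has the following shape (a sketch: json.Decoder is just one concrete decoder, and Record is a hypothetical element type):

import "encoding/json"

type Record struct{ /* ... */ }

// decodeAll allocates one Record per decoded element: exactly the
// per-iteration allocation pattern that batching would target.
func decodeAll(dec *json.Decoder) ([]*Record, error) {
  var out []*Record
  for dec.More() {
    r := new(Record) // one heap allocation per iteration
    if err := dec.Decode(r); err != nil {
      return nil, err
    }
    out = append(out, r)
  }
  return out, nil
}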

To clarify, I am not suggesting turning N allocations of type T into a single allocation of a slice of N elements of type T, as that would prevent garbage collection of the individual elements. Instead, by "batch allocation" I mean performing the allocation of N individual elements of type T in a single call into the runtime. This way, each of the N elements can still be individually garbage collected.
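To make the distinction concrete (a sketch, reusing the placeholder type T):

func lifetimes() *T {
  batch := make([]T, 3) // a single heap object backs all three elements
  // Returning a pointer to one element keeps the whole backing array
  // live: batch[1] and batch[2] cannot be reclaimed while it is reachable.
  // Under the proposed batch allocation, each element would instead be a
  // separate heap object, individually collectible.
  return &batch[0]
}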

This could mean, for a loop with a trip count known at compile time, turning this:

for ... /* trip count = C (known at compile time) */ {
  e := new(T)
  // use e
}

into this:

a := make([]*T, 0, C) // stack allocated
for ... {
  e := new_batched(&a)
  // use e
}
batch_deallocate(a) // optional

and for loops with trip count known at runtime, turning this:

for ... /* trip count = c (not known at compile time) */ {
  e := new(T)
  // use e
}

into:

a := make([]*T, 0, 64) // stack allocated
for ... {
  if len(a) == 0 && remaining_trip_count < cap(a) { // optional: avoid pre-allocating past the end of the loop
    a = a[:0:remaining_trip_count]
  }
  e := new_batched(&a)
  // use e
}
batch_deallocate(a) // optional

and for loops with unknown trip count, turning this:

for ... {
  e := new(T)
  // use e
}

into:

a := make([]*T, 0, 64) // stack allocated
for ... {
  e := new_batched(&a)
  // use e
}
batch_deallocate(a) // optional

where the runtime function calls inserted above by the compiler would be equivalent to:

func new_batched[T any](a *[]*T) *T {
  if a == nil || cap(*a) <= 1 { // batching disabled or pointless: fall back to a plain allocation
    return new(T)
  }
  if len(*a) == 0 { // batch exhausted: refill it in a single runtime call
    *a = (*a)[:cap(*a)]
    batch_allocate(*a)
  }
  e := (*a)[len(*a)-1] // pop the next pre-allocated element
  *a = (*a)[:len(*a)-1]
  return e
}

func batch_allocate[T any](a []*T) {
  // this would be a runtime call that allocates len(a) elements of type T
  for i := range a {
    a[i] = new(T)
  }
}

func batch_deallocate[T any](a []*T) {
  // this would be a runtime call that deallocates the len(a) elements of type T
  for _, e := range a {
    _ = e // mark e as unused in the corresponding allocation bitmap
  }
}

In a sense, this would entail having loop-scoped bump allocators to avoid entering the main heap allocator for each individual element.

Surely some heuristics would be needed to decide when to use batch allocation and when not to: for example, we definitely would not want to use batch allocation for loops with very low trip counts. Nevertheless, I think this would be a worthwhile optimization to consider.

An alternative to the above would be to specialize the heap allocator per {size class, has pointers} pair, or otherwise to add generic, fast per-P bump allocators (which would have the benefit of being effective for all code, not just loops). These seem like more invasive changes to the runtime, though.
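To illustrate the bump-allocator idea, here is a user-space sketch (names hypothetical; a real per-P allocator would live inside the runtime and, unlike this version, would not keep a whole chunk alive while any of its elements is still referenced):

// bumpAllocator hands out elements from a pre-allocated chunk instead of
// calling into the heap allocator once per element.
type bumpAllocator[T any] struct {
  chunk []T // current backing chunk
  next  int // index of the next free element in chunk
}

func (b *bumpAllocator[T]) alloc() *T {
  if b.next == len(b.chunk) {
    b.chunk = make([]T, 64) // refill; 64 is an arbitrary batch size
    b.next = 0
  }
  e := &b.chunk[b.next]
  b.next++
  return e
}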

Other alternatives include the (over)use of sync.Pool, but, as has been noted in the past, almost any use of sync.Pool is an indication that the memory management subsystem is lacking in one way or another.
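For reference, the sync.Pool workaround usually looks something like this (a sketch; processOne is a hypothetical unit of work, and T is the placeholder type from above):

import "sync"

var tPool = sync.Pool{New: func() any { return new(T) }}

// processOne borrows a *T from the pool for one iteration's worth of
// work instead of allocating a fresh one.
func processOne() {
  e := tPool.Get().(*T)
  defer tPool.Put(e) // only safe if e does not escape this function
  *e = T{}           // reset state left over from a previous use
  // ... use e ...
}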

[^1]: go version go1.20.5 darwin/amd64, Intel(R) Core(TM) i9-9980HK CPU @ 2.40GHz (turbo boost disabled)

rip-create-your-account commented 1 year ago

Opportunities for batch allocation are something I stumble upon quite regularly, and usually there's more than one allocation going on in the loop. Most recently in real code here.

I really don't like writing batch := make([]T, n); obj1 := &batch[0]; obj2 := &batch[1]; ... for the reasons already mentioned in the OP, but it improves performance on all metrics by too much not to do it (see the sketch below). Please guide us away from the temptation of one-at-a-time processing.
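For loops that perform several allocations per iteration, the same manual workaround generalizes to one composite allocation (a sketch in the OP's pseudocode style; Header and Body are hypothetical types):

type pair struct {
  hdr  Header
  body Body
}

for ... {
  p := new(pair) // one allocation instead of two
  hdr, body := &p.hdr, &p.body
  // use hdr and body; note that they now share a single lifetime,
  // which is the drawback runtime batch allocation would avoid
}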