
cmd/compile: avoiding zeroing new allocations when possible #24926


josharian commented 6 years ago

Consider:

func f() *int {
  x := new(int)
  *x = 1
  return x
}

The first line gets translated into x = newobject(type-of-int), which calls mallocgc with a "needzero" argument of true. But it doesn't need zeroing: the type contains no pointers, and every byte gets written before the function returns.
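
For reference, newobject is just a thin wrapper. Roughly (simplified from runtime/malloc.go; exact field names vary by version):

// Simplified from runtime/malloc.go; newobject always requests zeroing.
func newobject(typ *_type) unsafe.Pointer {
  return mallocgc(typ.size, typ, true) // needzero=true, unconditionally
}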

Same holds for:

func f() *[2]int {
  x := new([2]int)
  x[0] = 1
  x[1] = 2
  return x
}

and more interestingly:

func f() *[1024]int {
  x := new([1024]int)
  for i := range x {
    x[i] = i
  }
  return x
}

We could detect such scenarios in the SSA backend and replace the call to newobject with a call to a (newly created) newobjectNoClr, which is identical to newobject except that it passes false to mallocgc for needzero.
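
The new runtime entry point would be a one-line variant. A minimal sketch, assuming the wrapper shape above (newobjectNoClr does not exist today):

// Hypothetical: identical to newobject except for the final argument.
func newobjectNoClr(typ *_type) unsafe.Pointer {
  return mallocgc(typ.size, typ, false) // skip zeroing; caller guarantees full initialization
}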

Aside: The SSA backend already understands newobject a little. It removes the pointless zero assignment from:

func f() *[2]int {
  x := new([2]int)
  x[0] = 0 // removed
  return x
}

although not from:

func f() *[2]int {
  x := new([2]int)
  x[0] = 1
  x[1] = 0 // not removed, but could be
  return x
}

Converting to newobjectNoClr would probably require a new SSA pass: put values in store order, detect calls to newobject, and check whether subsequent stores obviate the need for zeroing. The same pass could also eliminate unnecessary zeroing that the existing rewrite rules don't cover.

This new SSA pass might also someday grow to understand and rewrite e.g. calls to memmove and memequal with small constant sizes.
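
A rough shape for such a pass, in hypothetical pseudo-Go (this is not the real cmd/compile/internal/ssa API; isNewobjectCall, followingStores, storesCoverAllBytes, and retargetCall are made-up helpers):

// Hypothetical sketch only; types loosely follow the ssa package.
func zeroElim(f *Func) {
  for _, b := range f.Blocks {
    for _, v := range storeOrder(b) { // walk values in memory (store) order
      if !isNewobjectCall(v) {
        continue
      }
      // Find the stores to v's result that happen before anything
      // else could observe the memory.
      if storesCoverAllBytes(v, followingStores(v)) {
        retargetCall(v, "runtime.newobjectNoClr") // swap the call's Aux symbol
      }
    }
  }
}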

It is not obvious to me that this pass would pull its weight, compilation-time-wise. Needs experimentation. Filing an issue so that I don't forget about it. :)

TocarIP commented 6 years ago

We already inline memmove for small constant sizes in generic.rules. So if we introduce newobjectNoClr, we could probably also rewrite newobject with rules, which would avoid some of the overhead of a full pass.

josharian commented 6 years ago

There are two problems with doing it with rules.

(1) What we need to do is modify the OpStaticCall's Aux value while leaving the overall set of values unchanged; that's hard to do with rules. The first half you can fake with a condition that has a side effect, but the second half is not currently possible in a clean way. Maybe we could have a magic "origv" RHS value or some such.

(2) For the *[2]int case, you need nested rules to check that both ints are overwritten. For the *[3]int case you need deeper nested rules. You see the problem. :)
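
To make (2) concrete, here is what such a rule might have to look like, written schematically (this loosely imitates the generic.rules syntax and is not a real or valid rule; coversWholeObject is invented):

// Schematic only: matching a fully overwritten *[2]int means nesting
// one Store per element, and the pattern deepens with the array length.
(Store dst1 v1
    (Store dst0 v0
        mem:(SelectN [0] call:(StaticCall {sym} ...))))
  && isSameCall(sym, "runtime.newobject")
  && coversWholeObject(dst0, dst1)
  => // needs to mutate call's Aux in place, which rules can't express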

josharian commented 6 years ago

Hmm. Another possible use for an rtcall pass: slicebytetostring.

Consider something like:

func f(b []byte) int {
    s := string(b[:4])
    return len(s) // or do something else non-escaping with s
}

This ends up compiling to runtime.slicebytetostring(SP+48, b.ptr, 4, b.cap). This is equivalent to memmove(SP+48, b.ptr, 4). And that can be simplified further in turn.

This optimization is hard to do during walk, because b[:4] gets rewritten into an autotmp during order, and we don't rediscover the constant length until SSA.
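
Concretely, once the length is the constant 4 and s provably does not escape, the call could be reduced to a plain 4-byte copy. A hand-written sketch of the equivalent (the [4]byte temporary stands in for the SP+48 stack slot):

func fEquivalent(b []byte) int {
    var tmp [4]byte     // stands in for the SP+48 stack slot
    copy(tmp[:], b[:4]) // the memmove that the call reduces to
    s := string(tmp[:]) // this conversion would also vanish in the optimized form
    return len(s)
}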

josharian commented 6 years ago

cc @randall77 for opinions in general about this

dotaheor commented 6 years ago

Please also do this for make: https://github.com/golang/go/issues/23905

randall77 commented 6 years ago

I'd like to see removal of zeroing when we know the object will be completely overwritten. It could be implemented as an upgraded deadstore pass: at each allocation, check whether all of its fields are shadowed by later stores. Pointer fields are a bit tricky because of the write barriers; I think we can only do it for completely scalar objects.
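
To illustrate the write-barrier problem (a sketch, not from the issue): when a pointer field is assigned, the barrier may also record the field's old value for the garbage collector, so that old value must already be a valid pointer or nil, which only holds if the allocation was zeroed.

type node struct {
    val  int
    next *node
}

func g(n *node) *node {
    p := new(node) // if zeroing were skipped, p.next would start out as junk
    p.val = 1
    p.next = n // the write barrier can observe the old value of p.next;
               // a garbage non-nil value there could confuse or crash the GC
    return p
}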

josharian commented 6 years ago

Re: ptr-containing objects, do you think https://github.com/golang/go/issues/24928 would do the trick?

randall77 commented 6 years ago

I'm not sure. I think it's probably faster to zero a whole object than to selectively zero fields. Unless there are very large sections which don't have pointers, but that seems rare, and detecting subsequent complete overwriting of such objects also seems hard.

josharian commented 5 years ago

Regarding the original topic, another approach is to generate different code in the front-end. See https://github.com/golang/go/issues/29446#issuecomment-450537283.

gopherbot commented 2 years ago

Change https://golang.org/cl/367496 mentions this issue: cmd/compile: add memclrelim SSA optimization pass

quasilyte commented 2 years ago

I mentioned some benchmark results in the CL comment.

Posting the benchmarks themselves here.

package example

import (
    "testing"
)

type smallStruct struct {
    f0 int64
    f1 float64
    f2 int32
    f3 int32
}

type averageStruct struct {
    f0 uint64
    flags [8]bool
    small [4]smallStruct
}

// noinline on the constructors below keeps each allocation in the
// callee, so the benchmarks measure the heap allocation (and its
// zeroing) rather than an inlined, possibly stack-allocated object.
//go:noinline
func newSmallStruct(f0 int64, f1 float64, f2, f3 int32) *smallStruct {
    return &smallStruct{
        f0: f0,
        f1: f1,
        f2: f2,
        f3: f3,
    }
}

//go:noinline
func newAverageStruct(v uint64, s0, s1, s2, s3 *smallStruct) *averageStruct {
    return &averageStruct{
        f0: v,
        flags: [8]bool{true, true, true, true, true, true, true, true},
        small: [4]smallStruct{*s0, *s1, *s2, *s3},
    }
}

//go:noinline
func newInt(v int) *int {
    p := new(int)
    *p = v
    return p
}

//go:noinline
func newIntSlice1(v0 int) []int {
    return []int{v0}
}

//go:noinline
func newIntSlice4(v0, v1, v2, v3 int) []int {
    return []int{v0, v1, v2, v3}
}

//go:noinline
func newIntSlice8(v0, v1, v2, v3, v4, v5, v6, v7 int) []int {
    return []int{v0, v1, v2, v3, v4, v5, v6, v7}
}

//go:noinline
func newIntSlice32(v0, v1, v2, v3, v4, v5, v6, v7 int) []int {
    return []int{
        v0, v1, v2, v3, v4, v5, v6, v7,
        v0, v1, v2, v3, v4, v5, v6, v7,
        v0, v1, v2, v3, v4, v5, v6, v7,
        v0, v1, v2, v3, v4, v5, v6, v7,
    }
}

func BenchmarkNewInt(b *testing.B) {
    for i := 0; i < b.N; i++ {
        newInt(i)
    }
}

func BenchmarkNewIntSlice1(b *testing.B) {
    for i := 0; i < b.N; i++ {
        newIntSlice1(i)
    }
}

func BenchmarkNewIntSlice4(b *testing.B) {
    for i := 0; i < b.N; i++ {
        newIntSlice4(i, 0, i, 0)
    }
}

func BenchmarkNewIntSlice8(b *testing.B) {
    for i := 0; i < b.N; i++ {
        newIntSlice8(i, 0, i, 0, i, 0, i, 0)
    }
}

func BenchmarkNewIntSlice32(b *testing.B) {
    for i := 0; i < b.N; i++ {
        newIntSlice32(i, 0, i, 0, i, 0, i, 0)
    }
}

func BenchmarkNewSmallStruct(b *testing.B) {
    for i := 0; i < b.N; i++ {
        newSmallStruct(int64(i), 1.5, 1, 2)
    }
}

func BenchmarkNewAverageStruct(b *testing.B) {
    var s [4]smallStruct
    for i := 0; i < b.N; i++ {
        newAverageStruct(uint64(i), &s[0], &s[1], &s[2], &s[3])
    }
}