josharian opened 6 years ago
We already inline `memmove` for small constant sizes in generic.rules. So if we introduce a `newobjectNoClr`, we could probably also rewrite `newobject` with rules, which may remove some of the overhead of a full pass.
There are two problems with doing it with rules.
(1) What we need to do is modify the `OpStaticCall`'s `Aux` value and leave the overall set of values unchanged; that's hard to do with rules. The first half you can fake by having a condition that has a side effect, but the second half is not currently possible in a clean way. Maybe we could have a magic `origv` RHS value or some such.
(2) For the `*[2]int` case, you need nested rules to check that both ints are overwritten. For the `*[3]int` case you need deeper nested rules. You see the problem. :)
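To make (2) concrete, here is a sketch in the spirit of the generic.rules notation. This is hypothetical and not a working rule; it only illustrates why the nesting depth grows with the number of fields:

```
// Hypothetical sketch: "both ints of a *[2]int are overwritten" means
// matching a Store whose memory argument is itself the other Store:
(Store dst0 v0 (Store dst1 v1 (StaticCall {newobject} ...)))
// For *[3]int, wrap one more Store around it, and so on --
// each field adds another level of nesting to the pattern.
```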
Hmm. Another possible use for an `rtcall` pass: `slicebytetostring`.
Consider something like:

```go
func f(b []byte) int {
	s := string(b[:4])
	return len(s) // or do something else non-escaping with s
}
```
This ends up compiling to `runtime.slicebytetostring(SP+48, b.ptr, 4, b.cap)`. This is equivalent to `memmove(SP+48, b.ptr, 4)`. And that can be simplified further in turn.
This optimization is hard to do during walk, because `b[:4]` gets rewritten into an autotmp during order, and we don't rediscover the constant length until SSA.
cc @randall77 for opinions in general about this
please also do this for `make`: https://github.com/golang/go/issues/23905
I'd like to see removal of zeroing if we know the object will be completely overwritten. It could be implemented using an upgraded deadstore pass: we check at each allocation whether all of its fields are shadowed. Pointer fields are a bit tricky because of the write barriers; I think we can only do it for completely scalar objects.
Re: ptr-containing objects, do you think https://github.com/golang/go/issues/24928 would do the trick?
I'm not sure. I think it's probably faster to zero a whole object than to selectively zero fields. Unless there are very large sections which don't have pointers, but that seems rare, and detecting subsequent complete overwriting of such objects also seems hard.
Regarding the original topic, another approach is to generate different code in the front-end. See https://github.com/golang/go/issues/29446#issuecomment-450537283.
Change https://golang.org/cl/367496 mentions this issue: cmd/compile: add memclrelim SSA optimization pass
I mentioned some benchmark results in the CL comment.
Posting the benchmarks themselves here.
```go
package example

import (
	"testing"
)

type smallStruct struct {
	f0 int64
	f1 float64
	f2 int32
	f3 int32
}

type averageStruct struct {
	f0    uint64
	flags [8]bool
	small [4]smallStruct
}

//go:noinline
func newSmallStruct(f0 int64, f1 float64, f2, f3 int32) *smallStruct {
	return &smallStruct{
		f0: f0,
		f1: f1,
		f2: f2,
		f3: f3,
	}
}

//go:noinline
func newAverageStruct(v uint64, s0, s1, s2, s3 *smallStruct) *averageStruct {
	return &averageStruct{
		f0:    v,
		flags: [8]bool{true, true, true, true, true, true, true, true},
		small: [4]smallStruct{*s0, *s1, *s2, *s3},
	}
}

//go:noinline
func newInt(v int) *int {
	p := new(int)
	*p = v
	return p
}

//go:noinline
func newIntSlice1(v0 int) []int {
	return []int{v0}
}

//go:noinline
func newIntSlice4(v0, v1, v2, v3 int) []int {
	return []int{v0, v1, v2, v3}
}

//go:noinline
func newIntSlice8(v0, v1, v2, v3, v4, v5, v6, v7 int) []int {
	return []int{v0, v1, v2, v3, v4, v5, v6, v7}
}

//go:noinline
func newIntSlice32(v0, v1, v2, v3, v4, v5, v6, v7 int) []int {
	return []int{
		v0, v1, v2, v3, v4, v5, v6, v7,
		v0, v1, v2, v3, v4, v5, v6, v7,
		v0, v1, v2, v3, v4, v5, v6, v7,
		v0, v1, v2, v3, v4, v5, v6, v7,
	}
}

func BenchmarkNewInt(b *testing.B) {
	for i := 0; i < b.N; i++ {
		newInt(i)
	}
}

func BenchmarkNewIntSlice1(b *testing.B) {
	for i := 0; i < b.N; i++ {
		newIntSlice1(i)
	}
}

func BenchmarkNewIntSlice4(b *testing.B) {
	for i := 0; i < b.N; i++ {
		newIntSlice4(i, 0, i, 0)
	}
}

func BenchmarkNewIntSlice8(b *testing.B) {
	for i := 0; i < b.N; i++ {
		newIntSlice8(i, 0, i, 0, i, 0, i, 0)
	}
}

func BenchmarkNewIntSlice32(b *testing.B) {
	for i := 0; i < b.N; i++ {
		newIntSlice32(i, 0, i, 0, i, 0, i, 0)
	}
}

func BenchmarkNewSmallStruct(b *testing.B) {
	for i := 0; i < b.N; i++ {
		newSmallStruct(int64(i), 1.5, 1, 2)
	}
}

func BenchmarkNewAverageStruct(b *testing.B) {
	var s [4]smallStruct
	for i := 0; i < b.N; i++ {
		newAverageStruct(uint64(i), &s[0], &s[1], &s[2], &s[3])
	}
}
```
Consider:
The first line gets translated into `x = newobject(type-of-int64)`, which calls `mallocgc` with a `needzero` argument of true. But it doesn't need zeroing: it has no pointers, and data gets written to the whole thing. Same holds for:
and more interestingly:
We could detect such scenarios in the SSA backend and replace the call to `newobject` with a call to a (newly created) `newobjectNoClr`, which is identical to `newobject` except that it passes false to `mallocgc` for `needzero`.

Aside: The SSA backend already understands `newobject` a little. It removes the pointless zero assignment from:

although not from:
Converting to `newobjectNoClr` would probably require a new SSA pass, in which we put values in store order, detect calls to `newobject`, and then check whether subsequent stores obviate the need for zeroing. It could also, at the same time, eliminate unnecessary zeroing that the existing rewrite rules don't cover.

This new SSA pass might also someday grow to understand and rewrite e.g. calls to `memmove` and `memequal` with small constant sizes.

It is not obvious to me that this pass would pull its weight, compilation-time-wise. Needs experimentation. Filing an issue so that I don't forget about it. :)