cmd/compile: iterators cannot be composed without incurring extra heap allocations (val + func literals)

coxley commented 1 month ago

Go version

go 1.23 / go1.24-fc968d4 / go1.24-165bf24

Output of `go env` in your module/workspace:

GO111MODULE=''
GOARCH='arm64'
GOBIN=''
GOCACHE='/Users/coxley/Library/Caches/go-build'
GOENV='/Users/coxley/Library/Application Support/go/env'
GOEXE=''
GOEXPERIMENT=''
GOFLAGS=''
GOHOSTARCH='arm64'
GOHOSTOS='darwin'
GOINSECURE=''
GOMODCACHE='/Users/coxley/.go/pkg/mod'
GOOS='darwin'
GOPATH='/Users/coxley/.go'
GOPROXY='https://proxy.golang.org,direct'
GOROOT='/opt/homebrew/Cellar/go/1.23.1/libexec'
GOSUMDB='off'
GOTMPDIR=''
GOTOOLCHAIN='local'
GOTOOLDIR='/opt/homebrew/Cellar/go/1.23.1/libexec/pkg/tool/darwin_arm64'
GOVCS=''
GOVERSION='go1.23.1'
GODEBUG=''
GOTELEMETRY='local'
GOTELEMETRYDIR='/Users/coxley/Library/Application Support/go/telemetry'
GCCGO='gccgo'
GOARM64='v8.0'
AR='ar'
CC='cc'
CXX='c++'
CGO_ENABLED='1'
GOMOD='/var/folders/3s/w_9thc5x5zj6d5vyf_9j9xw00000gn/T/tmp.vEBTr6BuI7/go.mod'
GOWORK=''
CGO_CFLAGS='-O2 -g'
CGO_CPPFLAGS=''
CGO_CXXFLAGS='-O2 -g'
CGO_FFLAGS='-O2 -g'
CGO_LDFLAGS='-O2 -g'
PKG_CONFIG='pkg-config'
GOGCCFLAGS='-fPIC -arch arm64 -pthread -fno-caret-diagnostics -Qunused-arguments -fmessage-length=0 -ffile-prefix-map=/var/folders/3s/w_9thc5x5zj6d5vyf_9j9xw00000gn/T/go-build3137110682=/tmp/go-build -gno-record-gcc-switches -fno-common'

What did you do?

Runnable benchmark: https://go.dev/play/p/1J9H8AL2aGM

Created two types:

type Slice[T any] []T

type Collection[T any] struct {
  underlying Slice[T]
}

This simulates an actual use-case I have: an AVL tree being embedded in a graph data structure. Iterating from a higher-order type should be as performant as iterating from the foundational one.

When moving away from callback-style iteration (avl.Each(func(key int64, val *Edge) { /* ... */ })) to idiomatic iterators, I noticed a non-trivial performance regression. Granted, some of this iteration is in a tight-loop so maybe it isn't affecting every program, but after digging I noticed that higher-order iterators' func literals escape to the heap while the lowest-order one's do not.

The benchmark link above demonstrates. I'm not sure if this is #66469, #69015, or simply related. But the first issue didn't get any traction which was before the 1.23 release.

What did you see happen?

A basic iter.Seq style iterator performs 0 allocations.
Another iter.Seq style iterator on that dispatches to the other incurs allocations — 6 in the go.dev/play example.

What did you expect to see?

Equivalent performance of both.

The only current option is to duplicate iteration logic instead of composing iterators.

gabyhelp commented 1 month ago

Related Issues and Documentation

_{(Emoji vote if this was helpful or unhelpful; more detailed feedback welcome in this discussion.)}

coxley commented 1 month ago

I had hoped "cmd/compile: fix wrong esacpe analysis for rangefunc" would fix it, but it appears the composed version has the same cost in both go 1.23.1 and master.

> go version
go version go1.23.1 darwin/arm64

> go build -gcflags='-m=3' main.go 2>&1 | grep 'inline Collection\[go.shape.int\].Values'
./main.go:59:6: can inline Collection[go.shape.int].Values with cost 17 as: method(Collection[go.shape.int]) func(*[8]uintptr) iter.Seq[go.shape.int] { return func literal }
./main.go:60:9: cannot inline Collection[go.shape.int].Values.func1: function too complex: cost 189 exceeds budget 80
./main.go:61:3: cannot inline Collection[go.shape.int].Values.func1-range1: function too complex: cost 142 exceeds budget 80
./main.go:46:9: can inline Collection[go.shape.int].Values.func1.Slice[go.shape.int].All.1 with cost 72 as: func(func(int, go.shape.int) bool) { for loop }

> /Users/coxley/projects/go/bin/go version
go version devel go1.24-165bf24 2024-09-19 00:40:50 +0000 darwin/arm64

> /Users/coxley/projects/go/bin/go build -gcflags='-m=3' main.go 2>&1 | grep 'inline Collection\[go.shape.int\].Values'
./main.go:59:6: can inline Collection[go.shape.int].Values with cost 17 as: method(Collection[go.shape.int]) func(*[8]uintptr) iter.Seq[go.shape.int] { return func literal }
./main.go:60:9: cannot inline Collection[go.shape.int].Values.func1: function too complex: cost 189 exceeds budget 80
./main.go:61:3: cannot inline Collection[go.shape.int].Values.func1-range1: function too complex: cost 142 exceeds budget 80
./main.go:46:9: can inline Collection[go.shape.int].Values.func1.Slice[go.shape.int].All.1 with cost 72 as: func(func(int, go.shape.int) bool) { for loop }

nemith commented 1 month ago

Moving the underlying iterator out of the closure seems to remove the allocations but it is still a lot slower.

func (s Collection[T]) Values() iter.Seq[T] {
    underlyingIter := s.underlying.All()
    return func(yield func(T) bool) {
        for _, v := range underlyingIter {
            if !yield(v) {
                return
            }
        }
    }
}

$ go run main.go
Direct:        5           200.0 ns/op        0 B/op           0 allocs/op
Nested:        5          5033 ns/op        0 B/op         0 allocs/op

cuonglm commented 1 month ago

I had hoped "cmd/compile: fix wrong esacpe analysis for rangefunc" would fix it, but it appears the composed version has the same cost in both go 1.23.1 and master.

The fix is about wrong analysis of escape analysis, so it won't change the inlining cost.

dr2chase commented 2 weeks ago

Not entirely sure what's gone wrong here -- when compiled "-m", it helpfully reports that it can inline all the things with recent closure-inline-enhancing CLs applied, but it is not faster.

Correction -- it is NOT inlining all the things.

dr2chase commented 2 weeks ago

Here's another benchmark -- if nothing else, assigning to "_" in the loop body gets optimized away. I'd also rather write these as go-test benchmarks, which reduces the need to check to see if something non-standard is going on.

package foo_test

import (
    "iter"
    "testing"
)

var u int

func BenchmarkDirect(b *testing.B) {
    s := Slice[int](make([]int, 64))
    for range b.N {
        for _, v := range s.All() {
            u = v
        }
    }
}

func BenchmarkNestedA(b *testing.B) {
    collection := Collection[int]{Slice[int](make([]int, 64))}
    for range b.N {
        for v := range collection.ValuesA() {
            u = v
        }
    }
}

func BenchmarkNestedB(b *testing.B) {
    collection := Collection[int]{Slice[int](make([]int, 64))}
    for range b.N {
        for v := range collection.ValuesB() {
            u = v
        }
    }
}

func BenchmarkNestedACall(b *testing.B) {
    collection := Collection[int]{Slice[int](make([]int, 64))}
    yield := func(i int) bool {
        u = i
        return true
    }
    for range b.N {
        collection.ValuesA()(yield)
    }
}

func BenchmarkNestedBCall(b *testing.B) {
    collection := Collection[int]{Slice[int](make([]int, 64))}
    yield := func(i int) bool {
        u = i
        return true
    }
    for range b.N {
        collection.ValuesB()(yield)
    }
}

type Slice[T any] []T

func (s Slice[T]) All() iter.Seq2[int, T] {
    return func(yield func(int, T) bool) {
        for i, v := range s {
            if !yield(i, v) {
                return
            }
        }
    }
}

type Collection[T any] struct {
    underlying Slice[T]
}

func (s Collection[T]) ValuesA() iter.Seq[T] {
    return func(yield func(T) bool) {
        for _, v := range s.underlying.All() {
            if !yield(v) {
                return
            }
        }
    }
}

func (s Collection[T]) ValuesB() iter.Seq[T] {
    underlyingIter := s.underlying.All()
    return func(yield func(T) bool) {
        for _, v := range underlyingIter {
            if !yield(v) {
                return
            }
        }
    }
}

gopherbot commented 2 weeks ago

Change https://go.dev/cl/622415 mentions this issue: cmd/compile: spell "go.runtime" correctly for inline "cheap" test

coxley commented 2 weeks ago

@dr2chase Sorry for the non-standard benchmark — only did that so it was viewable in go.dev/play.

dr2chase commented 4 days ago

This is a dupe of #69411 -- same bug, same root cause. The inliner needs to get better at closures.

golang / go