
runtime: stack growth during critical section harms performance #18020

Open · gyuho opened this issue 7 years ago

gyuho commented 7 years ago

Hello,

etcd observed a significant performance regression with Go 1.7.3.

Details follow below; see also https://github.com/coreos/etcd/issues/6876.


What version of Go are you using (go version)?

Go 1.6.3 and Go 1.7.3


What operating system and processor architecture are you using (go env)?

$ go env
GOARCH="amd64"
GOBIN=""
GOEXE=""
GOHOSTARCH="amd64"
GOHOSTOS="linux"
GOOS="linux"
GOPATH="/home/gyuho/go"
GORACE=""
GOROOT="/usr/local/go"
GOTOOLDIR="/usr/local/go/pkg/tool/linux_amd64"
CC="gcc"
GOGCCFLAGS="-fPIC -m64 -pthread -fmessage-length=0 -fdebug-prefix-map=/tmp/go-build467209749=/tmp/go-build -gno-record-gcc-switches"
CXX="g++"
CGO_ENABLED="1"

Google Cloud VM with 8 vCPUs, 16 GB memory, ubuntu-1610

gcloud compute instances create test-machine --custom-cpu=8 --custom-memory=16 --image-family=ubuntu-1610 --image-project=ubuntu-os-cloud --boot-disk-size=150 --boot-disk-type="pd-ssd" --zone us-west1-a


What did you do?

Build etcd binaries and tools (or download prebuilt binaries)

Build the etcd binaries with different Go versions at etcd commit https://github.com/coreos/etcd/commit/faeeb2fc7514c5caf7a9a0cc03ac9ee2ff94438b.

The Go version is the only difference!

mkdir -p ${GOPATH}/src/github.com/coreos
cd ${GOPATH}/src/github.com/coreos
git clone https://github.com/coreos/etcd.git
cd ${GOPATH}/src/github.com/coreos/etcd
git reset --hard faeeb2fc7514c5caf7a9a0cc03ac9ee2ff94438b

./build
./bin/etcd --version

go build -v ./cmd/tools/benchmark
./benchmark help

Or

wget https://storage.googleapis.com/etcd/tip/etcd-v3.0.2-go1.6.3
wget https://storage.googleapis.com/etcd/tip/etcd-v3.0.2-go1.7.3
wget https://storage.googleapis.com/etcd/tip/etcdctl-master-go1.7.3
wget https://storage.googleapis.com/etcd/tip/benchmark-master-go1.7.3

sudo chmod +x ./etcd-v3.0.2-go1.6.3      && ./etcd-v3.0.2-go1.6.3 --version
sudo chmod +x ./etcd-v3.0.2-go1.7.3      && ./etcd-v3.0.2-go1.7.3 --version
sudo chmod +x ./etcdctl-master-go1.7.3   && ./etcdctl-master-go1.7.3 --version
sudo chmod +x ./benchmark-master-go1.7.3 && ./benchmark-master-go1.7.3 help

Run a single-node cluster and the benchmark

etcd server built with Go 1.6.3 + master-branch client built with Go 1.7.3

rm -rf test.etcd
./etcd-v3.0.2-go1.6.3 --name test \
    --listen-client-urls http://localhost:2379 \
    --advertise-client-urls http://localhost:2379 \
    --listen-peer-urls http://localhost:2380 \
    --initial-advertise-peer-urls http://localhost:2380 \
    --initial-cluster test=http://localhost:2380 \
    --initial-cluster-token test-token \
    --initial-cluster-state new \
    --enable-pprof

ETCDCTL_API=3 ./etcdctl-master-go1.7.3 --endpoints=localhost:2379 put foo bar
ETCDCTL_API=3 ./etcdctl-master-go1.7.3 --endpoints=localhost:2379 get foo --consistency=s
./benchmark-master-go1.7.3 --endpoints=localhost:2379 --conns=100 --clients=1000 range foo --consistency=s --total=200000

etcd server built with Go 1.7.3 + master-branch client built with Go 1.7.3

rm -rf test.etcd
./etcd-v3.0.2-go1.7.3 --name test \
    --listen-client-urls http://localhost:2379 \
    --advertise-client-urls http://localhost:2379 \
    --listen-peer-urls http://localhost:2380 \
    --initial-advertise-peer-urls http://localhost:2380 \
    --initial-cluster test=http://localhost:2380 \
    --initial-cluster-token test-token \
    --initial-cluster-state new \
    --enable-pprof

ETCDCTL_API=3 ./etcdctl-master-go1.7.3 --endpoints=localhost:2379 put foo bar
ETCDCTL_API=3 ./etcdctl-master-go1.7.3 --endpoints=localhost:2379 get foo --consistency=s
./benchmark-master-go1.7.3 --endpoints=localhost:2379 --conns=100 --clients=1000 range foo --consistency=s --total=200000


What did you expect to see?

etcd with Go 1.6.3 shows:

Summary:
  Total:    55.4612 secs.
  Requests/sec: 36061.2548


What did you see instead?

etcd with Go 1.7.3 is 2x slower:

Summary:
  Total:    105.3251 secs.
  Requests/sec: 18988.8279

We expected Go 1.7.3 to be comparable or better.


Here's our profiling data

Note: etcd vendors an old copy of golang.org/x/net/http2, but the same behavior happens with the latest http2; see https://github.com/coreos/etcd/issues/6876#issuecomment-262292981.

Could anybody help us debug this on the Go runtime side? Please let me know if more information is needed.

Thanks a lot in advance!

mvdan commented 7 years ago

Did you try tip?

gyuho commented 7 years ago

I did; tip has slightly better performance than Go 1.7.3, but still much slower than Go 1.6.3.

rsc commented 7 years ago

Looked at the pb.gz profiles in coreos/etcd#6876. Slightly confusing because the Go 1.7 profile ran 2X as long as the Go 1.6 profile so the total times are not directly comparable. However, in Go 1.6 the scheduler ran for about 2s, while in Go 1.7 the scheduler ran for about 20s, so about 5X more. That probably has something to do with it.

http2 probably does not have anything to do with it. From the profile it seems clear that you are running a vendored copy of http2, not the one from the Go distribution itself.

Go 1.6:

(screenshot: Go 1.6 CPU profile graph)

Go 1.7:

(screenshot: Go 1.7 CPU profile graph)

We might be able to fix this in Go 1.8 if the fix is simple.

/cc @aclements @RLH
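
For anyone trying to compare runs of different lengths, a runtime execution trace timestamps scheduler and GC events, so two runs can be lined up without normalizing by duration. A minimal, generic sketch (not code from this thread; runWorkload is a hypothetical placeholder for the work being measured):

// tracecapture.go: capture an execution trace around a workload and inspect
// it afterwards with `go tool trace run.trace`.
package main

import (
	"log"
	"os"
	"runtime/trace"
	"time"
)

// runWorkload stands in for the code under test (e.g. driving benchmark
// requests against a local server).
func runWorkload() {
	time.Sleep(100 * time.Millisecond)
}

func main() {
	f, err := os.Create("run.trace")
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	// trace.Start begins emitting scheduler, GC, and syscall events to f.
	if err := trace.Start(f); err != nil {
		log.Fatal(err)
	}
	defer trace.Stop()

	runWorkload()
}

The trace viewer's goroutine analysis breaks out scheduler wait time per goroutine, which is the kind of signal the CPU profiles above only show in aggregate.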

gyuho commented 7 years ago

I had used different durations to test the same workloads, which makes the profiles confusing to compare.

So I collected pprof data over the same 30-second duration:

./benchmark-master-go1.7.3 --endpoints=localhost:2379 --conns=100 --clients=1000 range foo --consistency=s --total=2000000

# Go 1.6.3
Summary:
  Total:    55.5685 secs.
  Requests/sec: 35991.5985

# Go 1.7.3
Summary:
  Total:    105.6422 secs.
  Requests/sec: 18931.8332

And here are the binaries and pprof data, if that helps.

Go 1.6.3

wget https://storage.googleapis.com/etcd/tip/etcd-v3.0.2-go1.6.3

Attached: etcd-v3.0.2-go1.6.3-serializable-reads-00.pdf, pprof.localhost:2379.samples.cpu.001.pb.gz

Go 1.7.3

wget https://storage.googleapis.com/etcd/tip/etcd-v3.0.2-go1.7.3

Attached: etcd-v3.0.2-go1.7.3-serializable-reads-00.pdf, pprof.etcd-v3.0.2-go1.7.3.localhost:2379.samples.cpu.001.pb.gz

Thanks a lot!

timothysc commented 7 years ago

/cc @smarterclayton @wojtek-t

bradfitz commented 7 years ago

I haven't seen any update on this bug about anything Go 1.8-related, so I'm going to kick this down the road further.

Please try Go 1.8 and file bugs about any performance problems, thanks!

In the future, please start your trials of new Go versions earlier. Testing the move from Go 1.6 to Go 1.7 doesn't help us much when we're just about done with Go 1.8.

gyuho commented 7 years ago

@bradfitz Thanks!

I ran the same tests with go1.6.4 vs. go1.8beta1, and got the same results:

# all single-node cluster
go1.6.4     took   49.8130 secs.
go1.8beta1  took   100.3238 secs.

Here are benchmark binaries and profiled data:

go1.6.4: etcd-v3.0.2-go1.6.4-2M-serializable-reads-2016121501.pdf, pprof.localhost-2379.samples.cpu.001.pb.gz

go1.8beta1: etcd-v3.0.2-go1.8beta1-2M-serializable-reads-2016121501.pdf, pprof.etcd-v3.0.2-go1.8beta1.localhost-2379.samples.cpu.001.pb.gz

wget https://storage.googleapis.com/etcd/tip/etcd-v3.0.2-go1.6.4
wget https://storage.googleapis.com/etcd/tip/etcd-v3.0.2-go1.8beta1
wget https://storage.googleapis.com/etcd/tip/etcdctl-master-go1.8beta1
wget https://storage.googleapis.com/etcd/tip/benchmark-master-go1.8beta1

sudo chmod +x ./etcd-v3.0.2-go1.6.4         && ./etcd-v3.0.2-go1.6.4 --version
sudo chmod +x ./etcd-v3.0.2-go1.8beta1      && ./etcd-v3.0.2-go1.8beta1 --version
sudo chmod +x ./etcdctl-master-go1.8beta1   && ./etcdctl-master-go1.8beta1 --version
sudo chmod +x ./benchmark-master-go1.8beta1 && ./benchmark-master-go1.8beta1 help

bradfitz commented 7 years ago

Leaving for @aclements to decide what to do here, and when.

TocarIP commented 7 years ago

I've figured out what's going on. I was able to reproduce this issue only with GOMAXPROCS >= 3. It started happening with https://github.com/golang/go/commit/a6fb2aede7c8d47b4d913eb83fa45bbeca76c433, which enabled the new SSA compiler. Because of that, stack frames are now smaller, so stack growth happens while locks are held; previously, due to the larger stack frames, stack growth happened outside the critical section.

I verified this by enabling the blocking profiler, and indeed in the bad case much more time is spent waiting on locks: 214.42 mins of 215.35 mins total (99.57%) in the bad case vs. 2376 s of 2376.23 s total (100%) in the good case. From looking at the profiles, it seems the guilty lock is the one in coreos/etcd/etcdserver.(*authApplierV3).Apply, but I didn't verify this.

I've verified that the following approaches make this regression disappear:

1) Bumping the default stack size to 8 KB, or changing the growth factor to 4.
2) Putting a large array on the stack to force stack growth earlier, e.g.:

+var q [160]int
+
 func metricsUnaryInterceptor(ctx context.Context, req interface{}, info *grpc.UnaryServerInfo, handler grpc.UnaryHandler) (resp interface{}, err error) {
        service, method := splitMethodName(info.FullMethod)
        receivedCounter.WithLabelValues(service, method).Inc()
-
+       var ccc [160]int
        start := time.Now()
        resp, err = handler(ctx, req)
        if err != nil {
                failedCounter.WithLabelValues(service, method, grpc.Code(err).String()).Inc()
        }
        handlingDuration.WithLabelValues(service, method).Observe(time.Since(start).Seconds())
-
+       q = ccc
        return resp, err
 }

However, both of these look like hacks to me. As for a real solution, I think something like @aclements' suggestion in https://github.com/golang/go/issues/18138, about allocating a larger stack for functions that are known to require it, is the right way.
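
To make the failure mode concrete, here is a minimal micro-benchmark sketch (not etcd code; the package name stackgrow, the function deep, and the two benchmarks are invented for illustration). Each iteration spawns a fresh goroutine, whose stack starts at the small initial size, so the first deep call chain has to grow it; the only difference between the two benchmarks is whether that growth happens inside or outside the mutex-protected section.

// stackgrow_test.go: contrast stack growth inside vs. outside a critical
// section. New goroutines start with a small stack, so deep() forces the
// runtime to grow (copy) it on first use.
package stackgrow

import (
	"sync"
	"testing"
)

var (
	mu   sync.Mutex
	sink int
)

//go:noinline
func deep(n int) int {
	// Each frame holds a modest buffer; a chain of ~64 frames crosses several
	// stack-size boundaries, forcing the runtime to copy the stack.
	var buf [256]byte
	buf[0] = byte(n)
	if n == 0 {
		return int(buf[0])
	}
	return deep(n-1) + int(buf[0])
}

func BenchmarkGrowInsideLock(b *testing.B) {
	var wg sync.WaitGroup
	for i := 0; i < b.N; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			mu.Lock()
			sink += deep(64) // the stack grows while mu is held
			mu.Unlock()
		}()
	}
	wg.Wait()
}

func BenchmarkGrowOutsideLock(b *testing.B) {
	var wg sync.WaitGroup
	for i := 0; i < b.N; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			v := deep(64) // grow this goroutine's stack before locking...
			mu.Lock()
			sink += v // ...so the critical section itself stays short
			mu.Unlock()
		}()
	}
	wg.Wait()
}

Run with something like go test -bench Grow -cpu 8; the expectation, matching the diagnosis above, is that the inside-lock variant loses noticeably more throughput as GOMAXPROCS rises, because every stack copy extends the time the lock is held.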

xiang90 commented 7 years ago

@TocarIP

From looking at profiles, it seems like the guilty lock is the one in coreos/etcd/etcdserver.(*authApplierV3).Apply, but I didn't verify this.

I can confirm this. We also suspected that more serious lock contention was being triggered by some runtime change. Once we changed the locking mechanism, the problem disappeared.

But as you mentioned, this issue is probably still worth fixing.

gopherbot commented 7 years ago

CL https://golang.org/cl/45142 mentions this issue.

aclements commented 7 years ago

@TocarIP thanks for the excellent analysis.

If you can still reproduce this, I'd be curious what the effect of CL 45142 is. I'm really not sure what the right answer here is, but that hack might mitigate things from both this issue and #18138.

I've been thinking for a while that the latency of stack growth can be problematic, but this is an interesting case where it's not just a problem with large stacks. The old segmented stacks approach had less of this problem, but it had other anomalies. I wonder if it's possible to efficiently but incrementally move a stack.

TocarIP commented 7 years ago

I've tested CL 45142, and it doesn't help

josharian commented 7 years ago

At my request, @TocarIP also checked CL 43150 (thanks!). It's not a deep fix, but it does at least help a little--from 13.4s to 12.4s.

josharian commented 6 years ago

I've done a bit of work on stack copying this cycle. Much of it has landed; https://golang.org/cl/109001 is under review. I have one more I hope to mail soon, but no promises. If anyone can still reproduce this readily, I'd be curious to hear whether / how much tip has improved, and whether the linked and future CLs help.

evanj commented 2 years ago

From briefly reading this issue, it sounds like this might be resolved with https://github.com/golang/go/commit/016d7552138077741a9c3fdadc73c0179f5d3ff7 which should be included in Go 1.19.