golang / go

The Go programming language
https://go.dev

runtime: new goroutines can spend excessive time in morestack #18138

Open petermattis opened 7 years ago

petermattis commented 7 years ago

What version of Go are you using (go version)?

go version devel +41908a5 Thu Dec 1 02:54:21 2016 +0000 darwin/amd64 a.k.a go1.8beta1

What operating system and processor architecture are you using (go env)?

GOARCH="amd64"
GOBIN=""
GOEXE=""
GOHOSTARCH="amd64"
GOHOSTOS="darwin"
GOOS="darwin"
GOPATH="/Users/pmattis/Development/go"
GORACE=""
GOROOT="/Users/pmattis/Development/go-1.8"
GOTOOLDIR="/Users/pmattis/Development/go-1.8/pkg/tool/darwin_amd64"
GCCGO="gccgo"
CC="clang"
GOGCCFLAGS="-fPIC -m64 -pthread -fno-caret-diagnostics -Qunused-arguments -fmessage-length=0 -fdebug-prefix-map=/var/folders/qc/fpqpgdqd167c70dtc6840xxh0000gn/T/go-build385423377=/tmp/go-build -gno-record-gcc-switches -fno-common"
CXX="clang++"
CGO_ENABLED="1"
PKG_CONFIG="pkg-config"
CGO_CFLAGS="-g -O2"
CGO_CPPFLAGS=""
CGO_CXXFLAGS="-g -O2"
CGO_FFLAGS="-g -O2"
CGO_LDFLAGS="-g -O2"

What did you do?

A recent change to github.com/cockroachdb/cockroach replaced a synchronous call with one wrapped in a goroutine. This small change resulted in a significant slowdown in some benchmarks. The slowdown was traced to additional time being spent in runtime.morestack. The problematic goroutines all hit a single gRPC entrypoint, Server.Batch, and the code paths that fan out from this entrypoint tend to use an excessive amount of stack due to an over-reliance on passing and returning by value instead of using pointers. Typical calls use 16-32 KB of stack.

The expensive part of runtime.morestack is the adjustment of existing pointers on the stack. And due to the incremental nature of the stack growth, I can see the stack growing in 4 steps from 2 KB to 32 KB. So we experimented with a hack to pre-grow the stack. Voila, the performance penalty of the change disappeared:

name               old time/op  new time/op  delta
KVInsert1_SQL-8     339µs ± 2%   312µs ± 1%   -7.89%  (p=0.000 n=10+10)
KVInsert10_SQL-8    485µs ± 2%   471µs ± 1%   -2.81%  (p=0.000 n=10+10)
KVInsert100_SQL-8  1.36ms ± 0%  1.35ms ± 0%   -0.95%  (p=0.000 n=10+10)
KVUpdate1_SQL-8     535µs ± 1%   487µs ± 1%   -9.02%   (p=0.000 n=10+9)
KVUpdate10_SQL-8    777µs ± 1%   730µs ± 1%   -6.03%   (p=0.000 n=10+9)
KVUpdate100_SQL-8  2.69ms ± 1%  2.66ms ± 1%   -1.16%  (p=0.000 n=10+10)
KVDelete1_SQL-8     479µs ± 1%   429µs ± 2%  -10.43%   (p=0.000 n=9+10)
KVDelete10_SQL-8    676µs ± 1%   637µs ± 1%   -5.80%    (p=0.000 n=9+9)
KVDelete100_SQL-8  2.23ms ± 5%  2.18ms ± 4%     ~     (p=0.105 n=10+10)
KVScan1_SQL-8       216µs ± 5%   179µs ± 1%  -17.12%  (p=0.000 n=10+10)
KVScan10_SQL-8      233µs ± 1%   201µs ± 1%  -13.76%  (p=0.000 n=10+10)
KVScan100_SQL-8     463µs ± 1%   437µs ± 0%   -5.64%   (p=0.000 n=10+8)

old shows benchmarks gathered using go1.8beta1; new shows go1.8beta1 with the hack to pre-grow the stack. The hack is a call at the beginning of server.Batch to a growStack function:

var growStackGlobal = false

//go:noinline
func growStack() {
    // Goroutine stacks currently start at 2 KB in size. The code paths through
    // the storage package often need a stack that is 32 KB in size. The stack
    // growth is mildly expensive, making it useful to trick the runtime into
    // growing the stack early. Since goroutine stacks grow by doubling and
    // start at 2 KB, placing a 16 KB object on the stack early in the lifetime
    // of a goroutine forces the runtime to use a 32 KB stack for the
    // goroutine.
    var buf [16 << 10] /* 16 KB */ byte
    if growStackGlobal {
        // Make sure the compiler doesn't optimize away buf.
        for i := range buf {
            buf[i] = byte(i)
        }
    }
}
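
For illustration, a minimal sketch of how such a pre-grow call might be wired in; the handler signature and the batchInternal helper below are hypothetical stand-ins, not CockroachDB's actual code:

// Hypothetical handler: the only change is the growStack() call at the top, so
// the goroutine serving this RPC grows its stack once, while it is still nearly
// empty and cheap to copy, instead of in several incremental copies deep in the
// call path.
func (s *Server) Batch(req *BatchRequest) (*BatchResponse, error) {
    growStack()
    return s.batchInternal(req) // deep call paths now fit without further growth
}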

The question here is whether this hack is copacetic, and also to alert the runtime folks that there is a performance opportunity here. Note that growStackGlobal is not currently necessary, but I wanted to future-proof against the compiler deciding that buf is unnecessary.

Longer term, the stack usage under server.Batch should be reduced on our side. I'm guessing that we could get the stack usage down to 8-16 KB without too many contortions. But even with such reductions, being able to pre-grow the stack for a goroutine looks beneficial.

bradfitz commented 7 years ago

/cc @aclements @randall77

aclements commented 7 years ago

We've seen this a few times now. I'm not sure what the right answer is. My best thought so far is that the runtime could keep track of when particular go statements always lead to stack growth right away (for some value of "right away" and "always") and learn to start goroutines from that site with a larger stack. Of course, it would be hard to make this behavior predictable, but perhaps it would still be less surprising than the current behavior. If the runtime did learn to start a goroutine with a larger stack, it would still need a signal to learn if the stack should get smaller again, but we could do that efficiently by allocating the larger stack but setting the stack bounds to something smaller. Then the runtime could still observe whether or not the stack needs to grow, but the actual growth would be basically free until it reached the size of the allocation.

@randall77, thoughts, ideas?

/cc @RLH

mrjrieke commented 7 years ago

I like @petermattis's idea of being able to hint the stack size on a per-goroutine basis, although this implies developers have the know-how to identify and provide size estimates accurately. Could this be done with a compiler directive?

bradfitz commented 7 years ago

We don't want compiler directives in code. We have some used by the runtime out of necessity, but they're gross. Go prefers simplicity over tons of knobs.

petermattis commented 7 years ago

Yes, please just make my code magically faster as you've been doing for the last several Go releases.

mrjrieke commented 7 years ago

I generally agree with not having compiler directives ... magic is nice, although they (compiler directives) do exist even in Go. It's an interesting opportunity either way you decide.

mrjrieke commented 7 years ago

@bradfitz, your comment prompted me to look for the go guiding principles ( https://golang.org/doc/faq#principles). Thanks @adg as well for nicely worded principles.

gopherbot commented 7 years ago

CL https://golang.org/cl/45142 mentions this issue.

aclements commented 7 years ago

@petermattis (or anyone who has a good reproducer for this), would you be able to try https://go-review.googlesource.com/45142? It's a trivial hack, but it might actually do the trick. I haven't benchmarked it on anything, so it may also slow things down.

petermattis commented 7 years ago

@aclements I'll try and test either tomorrow or next week.

petermattis commented 7 years ago

@aclements Applying that patch to go1.8.3 resulted in no benefit (this is with the growStack hack disabled):

~/Development/go/src/github.com/cockroachdb/cockroach/pkg/sql master benchstat out.old out.new
name                old time/op  new time/op  delta
KV/Insert1_SQL-8     363µs ± 3%   369µs ± 2%  +1.43%  (p=0.043 n=10+10)
KV/Insert10_SQL-8    583µs ± 0%   581µs ± 1%    ~     (p=0.113 n=10+9)
KV/Insert100_SQL-8  2.05ms ± 0%  2.05ms ± 1%    ~     (p=0.912 n=10+10)
KV/Update1_SQL-8     578µs ± 1%   577µs ± 1%    ~     (p=0.968 n=9+10)
KV/Update10_SQL-8    913µs ± 1%   914µs ± 1%    ~     (p=0.931 n=9+9)
KV/Update100_SQL-8  3.80ms ± 1%  3.87ms ± 5%  +1.90%  (p=0.019 n=10+10)
KV/Delete1_SQL-8     517µs ± 2%   518µs ± 2%    ~     (p=0.912 n=10+10)
KV/Delete10_SQL-8    813µs ± 2%   809µs ± 1%    ~     (p=0.280 n=10+10)
KV/Delete100_SQL-8  3.22ms ± 2%  3.26ms ± 3%    ~     (p=0.052 n=10+10)
KV/Scan1_SQL-8       217µs ± 1%   216µs ± 0%    ~     (p=0.090 n=9+10)
KV/Scan10_SQL-8      238µs ± 0%   238µs ± 1%    ~     (p=0.122 n=10+8)
KV/Scan100_SQL-8     454µs ± 0%   455µs ± 1%    ~     (p=0.089 n=10+10)

It is surprising to me that this didn't have any effect. Compare this to the growStack hack mentioned earlier:

~/Development/go/src/github.com/cockroachdb/cockroach/pkg/sql master benchstat out.old out.grow-stack
name                old time/op  new time/op  delta
KV/Insert1_SQL-8     363µs ± 3%   331µs ± 2%   -8.82%  (p=0.000 n=10+10)
KV/Insert10_SQL-8    583µs ± 0%   561µs ± 1%   -3.80%  (p=0.000 n=10+10)
KV/Insert100_SQL-8  2.05ms ± 0%  2.03ms ± 0%   -0.88%  (p=0.000 n=10+8)
KV/Update1_SQL-8     578µs ± 1%   532µs ± 1%   -7.94%  (p=0.000 n=9+10)
KV/Update10_SQL-8    913µs ± 1%   872µs ± 1%   -4.47%  (p=0.000 n=9+9)
KV/Update100_SQL-8  3.80ms ± 1%  3.75ms ± 1%   -1.36%  (p=0.000 n=10+10)
KV/Delete1_SQL-8     517µs ± 2%   458µs ± 2%  -11.54%  (p=0.000 n=10+10)
KV/Delete10_SQL-8    813µs ± 2%   765µs ± 1%   -5.91%  (p=0.000 n=10+10)
KV/Delete100_SQL-8  3.22ms ± 2%  3.16ms ± 1%   -2.01%  (p=0.000 n=10+10)
KV/Scan1_SQL-8       217µs ± 1%   194µs ± 1%  -10.44%  (p=0.000 n=9+10)
KV/Scan10_SQL-8      238µs ± 0%   216µs ± 1%   -9.36%  (p=0.000 n=10+10)
KV/Scan100_SQL-8     454µs ± 0%   431µs ± 1%   -4.92%  (p=0.000 n=10+9)

josharian commented 7 years ago

CL 43150 might help a little here.

aclements commented 7 years ago

Sorry, I made a silly mistake in CL 45142. Would you mind trying the new version of that CL?

petermattis commented 7 years ago

With your updated patch against go-tip (f363817f143a7cee0ccdb7833a99ac1e8f210197) there is an improvement:

~/Development/go/src/github.com/cockroachdb/cockroach/pkg/sql master benchstat out.old out.new
name              old time/op  new time/op  delta
KV/Scan1_SQL-8     243µs ± 1%   224µs ± 0%  -7.57%  (p=0.000 n=9+9)
KV/Scan10_SQL-8    263µs ± 0%   247µs ± 0%  -6.20%  (p=0.000 n=9+10)
KV/Scan100_SQL-8   463µs ± 0%   444µs ± 0%  -4.05%  (p=0.000 n=10+10)

But the improvement is still not as good as the growStack hack:

~/Development/go/src/github.com/cockroachdb/cockroach/pkg/sql master benchstat out.new out.grow-stack
name              old time/op  new time/op  delta
KV/Scan1_SQL-8     224µs ± 0%   219µs ± 0%  -2.24%  (p=0.000 n=9+9)
KV/Scan10_SQL-8    247µs ± 0%   240µs ± 1%  -2.59%  (p=0.000 n=10+10)
KV/Scan100_SQL-8   444µs ± 0%   439µs ± 0%  -1.06%  (p=0.000 n=10+9)

There is a little more performance if we increase the initial stack size to 32 KB:

~/Development/go/src/github.com/cockroachdb/cockroach/pkg/sql master benchstat out.old out.new2
name              old time/op  new time/op  delta
KV/Scan1_SQL-8     243µs ± 1%   209µs ± 1%  -13.76%  (p=0.000 n=9+9)
KV/Scan10_SQL-8    263µs ± 0%   232µs ± 2%  -11.61%  (p=0.000 n=9+10)
KV/Scan100_SQL-8   463µs ± 0%   445µs ± 4%   -3.86%  (p=0.000 n=10+9)

Interestingly, all of these timings are lower than with go1.8.3.

petermattis commented 7 years ago

Interestingly, all of these timings are lower than with go1.8.3.

Nothing to see here. This appears to be due to a change on our code between what I tested earlier today and now.

petermattis commented 7 years ago

I did some more testing of this patch and the performance improvements carry over to production settings. morestack disappears from profiles. Note this is using a version of the patch which uses a 32KB initial stack size.

petermattis commented 7 years ago

It is early in the 1.10 cycle and I wanted to bring this issue forward again. See https://github.com/cockroachdb/cockroach/issues/17242 for a graph showing the benefit of a larger initial stack size.

petermattis commented 7 years ago

Is there any update on this issue? A larger initial goroutine stack size provides a nice performance boost for our system.

richardartoul commented 6 years ago

@petermattis We ran into a similar issue in production where adding a single extra function call (few lines of code, nothing heavy) increased our end-to-end latency by a few seconds. We narrowed down the issue to an increase in runtime.morestack. We ended up resolving the issue with this worker pool: https://github.com/m3db/m3x/blob/master/sync/pooled_worker_pool.go
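
For reference, a minimal sketch of the pooled-worker idea (this is not the m3x implementation; the names are made up): a fixed set of long-lived goroutines reuse their already-grown stacks instead of paying morestack on every freshly started goroutine.

type workerPool struct {
    work chan func()
}

// newWorkerPool starts n long-lived workers that process submitted tasks.
func newWorkerPool(n int) *workerPool {
    p := &workerPool{work: make(chan func())}
    for i := 0; i < n; i++ {
        go func() {
            for f := range p.work {
                f() // the worker's stack stays grown across tasks
            }
        }()
    }
    return p
}

// Go schedules f on one of the pooled workers, blocking until one is free.
func (p *workerPool) Go(f func()) {
    p.work <- f
}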

I noticed that you said:

I can see the stack growing in 4 steps from 2 KB to 32 KB.

Did you have any way of programmatically measuring when the stack grows? I would love to be able to prove in production that the additional function we added was enough to push the stack over the edge into one more copy, but all I could do was make inferences based on the CPU profiles.

josharian commented 6 years ago

If you can reproduce in a non-production environment, you might want to try tip of Go 1.11. There were a few improvements made this cycle that could help a bit.

Again, in a non-production environment, you could simply add a println("grow") to the implementation of morestack in the runtime. In a production environment, you might want to atomically increment a package variable instead and read it out atomically in another goroutine.

You might also want to file an issue requesting a built-in pprof profile for stack size changes, although even if approved that couldn’t land until 1.12.

richardartoul commented 6 years ago

@josharian I'm happy to request a built-in pprof profile for stack size changes if you think that's possible. The main issue with the current CPU profiles is that you can see it's happening, but you can't trace what is triggering it.

josharian commented 6 years ago

Please file an issue, and we can discuss feasibility there. If you cc me I’ll pull in other relevant folks. Thanks!

petermattis commented 6 years ago

Did you have any way of programmatically measuring when the stack grows? I would love to be able to prove in production that the additional function we added was enough to push the stack over the edge into one more copy, but all I could do was make inferences based on the CPU profiles.

@richardartoul I used the following patch (against go1.9):

diff --git a/src/runtime/stack.go b/src/runtime/stack.go
index 525d0b14c1..4615defa67 100644
--- a/src/runtime/stack.go
+++ b/src/runtime/stack.go
@@ -910,6 +910,8 @@ func round2(x int32) int32 {
        return 1 << s
 }

+var debugStack = gogetenv("GOSTACK") == "1"
+
 // Called from runtime·morestack when more stack is needed.
 // Allocate larger stack and relocate to new stack.
 // Stack growth is multiplicative, for constant amortized cost.
@@ -937,6 +939,14 @@ func newstack(ctxt unsafe.Pointer) {
        // morestack so it has the necessary write barrier.
        gp.sched.ctxt = ctxt

+       if debugStack {
+               oldsize := gp.stack.hi - gp.stack.lo
+               newsize := oldsize * 2
+               print("runtime: newstack: ", oldsize, " -> ", newsize, "\n")
+               morebuf := thisg.m.morebuf
+               traceback(morebuf.pc, morebuf.sp, morebuf.lr, morebuf.g.ptr())
+       }
+
        if thisg.m.curg.throwsplit {
                // Update syscallsp, syscallpc in case traceback uses them.
                morebuf := thisg.m.morebuf

richardartoul commented 6 years ago

@petermattis Awesome, thanks! Will report back.

@josharian I will file an issue soon, I just want to gather all my evidence as it's kind of scattered around right now.

josharian commented 6 years ago

Don't worry about evidence or a big write-up. For this, a few sentences and a reference back to this issue suffice.

CAFxX commented 5 years ago

Just a note: I have a branch where I started implementing PC-based stack estimation. Basically, when a G is created it checks if we have an estimate for the most common stack size used by Gs started at the same PC. It shows potential but it's pretty crude ~because it just measures stack size instead of actual stack utilization (estimates are reset during GC to prevent infinite growth). Measuring (or at least estimating) actual stack utilization would likely require modifying the morestack machinery (but that seems to involve modifying a lot of assembler code of platforms I'm not familiar with)~. It would also benefit from compiler support, but that's less important.

If anybody wants to give it a spin, it's here: https://github.com/golang/go/compare/master...CAFxX:dynstack-local

By default stack estimation is enabled. Running the built programs using GOSTACKESTIM=0 disables stack estimation (i.e. it uses the normal behavior of the runtime). ~GOSTACKESTIMDEBUG=1~ GOSTACKESTIM=2 will print a line every time a goroutine is started with a non-standard stack size as a result of stack estimation.

update: I also implemented effective stack usage estimation; it turns out it didn't require modifying any assembler code.

aclements commented 5 years ago

For reference, this issue was just discussed at length in an Uber Engineering blog post (issue #31558 pointed that out).

CAFxX commented 5 years ago

@aclements Regarding this issue, and specifically my experiment above: I ran into trouble while implementing stack estimation because the runtime currently assumes, implicitly and at many different points throughout the code, that the stack guard size is constant.

This is more of a wish than anything since you are reworking exactly that (https://go-review.googlesource.com/c/go/+/172284/) but... if you could, in your CLs, "centralize" the assumption so that it's not spread around the runtime, I would be in a much better position to wrap up my attempt at fixing this. (I actually haven't reviewed that patch series, maybe you have already done this)

The reason I need this is that to estimate stack usage I allocate a stack of size N and set the stack guard so that morestack is called when more than N/2 is in use, at which point the stack guard reverts to the default. When the G dies I check if the stack guard is still at N/2 or if it's default; in the former case I know the G never used more than half the estimated space, and I can adjust the estimation accordingly.
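
To make the bookkeeping above concrete, here is a toy model of the scheme (ordinary Go types standing in for the runtime's goroutine/stack structures; this is not actual runtime code): allocate N, trip the probe guard at N/2, and on goroutine exit check whether the guard ever had to be widened.

// stackProbe models a goroutine stack allocated from an estimate of N bytes.
type stackProbe struct {
    size  uintptr // allocated stack size N
    guard uintptr // usable bytes before the (modeled) morestack trips
}

func newStackProbe(n uintptr) *stackProbe {
    // Trip the guard at half the allocation so real usage can be observed.
    return &stackProbe{size: n, guard: n / 2}
}

// onMorestack models the goroutine outgrowing the probe guard: the guard
// reverts to the full allocation, which is essentially free because the memory
// is already there, and records that the estimate was too small.
func (s *stackProbe) onMorestack() { s.guard = s.size }

// usedLessThanHalf reports, at goroutine exit, whether the guard was never
// widened, i.e. the goroutine stayed within half the estimated space, so the
// estimate can be adjusted downward.
func (s *stackProbe) usedLessThanHalf() bool { return s.guard == s.size/2 }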

lrita commented 5 years ago

In my service, we use the classic server model, which spawns a goroutine per new connection. There are many different clients: some use long-lived connections, others use short-lived connections, and we cannot optimize those short-lived clients.

We use snappy compression for network communication. snappy.encodeBlock needs a 32 KB stack, which causes excessive time to be spent in morestack for the short-lived connections.

In practice, we do not want to wait a long time for a good, final solution. I think a compiler directive is a simple and fast way to resolve these performance issues, even if it is gross. We could benefit from it soon.

valyala commented 5 years ago

I believe the high stack usage in snappy.encodeBlock must be fixed, since it results in high memory usage in apps serving a high number of concurrent connections. Stack-hog functions could be detected with the proposed stack profiler.

marcogrecopriolo commented 4 years ago

Just a heads up that this is an issue for Couchbase's N1QL as well. Personally, a directive would work well for N1QL (we only have one entry point where we need to run with a different stack size), but I understand the reluctance to use explicit controls. OTOH - having to build a whole infrastructure of goroutine workers seems a high price to pay for one very small ideological sin (one directive). Maybe we could have a little map of frequently used entry points holding each point's average stack size on exit?

uluyol commented 4 years ago

Not sure if this issue was noted anywhere. Here is a sample program that has enough static information to avoid any calls to morestack but in fact observes multiple stack growths:

package main

var shouldSet = false
var c = make(chan bool)
var x [16384]byte

func main() {
    go f32()
    <-c
    println("exit")
}

//go:noinline
func f32() {
    var buf [32]byte
    if shouldSet {
        // Make sure the compiler doesn't optimize away buf.
        for i := range buf {
            buf[i] = byte(i)
        }
        copy(x[:], buf[:])
    }
    f64()
}

[...]

//go:noinline
func f16384() {
    var buf [16384]byte
    if shouldSet {
        // Make sure the compiler doesn't optimize away buf.
        for i := range buf {
            buf[i] = byte(i)
        }
        copy(x[:], buf[:])
    }
    println("done")
    c <- true
}

Each function fX allocates a variable of X bytes on the stack and unconditionally calls another function f(2*X) up to 16KB. When go f32() is called, the compiler should have enough information to allocate a large-enough stack frame up front (it knows that f32 needs to allocate stack space for f64, f128, and so on). What we see instead are multiple calls to newstack:

```
runtime: newstack: 2048 -> 4096
runtime.(*mcentral).grow(0x10f6b98, 0x0)
        /Users/m/dev/go/src/runtime/mcentral.go:264 +0x13d
runtime.(*mcentral).cacheSpan(0x10f6b98, 0x1402200)
        /Users/m/dev/go/src/runtime/mcentral.go:106 +0x2bc
runtime.(*mcache).refill(0x1121108, 0xe)
        /Users/m/dev/go/src/runtime/mcache.go:138 +0x84
runtime.(*mcache).nextFree(0x1121108, 0xe, 0x1402300, 0x15fffff, 0xc00002e5c0)
        /Users/m/dev/go/src/runtime/malloc.go:867 +0x87
runtime.mallocgc(0x60, 0x10700e0, 0xc00002e601, 0x49)
        /Users/m/dev/go/src/runtime/malloc.go:1047 +0x792
runtime.newobject(0x10700e0, 0x100a024)
        /Users/m/dev/go/src/runtime/malloc.go:1176 +0x38
runtime.acquireSudog(0xc000044048)
        /Users/m/dev/go/src/runtime/proc.go:344 +0x281
runtime.chanrecv(0xc000044000, 0x0, 0xc000000101, 0x101204e)
        /Users/m/dev/go/src/runtime/chan.go:551 +0x223
runtime.chanrecv1(0xc000044000, 0x0)
        /Users/m/dev/go/src/runtime/chan.go:433 +0x2b
runtime.gcenable()
        /Users/m/dev/go/src/runtime/mgc.go:216 +0x95
runtime.main()
        /Users/m/dev/go/src/runtime/proc.go:167 +0x115
runtime.goexit()
        /Users/m/dev/go/src/runtime/asm_amd64.s:1374 +0x1
runtime: newstack: 2048 -> 4096
main.f512()
        /Users/m/dev/go/nomorestack/main.go:75 +0x86
main.f256()
        /Users/m/dev/go/nomorestack/main.go:62 +0x86
main.f128()
        /Users/m/dev/go/nomorestack/main.go:49 +0x83
main.f64()
        /Users/m/dev/go/nomorestack/main.go:36 +0x80
main.f32()
        /Users/m/dev/go/nomorestack/main.go:23 +0x5e
created by main.main
        /Users/m/dev/go/nomorestack/main.go:8 +0x35
runtime: newstack: 4096 -> 8192
main.f1024()
        /Users/m/dev/go/nomorestack/main.go:88 +0x91
main.f512()
        /Users/m/dev/go/nomorestack/main.go:75 +0x86
main.f256()
        /Users/m/dev/go/nomorestack/main.go:62 +0x86
main.f128()
        /Users/m/dev/go/nomorestack/main.go:49 +0x83
main.f64()
        /Users/m/dev/go/nomorestack/main.go:36 +0x80
main.f32()
        /Users/m/dev/go/nomorestack/main.go:23 +0x5e
created by main.main
        /Users/m/dev/go/nomorestack/main.go:8 +0x35
runtime: newstack: 8192 -> 16384
main.f2048()
        /Users/m/dev/go/nomorestack/main.go:101 +0x81
main.f1024()
        /Users/m/dev/go/nomorestack/main.go:88 +0x91
main.f512()
        /Users/m/dev/go/nomorestack/main.go:75 +0x86
main.f256()
        /Users/m/dev/go/nomorestack/main.go:62 +0x86
main.f128()
        /Users/m/dev/go/nomorestack/main.go:49 +0x83
main.f64()
        /Users/m/dev/go/nomorestack/main.go:36 +0x80
main.f32()
        /Users/m/dev/go/nomorestack/main.go:23 +0x5e
created by main.main
        /Users/m/dev/go/nomorestack/main.go:8 +0x35
runtime: newstack: 16384 -> 32768
main.f4096()
        /Users/m/dev/go/nomorestack/main.go:114 +0x97
main.f2048()
        /Users/m/dev/go/nomorestack/main.go:101 +0x81
main.f1024()
        /Users/m/dev/go/nomorestack/main.go:88 +0x91
main.f512()
        /Users/m/dev/go/nomorestack/main.go:75 +0x86
main.f256()
        /Users/m/dev/go/nomorestack/main.go:62 +0x86
main.f128()
        /Users/m/dev/go/nomorestack/main.go:49 +0x83
main.f64()
        /Users/m/dev/go/nomorestack/main.go:36 +0x80
main.f32()
        /Users/m/dev/go/nomorestack/main.go:23 +0x5e
created by main.main
        /Users/m/dev/go/nomorestack/main.go:8 +0x35
runtime: newstack: 32768 -> 65536
main.f8192()
        /Users/m/dev/go/nomorestack/main.go:127 +0x97
main.f4096()
        /Users/m/dev/go/nomorestack/main.go:114 +0x97
main.f2048()
        /Users/m/dev/go/nomorestack/main.go:101 +0x81
main.f1024()
        /Users/m/dev/go/nomorestack/main.go:88 +0x91
main.f512()
        /Users/m/dev/go/nomorestack/main.go:75 +0x86
main.f256()
        /Users/m/dev/go/nomorestack/main.go:62 +0x86
main.f128()
        /Users/m/dev/go/nomorestack/main.go:49 +0x83
main.f64()
        /Users/m/dev/go/nomorestack/main.go:36 +0x80
main.f32()
        /Users/m/dev/go/nomorestack/main.go:23 +0x5e
created by main.main
        /Users/m/dev/go/nomorestack/main.go:8 +0x35
done
exit
```

Unfortunately, I don't have any idea of how much fixing this would help in practice.

gopherbot commented 4 years ago

Change https://golang.org/cl/225800 mentions this issue: runtime: grow stack more than 2x if the new frame is large

marcogrecopriolo commented 4 years ago

Is there any way https://golang.org/cl/225800 could be merged in 1.14.x?

Between pooling goroutines and growing the stack with a dummy function call, the second is the better option (there is too much contention operating the goroutine queues); however, guessing the stack size is a bit of a dark art: too small and you achieve nothing, too big and you get a huge bottleneck in stackcacherefill().

If we could allocate a nice round number in one go without having to worry about whether it is slightly too big, it would make our lives easier: at least we wouldn't have to experiment to find the right stack size.

Also, if you want numbers from a real-life high-throughput service: untreated, this issue is eating up 12% of our overall CPU time. Hacking the stack grows our throughput by 6% and still needlessly uses 6% CPU time. Pooling goroutines on a partitioned queue (as many fragments as cores) removes the newstack() CPU load, but the contention in managing the queues is such that we improve throughput by only about 1%.

Starting the goroutine with a larger stack size would be best.

aclements commented 4 years ago

@marcogrecopriolo, I'm not sure I quite understand:

Hacking the stack grows our throughput by 6% and still needlessly uses 6% CPU time.

Do you mean that CL 225800 increases your throughput by 6%, or something else?

Also, if you want numbers from a real-life high-throughput service: untreated, this issue is eating up 12% of our overall CPU time.

That is interesting. Can you tell from profile stacks that this is definitely happening right after new goroutines start?

Starting the goroutine with a larger stack size would be best.

This is a very tricky trade-off, since many systems that create large numbers of goroutines benefit from the small starting stacks. There's no simple answer here, I'm afraid. :( CL 225800 has basically no downside, so I would love to know if it actually does help.

marcogrecopriolo commented 4 years ago

What I mean is that if I use code like this to fire my goroutine:

// MB-38469 / go issue 18138 initial goroutine stack too small
//go:noinline
func primeStack() {
    const _STACK_BUF_SIZE = 512 // 128 multiples
    var buf [_STACK_BUF_SIZE]int64

    // force the compiler to allocate buf
    for i := 127; i < _STACK_BUF_SIZE; i += 128 {
        buf[i] = int64(i)
    }

    _ = stackTop(buf[_STACK_BUF_SIZE-1])
}

//go:noinline
func stackTop(v int64) int64 {
    return v
}

func execOp(op Operator, context *Context, parent value.Value) {
    primeStack()
    op.RunOnce(context, parent)
}

// fork operator
func (this *base) fork(op Operator, context *Context, parent value.Value) {
    if op.getBase().inline {
        this.switchPhase(_NOTIME)
        op.RunOnce(context, parent)
        this.switchPhase(_EXECTIME)
    } else {
        go execOp(op, context, parent)
        // go op.RunOnce(context, parent)
    }
}

(go op.RunOnce(context, parent) is how I would normally call it), and I size _STACK_BUF_SIZE roughly right, then I get better throughput and a lower cost in newstack(). In my case, 512 gives me +6% throughput and half the cost in newstack().

If I get _STACK_BUF_SIZE wrong, I get a huge bottleneck in stackcacherefill() and a drop in throughput. In my case, setting _STACK_BUF_SIZE to 1024 gets me a 10% throughput drop and newstack() taking twice as much CPU time, mostly spent in stackcacherefill() piling up on some lock.

I would need to spend more time finding the optimal value for _STACK_BUF_SIZE (I know it should be lower than 512), but it takes me more than an hour to run my performance test rig, and even then, somebody might increase their stack needs down the line and render my testing useless.

If I could have 225800 merged, then I wouldn't need to worry about contention in stackcacherefill(), I'd just size my stack to a reasonable amount and never worry about it ever again.

As a side note, I know we want to avoid compiler directives, but all of the above is just a poor man's way to say:

//go:stack=4196
go op.RunOnce(context, parent)

Which would be cleaner. That would give the option of creating goroutines with larger stacks, where specifically required and when growing stacks is known to be a cost, and systems that create large numbers of goroutines would still benefit from small stacks as usual.

marcogrecopriolo commented 4 years ago

Sorry, I should clarify: from extensive profiling I know that newstack() is only called in primeStack(), so yes, it only happens at goroutine start.

roudkerk commented 3 years ago

Rather than a compiler directive, would it be possible to add functions to the runtime package

GoWithStackHint(f func(), stackSize int)
StackSize() int

The first could be used to start goroutines with statically determined stack size, for example

runtime.GoWithStackHint(func() { op.RunOnce(context, parent) }, 4096)

But it would also be possible to write a helper that dynamically chooses stack size based on the history of final stack sizes for goroutines created with that helper instance. For example

var runner = smartstacksize.New()

...

  runner.Go(func() { op.RunOnce(context, parent) })
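
Until something like this exists in the runtime, a rough user-space approximation is possible (a hedged sketch; all names here are hypothetical, and without runtime support it cannot truly set the stack size): pre-grow the new goroutine's stack with dummy frames, the same trick as growStack above, before running f.

var keepStackFrames bool // defeats the compiler proving pad below is dead

//go:noinline
func growBy(frames int) {
    if frames <= 0 {
        return
    }
    var pad [1024]byte // each recursive frame consumes roughly 1 KB of stack
    if keepStackFrames {
        for i := range pad {
            pad[i] = byte(i)
        }
    }
    growBy(frames - 1)
}

// goWithStackHint runs f in a new goroutine whose stack has been pre-grown to
// roughly stackSize bytes, so deep call paths in f avoid repeated morestack.
func goWithStackHint(f func(), stackSize int) {
    go func() {
        growBy(stackSize / 1024)
        f()
    }()
}

Growing while the goroutine's stack is still nearly empty keeps the copies cheap; as noted at the top of this issue, the expensive part of morestack is adjusting the pointers in frames that are already live.
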
gopherbot commented 3 years ago

Change https://golang.org/cl/341990 mentions this issue: runtime: predict stack sizing

CAFxX commented 3 years ago

Forgot to mention it back here, but I have a CL up for early review that basically implements something similar to what @aclements suggested. It uses a fairly conservative approach, so in this early incarnation it may still leave a bit of performance on the table, but in the tests so far it seems to work well enough in practice to be useful. If anyone wants to run their own benchmarks and report back it would be great (you need to build from that CL and then set GOEXPERIMENT=predictstacksize). The upside is that it requires no knobs, annotations, or code changes.

gopherbot commented 3 years ago

Change https://golang.org/cl/345889 mentions this issue: runtime: measure stack usage; start stacks larger if needed

randall77 commented 3 years ago

I wrote up an idea I had about starting goroutines with a larger than minimum stack size, for this issue. The doc is here. The immediate impetus for this doc was another attempt to fix that issue in CL 341990, but generally these ideas have been sloshing around my head for a while. Comments welcome. I have a first stab at an implementation in CL 345889.

zigo101 commented 3 years ago

Improving on @uluyol's idea, the initial stack size of a new goroutine can be set to any 2^n size:

func startRoutine() {
    // Use a dummy anonymous function to enlarge stack.
    func(x *interface{}) {
        type _ int // avoid being inlined
        if x != nil {
            *x = [128 << 20]byte{} // initial 256M stack
        }
    }(nil)

    // ... do work load
}

[update]: a demo: https://play.golang.org/p/r3t_OXxTvt7

zigo101 commented 3 years ago

Rather than a compiler directive, would it be possible to add functions to the runtime package

It would be great to add a runtime/debug.SetCurrentGoroutineOption function:

type GoroutineOption int
func SetCurrentGoroutineOption(key GoroutineOption, value int) {...}

const (
    StackSizeOfTheNextSpawnedGoroutine GoroutineOption = iota
    PriorityCasesInTheNextSelectBlock
    ...
)

pcostanza commented 2 years ago

It would be good if Go supported tail-call elimination. Then, at least in some cases, the stack wouldn't grow as much as it does now.

ianlancetaylor commented 2 years ago

A straightforward implementation of tail call elimination would lead to difficulties with runtime.Caller and runtime.Callers.

See also #22624.

pcostanza commented 2 years ago

"Proper" tail recursion requires the full removal of the caller's stack frame when a function is call in tail position. If that is too extreme, maybe it's possible to only remove the part of the stack frame that is not needed for runtime.Caller and runtime.Callers, in the hope to reduce time spent in morestack?

CAFxX commented 2 years ago

remove only the part of the stack frame that is not needed for runtime.Caller and runtime.Callers, in the hope of reducing time spent in morestack

See https://github.com/golang/go/issues/36067

zigo101 commented 2 years ago

It looks like a new implementation will be adopted in Go 1.19. After reading the design doc for the change, it appears that a global average stack size is now used as the initial stack size for newly spawned goroutines. Will this be efficient if the variance of goroutine stack sizes is large?

randall77 commented 2 years ago

Yes. Of course, it matters what exactly you mean by "efficient". The new behavior uses a bit more space for a bit less CPU.

Starting new goroutines with the average stack size will waste at most a factor of 2 in space (in addition to the at-most factor of 2 we already waste by growing stacks using power-of-two sizes). In exchange, we get less work on goroutine startup.

It is conceivable that the tradeoff is not advantageous for some programs. I suspect it will be a net win for most. I'm curious if anyone has an example where it is not the right thing to do.