golang / go

The Go programming language
https://go.dev
BSD 3-Clause "New" or "Revised" License
123.93k stars 17.65k forks source link

proposal: cmd/trace: add 'greedy goroutine diagnosis' option #20792

Open hnes opened 7 years ago

hnes commented 7 years ago

(Maybe helpful to bypass the issue #10958 in a quite different but probably better and more clean way. Coming from this thread, many thanks to the participants of this discussion)

Hi all :)

I have implemented a sub-option in the go tool trace recently named "diagreedy" which means "diagnoses and finds out the greediest several goroutines". This tool already helped us tracked down several deep hidden problems in our go applications and achieved more stable and short GC pause latency by fixing them afterwards.

Terminology

N        = GOMAXPROCS 

Tmax_stw = sum{top max N on-processor time slices}

Tgc      = Tmax_stw + Tgcwork

Big Picture

             t1=12           t2=11           t3=10
      |                |              |               | 
      __________________________________________________
      |                |              |                 |
      ---------------------------------------------------> time line 
P0    [------go 1------][-go 7-][-go 1-][-go 6-][-go 4-]
P1    [-go 3-][-go 4-][------go 2-----][-go 1-][-go 2-]
P2    [-go 4-][-go 3-][-go 8-][-go 9-][-----go 10-----]

In this diagram, the value of N is 3 (i.e. the GOMAXPROCS), and the top 3 long on-processor time slices is t1, t2, t3 and they are corresponding to goroutine 1, 2, 10.

The max time cost of the pre-GC-action Stop-the-World is named Tmax_stw and is the sum of t1, t2 and t3.

The Tgcwork is the time cost by the acually effective Garbage Collecting action after the phase Stop-the-World. Tgcwork is determined by the amount of the allocated objects and the complexity of the relationship among them.

Tmax_stw and Tgcwork together finally determined the one typical GC pause cost. As for Tgcwork, we could optimize it by reusing the objects (via sync.Pool for example) and by many other methods. But we do not discuss Tgcwork futher here. We concentrate on the Tmax_stw right now.

We could get this conclusion plainly:

The bigger the Tmax_stw is and the greedier and less cooperative the goroutines are, 
the more thrashing and churning our go applications would become, thus the worse 
latency our programs would perform (caused by the GC pause).

For some cpu intensive applications, the top max N on-processor time slices is often very big (if without very carefully task splits). Especially when our project is including tons of other and heavily used open source projects which will cause it becomes more complex and harder to trace.

Although we already had some tools to sample out the top n cpu usage of the functions. But the accumulated cpu usage of the functions and the longest top n time slices is not the same thing at all. For example, if we have 1000 different cpu intensive goroutines something like:

go func(){
    do_cpu_staff() -- time cost 1ms
    time.Sleep(time.Second)
}()

Our Tmax_stw would become GOMAXPROCS*1 ms. It's terrible because sometimes we would set the GOMAXPROCS to a pretty big value (16 or even 36 for instance). With the sample of top n cpu usage, we may couldn't find out the sources of the problem easily. But with the 'top max N on-processor time slices' approach, we could track down the sources of the long GC pause immediately and fix them like this:

go func(){
    do_cpu_staff0() -- time cost 200us
    runtime.Gosched()
    do_cpu_staff1() -- time cost 200us
    runtime.Gosched()
    do_cpu_staff2() -- time cost 200us
    runtime.Gosched()
    do_cpu_staff3() -- time cost 200us
    runtime.Gosched()
    do_cpu_staff4() -- time cost 200us
    time.Sleep(time.Second)
}()

Then our Tmax_stw would turns from GOMAXPROCS*1 ms to GOMAXPROCS/5 ms afterwards :) (If your GOMAXPROCS is very big, just splits do_cpu_staff even further. And you even could implement some logic which would let the do_cpu_staff automated splits its tasks based on the value of GOMAXPROCS and a other parameter could named time_cost_wanted to solve this problem for good.)

revision1

Sorry, I might had make a mistake:

Tmax_stw = sum{top max N on-processor time slices}

It should probably be:

Tmax_stw = max{top max N on-processor time slices}

Although I hadn't dive in the go scheduler's code yet, but I guess the 2nd is most likely the right answer, because the 1st algorithm is just too inefficient and unlikely using by the go scheduler.

But no matter which one is the right answer, the 1st one or the 2nd, it couldn't change our conclusion made before utterly.

Example

And there is a example about the usage of go tool trace -diagreedy

$ go tool trace -h
Usage of 'go tool trace':
Given a trace file produced by 'go test':
        go test -trace=trace.out pkg

Open a web browser displaying trace:
        go tool trace [flags] [pkg.test] trace.out

Generate a pprof-like profile from the trace:
    go tool trace -pprof=TYPE [pkg.test] trace.out

[pkg.test] argument is required for traces produced by Go 1.6 and below.
Go 1.7 does not require the binary argument.

Supported profile types are:
    - net: network blocking profile
    - sync: synchronization blocking profile
    - syscall: syscall blocking profile
    - sched: scheduler latency profile

Flags:
        -http=addr: HTTP service address (e.g., ':6060')
        -pprof=type: print a pprof-like profile instead
        -dump: dump all traced out as format text to stdout   <-
        -diagreedy=N: dump the top N greedy goroutines     <- new added options

Sample code to diagnose:

package main

import "os"
import "fmt"
import "runtime"

import "runtime/trace"

//import "sync/atomic"

//import "math/big"
import "time"

func init() {
    //runtime.LockOSThread()
}

func main() {
    //debug.SetGCPercent(-1)
    runtime.GOMAXPROCS(2)

    go func() {
        for {
            fmt.Println("---------------")
            f, err := os.Create("trace_my_" +
                time.Now().Format("2006-01-02-15-04-05") + ".out")
            if err != nil {
                panic(err)
            }
            err = trace.Start(f)
            if err != nil {
                panic(err)
            }
            time.Sleep(time.Second / 1)
            trace.Stop()
            time.Sleep(time.Second * 2)
        }
    }()

    go func() {
        for {
            time.Sleep(time.Millisecond * 10)
            go func() {
                time.Sleep(time.Millisecond * 100)
            }()
        }
    }()

    go func() {
        for {
            fmt.Println("heartbeat... ",
                time.Now().Format("2006-01-02-15:04:05"))
            time.Sleep(time.Second)
        }
    }()

    // maybe highest cpu usage
    go func() {
        l := make([]int, 0, 8)
        l = append(l, 0)
        ct := 0
        for {
            l = append(l, ct)
            ct++
            l = l[1:]
            runtime.Gosched()
        }
    }()

    amount := 1000000
    granul := 100000 
    go godeadloop0(amount, granul)
    go godeadloop1(amount, granul)
    godeadloop2(amount, granul)
    return
}

// potential greediest go functions
func godeadloop0(amount, granul int) {
    func() {
        for {
            ct := 0
            for {
                ct++
                if ct%granul == 0 {
                    runtime.Gosched()
                }
                if ct >= amount {
                    fmt.Println("--->")
                    break
                }
            }
            time.Sleep(time.Second / 10)
        }
    }()
}

func godeadloop1(amount, granul int) {
    func() {
        for {
            ct := 0
            for {
                ct++
                if ct%granul == 0 {
                    runtime.Gosched()
                }
                if ct >= amount {
                    fmt.Println("--->")
                    break
                }
            }
            time.Sleep(time.Second / 10)
        }
    }()
}

func godeadloop2(amount, granul int) {
    func() {
        for {
            ct := 0
            for {
                ct++
                if ct%granul == 0 {
                    runtime.Gosched()
                }
                if ct >= amount {
                    fmt.Println("--->")
                    break
                }
            }
            time.Sleep(time.Second / 10)
        }
    }()
}

Diagnosis for the 1st time:

$ go tool trace -diagreedy=10 /root/go/src/diaggreedy/trace_my_2017-06-25-10-01-13.out
finally the found toppest 10 greedy goroutines info

    7486357ns -> 0.007486s

        GoStart Ts: 739889373 P: 1 G: 9 StkID: 0 args0: 9 args1: 0 args2: 0
             GoSched
             4746045 main.godeadloop0.func1 /root/go/src/diaggreedy/x.go 86
             4744357 main.godeadloop0 /root/go/src/diaggreedy/x.go 95
        GoSched Ts: 747375730 P: 1 G: 9 StkID: 20 args0: 0 args1: 0 args2: 0
             4746045 main.godeadloop0.func1 /root/go/src/diaggreedy/x.go 86
             4744357 main.godeadloop0 /root/go/src/diaggreedy/x.go 95

    3617038ns -> 0.003617s

        GoStartLabel Ts: 649291006 P: 0 G: 27 StkID: 0 args0: 27 args1: 7 args2: 2
             GoStart -> GoBlock -> GoUnblock -> GoStartLabel -> GoBlock -> GoUnblock -> GoStartLabel -> GoBlock -> GoUnblock
             4285073 runtime.gcBgMarkWorker /usr/local/go/src/runtime/mgc.go 1435
             GC (fractional)
        GoBlock Ts: 652908044 P: 0 G: 27 StkID: 0 args0: 0 args1: 0 args2: 0
             GoStart -> GoBlock -> GoUnblock -> GoStartLabel -> GoBlock -> GoUnblock -> GoStartLabel -> GoBlock -> GoUnblock -> GoStartLabel
             4285073 runtime.gcBgMarkWorker /usr/local/go/src/runtime/mgc.go 1435

    2763199ns -> 0.002763s

        GoStart Ts: 925956546 P: 1 G: 10 StkID: 0 args0: 10 args1: 0 args2: 0
             GoSleep -> GoUnblock
             4452857 time.Sleep /usr/local/go/src/runtime/time.go 59
             4746323 main.godeadloop1.func1 /root/go/src/diaggreedy/x.go 112
             4744437 main.godeadloop1 /root/go/src/diaggreedy/x.go 114
        GoSched Ts: 928719745 P: 1 G: 10 StkID: 21 args0: 0 args1: 0 args2: 0
             4746333 main.godeadloop1.func1 /root/go/src/diaggreedy/x.go 105
             4744437 main.godeadloop1 /root/go/src/diaggreedy/x.go 114

    2573480ns -> 0.002573s

        GoStart Ts: 750086169 P: 0 G: 9 StkID: 0 args0: 9 args1: 0 args2: 0
             GoSched
             4746045 main.godeadloop0.func1 /root/go/src/diaggreedy/x.go 86
             4744357 main.godeadloop0 /root/go/src/diaggreedy/x.go 95
        GoSched Ts: 752659649 P: 0 G: 9 StkID: 20 args0: 0 args1: 0 args2: 0
             4746045 main.godeadloop0.func1 /root/go/src/diaggreedy/x.go 86
             4744357 main.godeadloop0 /root/go/src/diaggreedy/x.go 95

    2558799ns -> 0.002559s

        GoStart Ts: 625418894 P: 0 G: 9 StkID: 0 args0: 9 args1: 0 args2: 0
             GoSched
             4746045 main.godeadloop0.func1 /root/go/src/diaggreedy/x.go 86
             4744357 main.godeadloop0 /root/go/src/diaggreedy/x.go 95
        GoSched Ts: 627977693 P: 0 G: 9 StkID: 20 args0: 0 args1: 0 args2: 0
             4746045 main.godeadloop0.func1 /root/go/src/diaggreedy/x.go 86
             4744357 main.godeadloop0 /root/go/src/diaggreedy/x.go 95

    2462919ns -> 0.002463s

        GoStart Ts: 975882408 P: 0 G: 9 StkID: 0 args0: 9 args1: 97 args2: 0
             GoSched
             4746045 main.godeadloop0.func1 /root/go/src/diaggreedy/x.go 86
             4744357 main.godeadloop0 /root/go/src/diaggreedy/x.go 95
        GoSched Ts: 978345327 P: 0 G: 9 StkID: 20 args0: 0 args1: 0 args2: 0
             4746045 main.godeadloop0.func1 /root/go/src/diaggreedy/x.go 86
             4744357 main.godeadloop0 /root/go/src/diaggreedy/x.go 95

    2374319ns -> 0.002374s

        GoStart Ts: 240141393 P: 1 G: 10 StkID: 0 args0: 10 args1: 0 args2: 0
             GoSleep -> GoUnblock
             4452857 time.Sleep /usr/local/go/src/runtime/time.go 59
             4746323 main.godeadloop1.func1 /root/go/src/diaggreedy/x.go 112
             4744437 main.godeadloop1 /root/go/src/diaggreedy/x.go 114
        GoSched Ts: 242515712 P: 1 G: 10 StkID: 21 args0: 0 args1: 0 args2: 0
             4746333 main.godeadloop1.func1 /root/go/src/diaggreedy/x.go 105
             4744437 main.godeadloop1 /root/go/src/diaggreedy/x.go 114

    2250439ns -> 0.002250s

        GoStart Ts: 724664379 P: 0 G: 1 StkID: 0 args0: 1 args1: 0 args2: 0
             GoSched
             4746621 main.godeadloop2.func1 /root/go/src/diaggreedy/x.go 124
             4744517 main.godeadloop2 /root/go/src/diaggreedy/x.go 133
             4744280 main.main /root/go/src/diaggreedy/x.go 74
        GoSched Ts: 726914818 P: 0 G: 1 StkID: 19 args0: 0 args1: 0 args2: 0
             4746621 main.godeadloop2.func1 /root/go/src/diaggreedy/x.go 124
             4744517 main.godeadloop2 /root/go/src/diaggreedy/x.go 133
             4744280 main.main /root/go/src/diaggreedy/x.go 74

    2230559ns -> 0.002231s

        GoStart Ts: 382544542 P: 1 G: 9 StkID: 0 args0: 9 args1: 0 args2: 0
             GoSched
             4746045 main.godeadloop0.func1 /root/go/src/diaggreedy/x.go 86
             4744357 main.godeadloop0 /root/go/src/diaggreedy/x.go 95
        GoSched Ts: 384775101 P: 1 G: 9 StkID: 20 args0: 0 args1: 0 args2: 0
             4746045 main.godeadloop0.func1 /root/go/src/diaggreedy/x.go 86
             4744357 main.godeadloop0 /root/go/src/diaggreedy/x.go 95

    2112279ns -> 0.002112s

        GoStartLabel Ts: 866437288 P: 1 G: 28 StkID: 0 args0: 28 args1: 3 args2: 2
             GoStart -> GoBlock -> GoUnblock
             4285073 runtime.gcBgMarkWorker /usr/local/go/src/runtime/mgc.go 1435
             GC (fractional)
        GoBlock Ts: 868549567 P: 1 G: 28 StkID: 0 args0: 0 args1: 0 args2: 0
             GoStart -> GoBlock -> GoUnblock -> GoStartLabel
             4285073 runtime.gcBgMarkWorker /usr/local/go/src/runtime/mgc.go 1435

And then, we could found the source of the long GC pause godeadloop0 and chage the variable granul to value 10000. Test again:

$ go tool trace -diagreedy=10 /root/go/src/diaggreedy/trace_my_2017-06-25-09-52-03.out

finally the found toppest 10 greedy goroutines info

    408760ns -> 0.000409s

        GoStart Ts: 564369612 P: 0 G: 8 StkID: 0 args0: 8 args1: 0 args2: 0
             GoSched
             4745644 main.main.func4 /root/go/src/diaggreedy/x.go 66
        GoSched Ts: 564778372 P: 0 G: 8 StkID: 21 args0: 0 args1: 0 args2: 0
             4745644 main.main.func4 /root/go/src/diaggreedy/x.go 66

    393760ns -> 0.000394s

        GoStart Ts: 363796356 P: 1 G: 8 StkID: 0 args0: 8 args1: 0 args2: 0
             GoSched
             4745644 main.main.func4 /root/go/src/diaggreedy/x.go 66
        GoSched Ts: 364190116 P: 1 G: 8 StkID: 21 args0: 0 args1: 0 args2: 0
             4745644 main.main.func4 /root/go/src/diaggreedy/x.go 66

    375760ns -> 0.000376s

        GoStartLabel Ts: 583805009 P: 0 G: 16 StkID: 0 args0: 16 args1: 3 args2: 2
             GoStart -> GoBlock -> GoUnblock
             4285073 runtime.gcBgMarkWorker /usr/local/go/src/runtime/mgc.go 1435
             GC (fractional)
        GoBlock Ts: 584180769 P: 0 G: 16 StkID: 0 args0: 0 args1: 0 args2: 0
             GoStart -> GoBlock -> GoUnblock -> GoStartLabel
             4285073 runtime.gcBgMarkWorker /usr/local/go/src/runtime/mgc.go 1435

    364040ns -> 0.000364s

        GoStart Ts: 211900654 P: 0 G: 10 StkID: 0 args0: 10 args1: 0 args2: 0
             GoSleep -> GoUnblock
             4452857 time.Sleep /usr/local/go/src/runtime/time.go 59
             4746323 main.godeadloop1.func1 /root/go/src/diaggreedy/x.go 112
             4744437 main.godeadloop1 /root/go/src/diaggreedy/x.go 114
        GoSched Ts: 212264694 P: 0 G: 10 StkID: 23 args0: 0 args1: 0 args2: 0
             4746333 main.godeadloop1.func1 /root/go/src/diaggreedy/x.go 105
             4744437 main.godeadloop1 /root/go/src/diaggreedy/x.go 114

    358280ns -> 0.000358s

        GoStart Ts: 736130271 P: 1 G: 9 StkID: 0 args0: 9 args1: 0 args2: 0
             GoSched
             4746045 main.godeadloop0.func1 /root/go/src/diaggreedy/x.go 86
             4744357 main.godeadloop0 /root/go/src/diaggreedy/x.go 95
        GoSched Ts: 736488551 P: 1 G: 9 StkID: 24 args0: 0 args1: 0 args2: 0
             4746045 main.godeadloop0.func1 /root/go/src/diaggreedy/x.go 86
             4744357 main.godeadloop0 /root/go/src/diaggreedy/x.go 95

    357360ns -> 0.000357s

        GoStart Ts: 551030053 P: 0 G: 8 StkID: 0 args0: 8 args1: 0 args2: 0
             GoSched
             4745644 main.main.func4 /root/go/src/diaggreedy/x.go 66
        GoSched Ts: 551387413 P: 0 G: 8 StkID: 21 args0: 0 args1: 0 args2: 0
             4745644 main.main.func4 /root/go/src/diaggreedy/x.go 66

    323360ns -> 0.000323s

        GoStart Ts: 103892587 P: 1 G: 1 StkID: 0 args0: 1 args1: 0 args2: 0
             GoSched
             4746621 main.godeadloop2.func1 /root/go/src/diaggreedy/x.go 124
             4744517 main.godeadloop2 /root/go/src/diaggreedy/x.go 133
             4744280 main.main /root/go/src/diaggreedy/x.go 74
        GoSched Ts: 104215947 P: 1 G: 1 StkID: 22 args0: 0 args1: 0 args2: 0
             4746621 main.godeadloop2.func1 /root/go/src/diaggreedy/x.go 124
             4744517 main.godeadloop2 /root/go/src/diaggreedy/x.go 133
             4744280 main.main /root/go/src/diaggreedy/x.go 74

    320520ns -> 0.000321s

        GoStart Ts: 736132431 P: 0 G: 1 StkID: 0 args0: 1 args1: 0 args2: 0
             GoSched
             4746621 main.godeadloop2.func1 /root/go/src/diaggreedy/x.go 124
             4744517 main.godeadloop2 /root/go/src/diaggreedy/x.go 133
             4744280 main.main /root/go/src/diaggreedy/x.go 74
        GoSched Ts: 736452951 P: 0 G: 1 StkID: 22 args0: 0 args1: 0 args2: 0
             4746621 main.godeadloop2.func1 /root/go/src/diaggreedy/x.go 124
             4744517 main.godeadloop2 /root/go/src/diaggreedy/x.go 133
             4744280 main.main /root/go/src/diaggreedy/x.go 74

    320480ns -> 0.000320s

        GoStartLabel Ts: 777280946 P: 0 G: 16 StkID: 0 args0: 16 args1: 5 args2: 2
             GoStart -> GoBlock -> GoUnblock -> GoStartLabel -> GoBlock -> GoUnblock
             4285073 runtime.gcBgMarkWorker /usr/local/go/src/runtime/mgc.go 1435
             GC (fractional)
        GoBlock Ts: 777601426 P: 0 G: 16 StkID: 0 args0: 0 args1: 0 args2: 0
             GoStart -> GoBlock -> GoUnblock -> GoStartLabel -> GoBlock -> GoUnblock -> GoStartLabel
             4285073 runtime.gcBgMarkWorker /usr/local/go/src/runtime/mgc.go 1435

    319480ns -> 0.000319s

        GoStart Ts: 940403207 P: 0 G: 10 StkID: 0 args0: 10 args1: 0 args2: 0
             GoSleep -> GoUnblock
             4452857 time.Sleep /usr/local/go/src/runtime/time.go 59
             4746323 main.godeadloop1.func1 /root/go/src/diaggreedy/x.go 112
             4744437 main.godeadloop1 /root/go/src/diaggreedy/x.go 114
        GoSched Ts: 940722687 P: 0 G: 10 StkID: 23 args0: 0 args1: 0 args2: 0
             4746333 main.godeadloop1.func1 /root/go/src/diaggreedy/x.go 105
             4744437 main.godeadloop1 /root/go/src/diaggreedy/x.go 114

As you see, the typical GC pause finally changed from 8ms to 0.4ms. Of course the situation would be much more complex in a real big project, but the key idea of the tracking is the same and clear.

Also implemented a 'dump' option like this below:

$ # Ts is the event's timestamp in nanoseconds
$ # by reading the go/src/internal/trace/parser.go to get more info

$ go tool trace -dump /root/go/src/diaggreedy/trace_my_2017-06-25-10-03-15.out 

        0xc42001c120 Off: 24 type: 13 GoCreate Ts: 0 P: 1 G: 0 StkID: 1 args0: 1 args1: 2 args2: 0 link: 0xc42001d050
                 4744891 main.main.func1 /root/go/src/diaggreedy/x.go 30
        0xc42001c1b0 Off: 30 type: 13 GoCreate Ts: 3879 P: 1 G: 0 StkID: 1 args0: 2 args1: 3 args2: 0 link: 0x0
                 4744891 main.main.func1 /root/go/src/diaggreedy/x.go 30
        0xc42001c240 Off: 36 type: 31 GoWaiting Ts: 4119 P: 1 G: 2 StkID: 0 args0: 2 args1: 0 args2: 0 link: 0x0
        0xc42001c2d0 Off: 39 type: 13 GoCreate Ts: 8079 P: 1 G: 0 StkID: 1 args0: 3 args1: 4 args2: 0 link: 0x0
                 4744891 main.main.func1 /root/go/src/diaggreedy/x.go 30
        0xc42001c360 Off: 45 type: 31 GoWaiting Ts: 8199 P: 1 G: 3 StkID: 0 args0: 3 args1: 0 args2: 0 link: 0xc427e29e60
        0xc42001c3f0 Off: 48 type: 13 GoCreate Ts: 13639 P: 1 G: 0 StkID: 1 args0: 4 args1: 5 args2: 0 link: 0x0
                 4744891 main.main.func1 /root/go/src/diaggreedy/x.go 30
        0xc42001c480 Off: 55 type: 31 GoWaiting Ts: 13759 P: 1 G: 4 StkID: 0 args0: 4 args1: 0 args2: 0 link: 0x0
        0xc42001c510 Off: 58 type: 13 GoCreate Ts: 14159 P: 1 G: 0 StkID: 1 args0: 5 args1: 6 args2: 0 link: 0xc42001cb40
                 4744891 main.main.func1 /root/go/src/diaggreedy/x.go 30
        0xc42001c5a0 Off: 64 type: 13 GoCreate Ts: 14399 P: 1 G: 0 StkID: 1 args0: 6 args1: 7 args2: 0 link: 0x0
                 4744891 main.main.func1 /root/go/src/diaggreedy/x.go 30
        0xc42001c630 Off: 70 type: 31 GoWaiting Ts: 14439 P: 1 G: 6 StkID: 0 args0: 6 args1: 0 args2: 0 link: 0xc4223ee2d0
        0xc42001c6c0 Off: 73 type: 13 GoCreate Ts: 18039 P: 1 G: 0 StkID: 1 args0: 7 args1: 8 args2: 0 link: 0x0
                 4744891 main.main.func1 /root/go/src/diaggreedy/x.go 30
        0xc42001c750 Off: 79 type: 31 GoWaiting Ts: 18119 P: 1 G: 7 StkID: 0 args0: 7 args1: 0 args2: 0 link: 0xc44016d710
        0xc42001c7e0 Off: 82 type: 13 GoCreate Ts: 21479 P: 1 G: 0 StkID: 1 args0: 8 args1: 9 args2: 0 link: 0xc42001d290
                 4744891 main.main.func1 /root/go/src/diaggreedy/x.go 30
        0xc42001c870 Off: 88 type: 13 GoCreate Ts: 25199 P: 1 G: 0 StkID: 1 args0: 9 args1: 10 args2: 0 link: 0xc42592f440
                 4744891 main.main.func1 /root/go/src/diaggreedy/x.go 30
        0xc42001c900 Off: 94 type: 13 GoCreate Ts: 29159 P: 1 G: 0 StkID: 1 args0: 10 args1: 11 args2: 0 link: 0xc42001d170
                 4744891 main.main.func1 /root/go/src/diaggreedy/x.go 30
        0xc42001c990 Off: 100 type: 13 GoCreate Ts: 33279 P: 1 G: 0 StkID: 1 args0: 17 args1: 12 args2: 0 link: 0x0
                 4744891 main.main.func1 /root/go/src/diaggreedy/x.go 30
        0xc42001ca20 Off: 106 type: 32 GoInSyscall Ts: 33359 P: 1 G: 17 StkID: 0 args0: 17 args1: 0 args2: 0 link: 0x0
        0xc42001cab0 Off: 109 type: 5 ProcStart Ts: 33439 P: 1 G: 0 StkID: 0 args0: 3 args1: 0 args2: 0 link: 0x0
        0xc42001cb40 Off: 112 type: 14 GoStart Ts: 33719 P: 1 G: 5 StkID: 6 args0: 5 args1: 0 args2: 0 link: 0xc42001ce10
                 4744545 main.main.func1 /root/go/src/diaggreedy/x.go 22
        0xc42001cbd0 Off: 115 type: 33 HeapAlloc Ts: 36559 P: 1 G: 5 StkID: 0 args0: 253952 args1: 0 args2: 0 link: 0x0
        0xc42001cc60 Off: 120 type: 33 HeapAlloc Ts: 36879 P: 1 G: 5 StkID: 0 args0: 262144 args1: 0 args2: 0 link: 0x0
        0xc42001ccf0 Off: 172 type: 4 Gomaxprocs Ts: 42519 P: 1 G: 5 StkID: 13 args0: 2 args1: 0 args2: 0 link: 0x0
                 4366993 runtime.startTheWorld /usr/local/go/src/runtime/proc.go 951
                 4456480 runtime.StartTrace /usr/local/go/src/runtime/trace.go 256
                 4743634 runtime/trace.Start /usr/local/go/src/runtime/trace/trace.go 23
                 4744891 main.main.func1 /root/go/src/diaggreedy/x.go 30
        0xc42001cd80 Off: 177 type: 13 GoCreate Ts: 64239 P: 1 G: 5 StkID: 15 args0: 18 args1: 14 args2: 0 link: 0xc42001cea0
                 4743712 runtime/trace.Start /usr/local/go/src/runtime/trace/trace.go 34
                 4744891 main.main.func1 /root/go/src/diaggreedy/x.go 30
        0xc42001ce10 Off: 184 type: 19 GoSleep Ts: 69599 P: 1 G: 5 StkID: 16 args0: 0 args1: 0 args2: 0 link: 0xc44016e2d0
                 4452857 time.Sleep /usr/local/go/src/runtime/time.go 59
                 4744919 main.main.func1 /root/go/src/diaggreedy/x.go 34
        0xc42001cea0 Off: 188 type: 14 GoStart Ts: 72679 P: 1 G: 18 StkID: 14 args0: 18 args1: 0 args2: 0 link: 0xc42001cfc0
                 4743809 runtime/trace.Start.func1 /usr/local/go/src/runtime/trace/trace.go 26
        0xc42001cf30 Off: 191 type: 28 GoSysCall Ts: 73959 P: 1 G: 18 StkID: 17 args0: 0 args1: 0 args2: 0 link: 0x0
                 4552069 syscall.write /usr/local/go/src/syscall/zsyscall_linux_amd64.go 1064
                 4550393 syscall.Write /usr/local/go/src/syscall/syscall_unix.go 181
                 4591004 os.(*File).write /usr/local/go/src/os/file_unix.go 186
                 4587980 os.(*File).Write /usr/local/go/src/os/file.go 142

        # ......

I would be very glad to push it to golang project and benefits more if this proposal is accepted.

Many thanks for your time and good patience.

hnes commented 7 years ago

If you are interested in this little tool, there is the repository.

hnes commented 7 years ago

Sorry, I might had make a mistake:

Tmax_stw = sum{top max N on-processor time slices}

It should probably be:

Tmax_stw = max{top max N on-processor time slices}

Although I hadn't dive in the go scheduler's code yet, but I guess the 2nd is most likely the right answer, because the 1st algorithm is just too inefficient and unlikely using by the go scheduler.

But no matter which one is the right answer, the 1st one or the 2nd, it couldn't change our conclusion made before utterly.

rsc commented 7 years ago

/cc @aclements @RLH

rsc commented 7 years ago

/cc @dvyukov

hnes commented 7 years ago

Thanks a lot, rsc :)

RLH commented 7 years ago

Nice tool.

Moving forward the Go compiler needs to be taught how to preempt tight loops so, while valuable, the usefulness of this tool is short term. The unfortunate downside is that once the tool is in the source base it would need to be supported even though its usefulness will be minimal once the compiler adds loop preemption.

On Mon, Jun 26, 2017 at 3:53 PM, Russ Cox notifications@github.com wrote:

/cc @aclements https://github.com/aclements @RLH https://github.com/rlh

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/golang/go/issues/20792#issuecomment-311163861, or mute the thread https://github.com/notifications/unsubscribe-auth/AA7Wn34dy3_FVSqGspt0jSod-__54uniks5sIAxKgaJpZM4OEhc5 .

hnes commented 7 years ago

Thanks Richard for yo­ur reply :)

Honestly, I don't thi­nk this tool would be­come useless even aft­er our compiler adds a­ 'mature default-on or even always-on loop preempt­ion option'. And there are s­everal reasons.

The approach of loo­p preemption cannot ­eliminate all cpu gre­edy problems, there a­re just too many diff­icult and sticky corn­er cases in cpu greed­y problems that loop­ preemption cannot h­andle them all efficiently right now (or m­aybe would pay a very­ dear price to achieve it, list them all, ­and fix them all efficiently? It'­s just too expensive ­even you could list all t­he corner cases).

Even when the loop p­reemption is confide­nt about itself that ­it can eliminate all ­cpu greedy problems a­nd all the difficult ­and sticky corner cas­es, and after the option l­oop preemption is al­ways enabled in the g­o compiler, shouldn't ­we still need a ­tool for the users to­ test out of the box that wether the­ go binary is working­ as the compiler prom­ised ( which is 'all cpu greedy­ problems is solved b­y option loop preemp­tion') and as they intended to? And then, if they still found some cpu greedy hot path, they could optimize it by themself if they want to or report it as a issue, isn't be even better?

Personally view, a­lthough loop preempt­ion is a pretty good­ feature but it shoul­dn't be always 'on' o­r even be default 'on­', it should be defau­lt 'off' even when it­ is 'mature' (and to ­what extent you could­ believe it is mature­?). And because 'loop­ preemption' is just ­a little 'anti-common­-sense' and radical, ­it should not be even ­a global option. It s­hould be a option whi­ch has a granularity ­of every single sourc­e files (I do not kno­w whether it is or no­t now already).

Go already has tons ­of open source projec­ts and libraries whic­h they only have the ­assumption of 'preemp­t only at function calls' ­at that time when the­y were developed. And ther­e maybe something wou­ld turn wrong when th­e compiler performs t­he 'loop preemption t­ransformation' for al­l of them. It is a little dangerous and risky as many go applications plays a very important role in their systems.

With the ­assumption of 'preemp­t only at function calls', users can do some high priority batching jobs without been blocked by scheduler in a very high speed and low latency manner. But with the option 'loop preemp­tion' is 'on', such jobs will become less efficient and the latency will be worse.

So, the trade-of­f of the overhead of 'loop preemp­tion' should be chose­n and decided by the ­users themself: have ­it or not. (As far as I know, the current implementation of 'insert_resched_checks' is 'The simpler the loop, the worse the overhead, that is why this phase is not normally enabled. -- from David Chase', and it maybe not a quite easy job at all to bring a efficient and complete 'loop preempt' function to our users in a short time, one year? for instance.)

So above all, I think it would be better to provide a diagnose tool for the users to see the status of their goroutines' cpu time slices and optimize their codes by themself if they really want to, no matter the 'loo­p preemption' is default 'off' now or may default 'on' in the future. Still, it maybe often better to provide more choices to our users then to limit them to a single one.

aclements commented 7 years ago

The approach of loo­p preemption cannot ­eliminate all cpu gre­edy problems, there a­re just too many diff­icult and sticky corn­er cases in cpu greed­y problems that loop­ preemption cannot h­andle them all efficiently right now (or m­aybe would pay a very­ dear price to achieve it, list them all, ­and fix them all efficiently? It'­s just too expensive ­even you could list all t­he corner cases).

I'm not sure I understand your claim, or perhaps I don't understand your definition of "cpu greedy problems". Can you give an example of such a problem that loop preemption doesn't solve?

I also wouldn't call the cost "very dear". Our current prototype of loop preemption shows a ~2.5% geomean slow-down across the go1 benchmarks, which are CPU-intensive and have several tight loops. Each release generally speeds up code by more than that just through code generation improvements.

hnes commented 7 years ago

Hi Austin, I think you are actually underestimating the distance and the difficulty of the gap between 'local optimization' and 'perfect and optimal solution' of the "cpu greedy problems" in golang.

The definition of "cpu greedy problems" is:

A cpu greedy goroutine is a goroutine which would very rarely do the things that may let he himself been scheduled away from the cpu he had already been occupied. The frequency of the word 'rarely' here is the degree that could let the others starve a little or even to death.

(this thread had discussed such influence)

The 'cpu greediest goroutine' example maybe something like this:

go func(){
    for{}
}()

But the 'loop' is only one kind big ordinary case of such problems, how about this one:


_adder := func(c *uint32, depth uint32) uint32 {
    atomic.CompareAndSwapUint32(c, 0, 333)
    atomic.CompareAndSwapUint32(c, 0, 333)
    atomic.CompareAndSwapUint32(c, 0, 333)
    // ....
    /* omitted, because there are */
    /* 100,000 of them :(  */
    atomic.CompareAndSwapUint32(c, 0, 333)
    atomic.CompareAndSwapUint32(c, 0, 333)
    atomic.CompareAndSwapUint32(c, 0, 333)
    return atomic.LoadUint32(c)
}

godeadloop := func() {
    var ct uint32
    for {
        ct++
        ct = _adder(&ct, 1)
        ct = _adder(&ct, 1)
        ct = _adder(&ct, 1)
        // ....
        /* omitted, because there are */
        /* 100,000 of them :(  */
        ct = _adder(&ct, 1)
        ct = _adder(&ct, 1)
        ct = _adder(&ct, 1)
    }
}

go godeadloop()

intact version source code is in here

In this example there is no loop anyway right? But it is pretty 'cpu greedy' and this kind of piece code could let other goroutines to starve or even to death (whole go process would stuck for several minutes probably I think if the STW happened together).

Of course that you could solve this problem by a new 'insert_more_resched_checks' to the go AST after our discussion, but this is not the point, the point is the 'loop preempt' solution we currently use for the "cpu greedy problems" already had neglect this one corner case by mistakenly, how could we know there is not some more corner cases just hidden there that we didn't find them yet?

If you really think the 'loop preempt' solution could be the 'perfect and optimal solution', then could you write a formal and strict mathematical proof for it then? If you couldn't, we can only use it as a suboptimal solution. If we use a 'suboptimal solution' as a 'perfect optimal solution', terrible things may happen.

So, this is why I still insist on to add some diagnosis tool like 'diagreedy' to golang, you could use it as a verification tool or whatever, just give some more information to our users and let them know more about what is happening deep inside the go processes, it is also a win-win solution not only for the users, but also for the developers.

hnes commented 7 years ago

I also wouldn't call the cost "very dear". Our current prototype of loop preemption shows a ~2.5% geomean slow-down across the go1 benchmarks, which are CPU-intensive and have several tight loops. Each release generally speeds up code by more than that just through code generation improvements.

If you have a large computer cluster which worth 100,000,000 dollars, then the improvement of 2.5% on the performance could save you 2,500,000 dollars. And 2,500,000 dollars could do many valuable things :D

hnes commented 7 years ago

For some people who prefer performance, please give them other choices, do not close the doors to them, all they need to do is just to analyze the trace.out the go process have dumped on the fly.

"Loop preempt" is a good tool, I think a good usage of it is to enable the "loop preempt" on some opensource project that we don't trust yet and disable them on our own reliable code to have a better performance and controllability.

( Sorry, I might have a impression that you may want to do anything for the users, but that would be some kind of over optimization if you do too much of it :p )

josharian commented 7 years ago

The compiler has already been taught to identify problematic loops. I'd suggest that a good next step would be to print out that information when requested, perhaps with -m or -m -m or with a new flag (regardless of whether the code generation will be altered). That is likely to be helpful long term anyway, and it is very likely a small, unobtrusive change. If it is small and safe enough, maybe (but just maybe) it could even be snuck into 1.9.

cc @dr2chase @bradfitz

hnes commented 7 years ago

A very good idea, and I agree with you 👍

The more tools and options to diagnose the possibly problematic code the better.

(But one problem here is that maybe nearly all 'for loop' uses in the sources will be reported as 'problematic'. Although I don't know the details implementation of the current 'loop preempt' in the compiler. But the optimization is often one of the most difficult parts in compiler science ( because compiler can not read the mind of programers). And the optimization nearly all is 'suboptimal solution' and there is hardly any 'perfect optimization'. This why I still also insist on to provide some more external diagnosis tools to help programmers to improve their code if they want. )

hnes commented 7 years ago

I also thought that some code had been wrote with go assembly language like crc32 would be very cpu intensive and has some 'loops' that insert_resched_checks maybe not handled yet at current implementation. (Not quite sure about it. Still, the preemption in go assembly codes may break the semantic of go assembler.)

hnes commented 7 years ago

/cc @rsc @aclements @RLH @josharian

It has been nearly 10 days since our first discussion. Maybe we should make a conclusion about it?

josharian commented 7 years ago

@hnes proposals are reviewed on a schedule. And 10 days is not very long; many decisions take much longer to be made.

hnes commented 7 years ago

Thank Josh for letting me know :)

proposals are reviewed on a schedule. And 10 days is not very long; many decisions take much longer to be made.

It is pretty reasonable.

rsc commented 7 years ago

Certainly loop preemption in Go 1.10 should make this less necessary, but as noted not completely unnecessary. In addition to straight line code there is also the question of large memmoves, which are in non-preemptible assembly, and maybe other operations like those.

The question is how to expose this in the tool. "go tool trace" is really about the web view. This new flag turns it into a command-line program with a new output format. Can we instead somehow highlight "greedy" goroutines in the web view? Then no new option is needed. Or maybe it should be a third-party separate tool anyway?

@dvyukov, as owner of go tool trace, any thoughts?

dvyukov commented 7 years ago

Perhaps if we trace when we requested STW/foreachP and when we actually get STW/foreachP, it will give enough info in the web view to understand the bad cases (what's running at the time of the request for for how long).

hnes commented 7 years ago

Perhaps if we trace when we requested STW/foreachP and when we actually get STW/foreachP, it will give enough info in the web view to understand the bad cases (what's running at the time of the request for for how long).

But the sampling could be imprecise and still "need some good luck" for users to find the real top N greedy goroutines.

Traversing the whole events[] list and use the min-heap sort algorithm to find the precise Top-N greedy goroutines, this kind of approach's worst time complexity is only about O(n*lgn) (it's much faster than parsing the trace.out binary file in practical).

dvyukov commented 7 years ago

How do you propose to detect non-cooperative goroutines? The fact that it runs on CPU for a long time does not mean anything, it still can be perfectly cooperative if/when we try to preempt it.

You understand that this is a wrong way to write Go code, right?

    do_cpu_staff0() -- time cost 200us
    runtime.Gosched()
    do_cpu_staff1() -- time cost 200us
    runtime.Gosched()
    do_cpu_staff2() -- time cost 200us
    runtime.Gosched()

Goroutines are preemptive since Go 1.1 and soon they become even more preemptive.

hnes commented 7 years ago

How do you propose to detect non-cooperative goroutines?

You are right and I quite agree with you.

The STW/foreachP should record the information too when it found some non-cooperative goroutines who had take too long (say more than 1ms or could be configured by user) to yield the cpu.

The fact that it runs on CPU for a long time does not mean anything

The more goroutines are cpu greedy, the worst the overall latency the application would be.

it still can be perfectly cooperative if/when we try to preempt it.

Yes, but before you chose to "preempt" it, you probably have to know where the problem exactly is, and this is why we may still need the diagnosis result of Top N greedy gorountines and the precise source line numbers (as we already discussed in this thread, the the for loop preemptive approach cannot perfectly solves all this kind of problems, think about assembly for one instance.)

You understand that this is a wrong way to write Go code, right? do_cpu_staff0() -- time cost 200us runtime.Gosched() do_cpu_staff1() -- time cost 200us runtime.Gosched() do_cpu_staff2() -- time cost 200us runtime.Gosched() Goroutines are preemptive since Go 1.1 and soon they become even more preemptive.

May have a look about this although it is some kinda hacky but it still could mean something. This piece code in go 1.8.3 the STW will stuck nearly forever.

The preemptive is just relatively.

So, how about let we combine both of them?

  1. The STW/foreachP records the information when it found some non-cooperative goroutines who takes too long (say more than 1ms or could be configured by user) to yield the cpu when STW/foreachP wants to stop the world (May even dumps some warning log to stout or to log file when application is running? That would be very useful in the production environment).

  2. The diagnosis of Top N greedy gorountines we already discussed previously.

dvyukov commented 7 years ago

The more goroutines are cpu greedy, the worst the overall latency the application would be.

No. What matters is only if a goroutine is not cooperative.

Yes, but before you chose to "preempt" it, you probably have to know where the problem exactly is

No. Runtime preempts all currently running goorutines.

hnes commented 7 years ago

No. What matters is only if a goroutine is not cooperative.

Agree with you. Maybe this is the literal ambiguity, the phrase "cpu greedy goroutines" I used here does imply that it's less cooperative or completely no cooperative (a example is the go function godeadloop0 defined in this proposal code example).

No. Runtime preempts all currently running goorutines.

The more precise is "Go Runtime always try to preempts all currently running gorutines." But if the goroutines do not give the runtime any single chance, the runtime would never preempt even one goroutines successfully, and that is why we want this useful diagnosis tool to optimize our codes :)

dvyukov commented 7 years ago

the phrase "cpu greedy goroutines" I used here does imply that it's less cooperative or completely no cooperative (a example is the go function godeadloop0 defined in this proposal code example).

So, how do you propose to detect these 'completely non-cooperative' goroutines?

hnes commented 7 years ago

So, how do you propose to detect these 'completely non-cooperative' goroutines?

In fact I had think about this before, and here is maybe a appropriate solution:

             t1=12           t2=11           t3=10
      |                |              |               | 
      __________________________________________________
      |                |              |                 |
      ---------------------------------------------------> time line 
P0    [------go 1------][-go 7-][-----------------][-go 6-][-go 4-]
P1    [-go 3-][-go 4-][------------go 2-----------][-go 1-][-go 2-]
P2    [-go 4-][-go 3-][-go 8-][-------------------][-----go 10-----]
                        |
                        |---->---->---->---->---->|
                        |                         |
                       STW start               release
STW:
    timeout := 1ms
    max_total_time := 2s
    total_time := 0
    set try_to_stw_flag
    for _,p = range [P0,P1,P2] {
        locked,timecost := mutex_lock_timeout(p,timeout)
        total_time+=timecost
        for locked is nil {
            scan [P0,P1,P2]'s currently running user goroutines' info
            dump it out // to trace struct || some log fd
            if total_time >= max_total_time{
                panic("stw take too long")
            }
            locked,timecost = mutex_lock_timeout(p,timeout)
            total_time+=timecost
        }
    }   
hnes commented 7 years ago

A little correction.

STW:
    timeout := 1ms
    max_total_time := 2s
    total_time := 0
    // ++ do not allocate the idle Processors to ready goroutines anymore
    set try_to_stw_flag
    for _,p = range [P0,P1,P2] {
        locked,timecost := mutex_lock_timeout(p,timeout)
        total_time+=timecost
        for locked is nil {
            // ++ may be it is hard to get the details( include func call stack) 
            // ++ of a running goroutine dynamically?
            scan [P0,P1,P2]'s currently running user goroutines' info
            dump it out // to trace struct || some log fd
            if total_time >= max_total_time{
                panic("stw take too long")
            }
            locked,timecost = mutex_lock_timeout(p,timeout)
            total_time+=timecost
        }
    }   
    // ++
    unset try_to_stw_flag
    // stw successfully
dvyukov commented 7 years ago

It seems that we just need to trace when we requested STW/foreachP and when we actually get STW/foreachP. The rest of info is already in the trace and can be inferred in the trace command (namely, what goroutines were running at the time of the request).

hnes commented 7 years ago

It seems that we just need to trace when we requested STW/foreachP and when we actually get STW/foreachP. The rest of info is already in the trace and can be inferred in the trace command (namely, what goroutines were running at the time of the request).

Bravo! And it would be much convinient to implement our new tool :D

rsc commented 7 years ago

OK, so it sounds like we should add that trace info to the trace dump file. But I still don't see what we're going to do to expose it, unless the STW is a different clearly marked bar already. What will it look like? A special command-line mode people need to know about is not discoverable enough.

aclements commented 7 years ago

We currently record mark termination STW, but somehow forgot about sweep termination STW. I've got a CL to add this for computing MMUs, but it would help with this analysis, too.

gopherbot commented 7 years ago

Change https://golang.org/cl/55411 mentions this issue: runtime,cmd/trace: trace GC STW events

rsc commented 7 years ago

OK, so after CL 55411, @aclements, is there anything else for this issue? Should we at least mark it proposal-accepted?

aclements commented 7 years ago

OK, so after CL 55411, @aclements, is there anything else for this issue?

In terms of infrastructure support, I believe CL 55411 is sufficient.

Should we at least mark it proposal-accepted?

I certainly hope we can make this analysis simply unnecessary, but it could be useful in the interim and in flushing out bugs. Presentation questions aside, it should be relatively straightforward to implement a robust version on top of the events added by CL 55411.

rsc commented 7 years ago

Marking Proposal-Hold until there's a clearer explanation of what the exposed UI is.