golang / go

The Go programming language
https://go.dev

runtime: mark assist blocks GC microbenchmark for 7ms #27732

Open dr2chase opened 6 years ago

dr2chase commented 6 years ago

Please answer these questions before submitting your issue. Thanks!

What version of Go are you using (go version)?

1.11

Does this issue reproduce with the latest release?

Yes

What operating system and processor architecture are you using (go env)?

OSX (also observed on Linux), both AMD64.

What did you do?

This single-threaded microbenchmark measures GC-induced latency when allocating a 1k byte slice and storing it in a big circular buffer, repeating the operation five times the length of the buffer (i.e., one initialization and four reuses). Live memory is about 210MB: the circular buffer (200,000 slice elements) plus 200,000 pointer-free 1k buffers.

This version of the benchmark is instrumented to collect a trace, because that's how we figured out that it was mark assist.

The original benchmark source is https://github.com/WillSewell/gc-latency-experiment, adapted here to be more instrumented and more plain-spoken about what it is doing.

package main

import (
    "fmt"
    "os"
    "runtime/trace"
    "time"
    "unsafe"
)

const (
    bufferLen = 200000
    msgCount  = 1000000
)

type kbyte []byte
type circularBuffer [bufferLen]kbyte

var worst time.Duration

// For making sense of the bad outcome.
var total time.Duration
var worstIndex int
var allStart time.Time
var worstElapsed time.Duration

// newSlice allocates a 1k slice of bytes and initializes them all to byte(n)
func newSlice(n int) kbyte {
    m := make(kbyte, 1024)
    for i := range m {
        m[i] = byte(n)
    }
    return m
}

// storeSlice stores a newly allocated 1k slice of bytes at circularBuffer[count%bufferLen]
// (bufferLen is the length of the array circularBuffer)
// It also checks the time needed to do this and records the worst case.
func storeSlice(c *circularBuffer, highID int) {
    start := time.Now()
    c[highID%bufferLen] = newSlice(highID)
    elapsed := time.Since(start)

    candElapsed := time.Since(allStart) // Record location of worst in trace
    if elapsed > worst {
        worst = elapsed
        worstIndex = highID
        worstElapsed = candElapsed
    }
    total = time.Duration(total.Nanoseconds() + elapsed.Nanoseconds())
}

func main() {
    trace.Start(os.Stderr)
    allStart = time.Now()
    defer trace.Stop()
    var c circularBuffer
    for i := 0; i < msgCount; i++ {
        storeSlice(&c, i)
    }
    fmt.Println("Worst push latency: ", worst)
    fmt.Println("Worst push index: ", worstIndex)
    fmt.Println("Worst push occurs at run elapsed time: ", worstElapsed)
    fmt.Println("Average push latency: ", time.Duration(total.Nanoseconds()/msgCount))
    fmt.Println("Sizeof(circularBuffer) = ", unsafe.Sizeof(c))
    fmt.Println("Approximate live memory = ", unsafe.Sizeof(c)+bufferLen*1024)
}

What did you expect to see?

I expected to see a sub-millisecond worst-case latency; the average time to initialize a new slice, in the absence of GC, is about a microsecond (on a 2017 Mac laptop).

What did you see instead?

Worst-case latencies on the order of 4-10ms.

I've attached the trace file for the following run:

go run hello.go 2> trace2.out
Worst push latency:  5.193319ms
Worst push index:  780507
Worst push occurs at run elapsed time:  995.334915ms
Average push latency:  1.018µs
Sizeof(circularBuffer) =  4800000
Approximate live memory =  209600000

The worst case latency ends at 995ms, corresponding to a single 5ms mark assist. A zoom of the trace displaying this is also attached.

trace2.out.gz

screen shot 2018-09-18 at 11 17 12 am

thepudds commented 6 years ago

I think #16528 (now closed) might have been cited in discussion of a previous incarnation of this GC benchmark.

Also probably related is this prior blog post: https://making.pusher.com/golangs-real-time-gc-in-theory-and-practice/ (where that blog references #16528 as an unfixed cause of the behavior seen in that blog).

And also cited in that blog is this mailing list conversation from 2016: https://groups.google.com/d/msg/golang-nuts/nOD0fGmRp_g/b_FWITKtBQAJ

thepudds commented 6 years ago

Expanding my comment slightly:

The benchmark code appearing within the text of that 2016 blog post (https://making.pusher.com/golangs-real-time-gc-in-theory-and-practice/) seems to match the benchmark code that @dr2chase says he used as his starting point above (https://github.com/WillSewell/gc-latency-experiment). Both are from the same original author.

That 2016 blog reported ~7 ms pause time for Go 1.7.3, and at the time #16528 was theorized as the root cause of why the results weren't better (e.g., see the "Why Are the Go Results Not Better?" section there).

In any event, mainly just wanted to mention the older investigation in case that is useful (and apologies if not useful).

dr2chase commented 6 years ago

A bit more info: it turns out that in the original, the 4,800,000-byte circular buffer was allocated on the stack, and large stack frames are not handled incrementally in the same way that large heap objects are.

Modifying the benchmark to allocate the circular buffer and store the pointer in a global brings the worst-case latency down to 2ms, which is better, though still far worse than expected. In the snapshot from the trace, you can see that the GC work is now shared among several threads, but the worker thread is running mark assist 100% of the time during that interval.

screen shot 2018-09-19 at 9 57 31 am

A different modification, moving the declaration of var c circularBuffer to a global, also shortens the worst-case latency in the same way, again with 100% mark assist for about 2ms; both variants are sketched below.

screen shot 2018-09-19 at 10 11 37 am
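For concreteness, here is a minimal sketch of the two variants, reusing the types, constants, and storeSlice function from the benchmark above; the names cHeap, cGlobal, runVariant1, and runVariant2 are illustrative, not taken from the actual modified benchmark.

// Variant 1: heap-allocate the buffer and keep only a pointer in a global,
// so main's stack frame no longer contains the 4.8MB array.
var cHeap = new(circularBuffer)

// Variant 2: declare the buffer itself at package scope instead of as a local in main.
var cGlobal circularBuffer

func runVariant1() {
    for i := 0; i < msgCount; i++ {
        storeSlice(cHeap, i)
    }
}

func runVariant2() {
    for i := 0; i < msgCount; i++ {
        storeSlice(&cGlobal, i)
    }
}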

Still to investigate:

  1. is it really policy that the mutator should get hammered like that?
  2. why isn't GC starting earlier to avoid the need for mark assist? That's supposed to happen for stable allocation rates, and this benchmark's allocation rate is stable.

Once mark assist is dealt with, this microbenchmark is likely to have problems with long sweeps, a different, known bug ( #18155 ) that I've also seen in some of these traces. That looks like:

screen shot 2018-09-19 at 10 29 46 am

dr2chase commented 6 years ago

New summary of apparent subissues:

  1. Stack allocations are not divisible work; they can take a long time to process.
  2. Mark assist applied to global and stack allocations (generally, roots) doesn't generate any credit.
  3. Sweeping problem (#18155)
  4. A reschedule (i.e., when the quantum expires) that discovers a need for GC does not wake other Ps (or does not check the global run queue); i.e., the amount of work to do has grown by one, so if any Ps are idle, one of them should be wakened to take over the "real work".
  5. (Related to 4?) Dedicated worker doesn't kick goroutines out of its local run queue.

Remaining mystery: For a stable allocation rate, GC should start soon enough that mark assist is not required. Why doesn't it?

For reliable measurement of steady-state latencies, the benchmark ought to do a warmup run first, because rapid heap growth around startup is more likely to provoke mysterious OS interference with progress.
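A possible shape for that warmup, sketched against the benchmark above; only the second pass is traced and measured, and resetting the recorded stats between passes is the simplest thing that could work.

func mainWithWarmup() {
    var c circularBuffer

    // Warmup pass: populate the buffer and let the heap reach a steady state.
    for i := 0; i < msgCount; i++ {
        storeSlice(&c, i)
    }

    // Discard the latencies recorded during warmup.
    worst, total, worstIndex, worstElapsed = 0, 0, 0, 0

    // Measured pass, traced as before.
    trace.Start(os.Stderr)
    defer trace.Stop()
    allStart = time.Now()
    for i := 0; i < msgCount; i++ {
        storeSlice(&c, i)
    }
    fmt.Println("Worst push latency: ", worst)
}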

gopherbot commented 6 years ago

Change https://golang.org/cl/136555 mentions this issue: runtime: experimental GC-related changes to scheduling

dr2chase commented 6 years ago

Lack of credit for mark assist of roots is a contributor to long pauses. An experiment where root assist halves the distance to completion does much better on pause times (when the other delays are addressed, either using the above CL's hacks or other unspeakable hacks), even down to GOMAXPROCS=2.

gopherbot commented 5 years ago

Change https://golang.org/cl/146817 mentions this issue: runtime: look for idle p to run current thread when quantum expires into GC

dr2chase commented 5 years ago

It turns out it is in fact policy to allocate things up to 10MB in size on the stack, if they are plain vars and not explicit allocations. See gc.maxStackVarSize in gc/go.go.
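A tiny illustration of that distinction (the 4MB size is arbitrary, just something below the limit; the exact threshold is a compiler detail):

// A plain var below the limit is eligible for stack allocation if it does not
// escape, even though it is several megabytes -- and a several-megabyte frame
// is scanned as one indivisible piece of GC work.
func plainVar() byte {
    var big [4 << 20]byte
    return big[0]
}

// An explicit allocation that escapes goes to the heap, where a large object
// can be marked incrementally.
func explicitAlloc() *[4 << 20]byte {
    return new([4 << 20]byte)
}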

LK4D4 commented 5 years ago

I didn't find a better-fitting issue to describe my problem; please let me know if I should open another issue. Basically, we're migrating from go1.10 to go1.12 now, and I noticed that tail latencies (p95 and p99) went sharply up (200ms -> 250ms). This service had GC problems before, so that's what I suspected, and I collected a trace. What I found there is that some goroutines are blocked for up to 25ms by MARK ASSIST (unfinished), and some of them are forced to assist 5-6 times in a row even when they're successful. I understand that it might not be the source of our latency jump, but I didn't find anything else. Here is the trace part I was looking at (proc 8 in particular): image

In go1.10 we had a similar problem with SWEEP STW (https://github.com/golang/go/issues/18155#issuecomment-416328575) when we had a lot of same-sized objects in the heap, which we "fixed" by moving most of the objects to sync.Pool.

Please let me know if I can provide more info; I would appreciate any suggestions on how to mitigate this issue, because I wouldn't like to halt our migration :(

Edit: I forgot to mention that we use go1.12.5

LK4D4 commented 5 years ago

@dr2chase @aclements please let me know if I could help with debug or if you have any ideas what allocation patterns could cause that.

dr2chase commented 5 years ago

Sorry not to reply earlier; I haven't looked at this for a little while. In terms of bugs-we've-got, that matches the subproblems listed above.

I don't know of an automated way to check for that, but if you think that's happening, you can force things to the heap by assigning their address to a global var sink interface{} and then niling the global later. Maybe we need to put together a GOEXPERIMENT that is less excited about stack-allocating big things that contain pointers, so you can more easily determine if this is the problem.

(I just looked at the "Run Vitess Locally" instructions; that's a lot, and I would probably need that GOEXPERIMENT myself anyway. How do you feel about lighting this up with 1.13 for testing purposes?)

LK4D4 commented 5 years ago

@dr2chase Our project isn't exactly vitess; it uses only the go/mysql part of vitess, but it has a lot of other stuff. I can try 1.13 to see if that helps. If the problem isn't easy to avoid, we will probably wait for 1.13 anyway, so we could start early. Thanks! /cc @sougou as it might affect vitess

gopherbot commented 5 years ago

Change https://golang.org/cl/180817 mentions this issue: cmd/compile: add -smallstacks gc flag for GC latency diagnosis

dr2chase commented 5 years ago

@LK4D4 I just submitted a change to 1.13 tip that might help you diagnose this (depending on the complexity of your build process). Crudely, go build -gcflags 'all=-smallframes' hello.go will force larger things to be allocated on the heap instead of the stack, which works around two flaws in allocating and accounting for mark assist work (flaw 1: stack scanning is a large indivisible chunk of work; flaw 2: the mark assistant gets zero credit for doing stack scanning). If this makes your latency get better, the next problem is to figure out which stack frames cause the problem. Once you know that, you can disable stack allocation by leaking the address of the giant object (store it to a global and later nil the global -- or call a global pointer to a no-op function, passing it the address; that is friendlier to the race detector).
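A rough sketch of that leak-the-address workaround; sink, keepAlive, and the 1MB array are illustrative placeholders for whatever object the compiler is stack-allocating in your code:

var sink interface{}

// A global function value that does nothing; calling through it leaks the
// argument's address and is friendlier to the race detector than the sink.
var keepAlive = func(p interface{}) {}

func hot() {
    var big [1 << 20]byte // the object you want to force off the stack

    sink = &big // leaking the address makes big escape to the heap
    // ... use big ...
    sink = nil // later, drop the reference so the object can be collected

    // Alternative: pass the address to the global no-op function instead.
    keepAlive(&big)
}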

You can look for differences by comparing the output of -gcflags 'all=-m -smallframes' with that of -gcflags 'all=-m'. That will be horribly verbose, though the differences should not be. Anything appearing in the differences is a somewhat-large allocation.

Austin's unproven-but-plausible theory for this is that between 1.10 and 1.12 one of the incremental improvements in escape analysis let something large be allocated on the stack.

gopherbot commented 5 years ago

Change https://golang.org/cl/180958 mentions this issue: cmd/compile: correct capitalization in recordFlags parameter

dr2chase commented 5 years ago

Further info: besides incremental changes to escape analysis, incremental changes to inlining could also cause new things to not escape and become stack-allocated.

I also did a quick round of diffs, building the compiler this way and building a collection of benchmarks this way, and large frames are exceedingly rare. This might motivate some other approach to the indivisible-no-credit stack-scanning problem: make small frames the default, and when large objects are stack-eligible, have the compiler explicitly heap-allocate and deallocate them instead.

gopherbot commented 2 years ago

Change https://golang.org/cl/372256 mentions this issue: benchmarks/gc_latency: add new microbenchmark for tricky mutator latency issues

gopherbot commented 2 years ago

Change https://go.dev/cl/431877 mentions this issue: runtime: export total GC Assist ns in MemStats and GCStats