golang / go

The Go programming language
https://go.dev

runtime: bad performance of core GO features (GC, go routines, channels, map) #32195

Closed: StephanVerbeeck closed this issue 5 years ago

StephanVerbeeck commented 5 years ago

We measured the impact of GC relative to the number of persistent heap objects. An existing interactive accounting program uses 44K to 90K persistent heap objects; the next multiplatform distributed VR/AR program will use at least 1000K persistent heap objects. We tested by adding an extra background function BurnCPU() that allocates small data structures and keeps less than 1% of the created structures persistent.

Environment

Expected

Observed (After adding "go BurnCPU()" as first line of the main() function)

Benchmark code:

import (
    "runtime"
    "sync"
)

func main() {
    go BurnCPU()
    // <- code of existing GUI application remains here
}

func BurnCPU() {
    type aha struct{ a, b int }
    oho := map[int]*aha{}
    com := make(chan *aha, 100)
    wg := sync.WaitGroup{}
    go func() {
        for {
            c := <-com
            oho[c.a+c.b] = c
        }
    }()
    for a := 1; a < 1e5; a++ {
        wg.Add(1)
        go func(a int) {
            for b := 1; b < 1e5; b++ {
                b += 1234
                c := new(aha) // allocate some memory to cause GC
                c.a = 10 * (a)
                c.b = 10 * (b - 1234)
                b -= 1234
                if b%1000 == 0 {
                    com <- c
                    runtime.Gosched()
                }
            }
            wg.Done()
        }(a)
        runtime.Gosched()
    }
    wg.Wait()
    memStats := runtime.MemStats{}
    runtime.ReadMemStats(&memStats)
    return // <- set debug break here and examine memStats
}

Conclusions:

Variants:

Assumptions based on these observations:

mvdan commented 5 years ago

Just want to point out that "abominable" is not a good way to start a discussion about performance. Constructive issues get attention and often positive outcomes, but this doesn't seem constructive to me.

Also, it would be much better to raise issues about specific performance problems you've encountered, and reliable ways to reproduce them. This issue is quite massive, so it would be hard to stay on topic and tackle any specific problems separately.

StephanVerbeeck commented 5 years ago

I have done some deeper digging and (although still speculative) may have found a reason for the bad performance that cannot be traced back to the software. It might be a pure hardware problem. Go "claims" to have no problem with numbers of goroutines that far exceed the number of available CPU cores.

To a degree that is accurate, since goroutines are "distributed" over a fixed number of worker threads (about as many as there are CPU cores). HOWEVER . . . there is still an excessive amount of context switching within such a worker thread, and each time a switch occurs the data in the L1/L2/L3 cache of the CPU core running that worker thread becomes "useless". The excessive context switching therefore prevents the CPU cache from doing its normal job.

With normal OS round-robin scheduling, the switches between processes are determined by the OS and happen far less often (on the order of, e.g., 50x per second), so the cache is only invalidated every 20 msec or so. Go has broken this rule and can have (depending on the application) 1000x more context switches per worker thread per second.

I have tried to test this hypothesis by adding...

    for runtime.NumGoroutine() > runtime.NumCPU() {
        time.Sleep(1e6) // 1 msec
    }

... to the loop that launches the goroutines. So BurnCPU() now looks like this:

func BurnCPU() {
    type aha struct{ a, b int }
    oho := map[int]*aha{}
    com := make(chan *aha, 100)
    wg := sync.WaitGroup{}
    go func() {
        for {
            c := <-com
            oho[c.a+c.b] = c
        }
    }()
    outer := int(1e3)
    inner := int(1e7)
    wg.Add(outer)
    for a := 1; a <= outer; a++ {
        go func(a int) {
            for b := 1; b <= inner; b++ {
                b += 1234
                c := new(aha) // allocate some memory to cause GC
                c.a = 10 * (a)
                c.b = 10 * (b - 1234)
                b -= 1234
                if b%1000 == 0 {
                    com <- c
                }
            }
            wg.Done()
        }(a)
        for runtime.NumGoroutine() > runtime.NumCPU() {
            time.Sleep(1e6) // 1 msec
        }
    }
    wg.Wait()
}

The results now look better and the blocking of the interactive part of the program is gone. So this is good, BUT there is no way of knowing how many of the runtime.NumGoroutine() goroutines are blocked (e.g. waiting for channel input), so this code is a potential DEADLOCK. To make this solution work properly, "runtime.NumGoroutine()" would have to be "runtime.NumUnblockedGoroutine()" (returning the number of goroutines that are running plus the number that are pending to run, so not counting goroutines that are waiting for an IO/timer event).

The ideal solution would be for this kind of logic to be part of the "go" statement itself. To avoid backward-compatibility issues, maybe add a boolean flag that enables it via a call to e.g. "runtime.SetIdleGO(state bool) bool".

So the basic idea is that any "go" statement would block the goroutine that executes it until the number of already-active goroutines drops below the number of available worker threads. How this influences existing code and benchmarks would of course have to be investigated first.

Another alternative could be to add a function "WaitGroup.Go(func)" that does this checking based only on the number of goroutines in the WaitGroup. The WaitGroup.Add(1) would then be done by WaitGroup.Go(func), and you would only have to call WaitGroup.Done() at the end of the started goroutine.
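
For what it's worth, golang.org/x/sync/errgroup has since grown essentially this shape of API: Group.Go plus SetLimit, where Go blocks the spawning goroutine once the configured number of goroutines is already running. A minimal sketch (the workload is a placeholder):

package main

import (
    "runtime"

    "golang.org/x/sync/errgroup"
)

func main() {
    var g errgroup.Group
    g.SetLimit(runtime.NumCPU()) // at most NumCPU goroutines run at once

    for a := 1; a <= 1000; a++ {
        a := a // capture the loop variable (needed before Go 1.22)
        g.Go(func() error {
            _ = a * a // CPU-bound work stands here
            return nil
        })
    }
    _ = g.Wait() // blocks until all submitted functions have returned
}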

ALTree commented 5 years ago

The implementation of your BurnCPU function doesn't make sense to me. Why are you spawning 10000 goroutines? In a hypothetical C implementation of BurnCPU, would you spawn 10000 OS threads? No, obviously... that would just completely thrash the OS scheduler without actually helping you burn CPU in any way. But in your Go code, you're doing exactly that.

GO "claims" to have no problems with numbers of go routines that far exceeds the number of available CPU cores.

This is only true when the goroutines are spawned to execute lightweight tasks, or tasks that block. In every other case, it makes no sense to spawn more goroutines than the cores you have on your machine. Experienced Go developers are well aware of this.

The number of goroutines tasked with executing heavy workloads (like "burning" CPU in your test) needs to be limited to the number of cores on your machine, if only because you cannot do more CPU-heavy concurrent work than you have cores anyway.

Your BurnCPU function needs to be changed so that it only spawns a number of goroutines exactly equal to the number of threads that can execute concurrently on your machine (e.g. 4 or 8). Otherwise you're just thrashing the goroutine scheduler for no reason. The goroutines in charge of the GUI then get trapped in the 10000-goroutine scheduler-hell you created, and the UI won't be responsive.

In general, from the code and the message you posted, it appears that you have a very superficial understanding of how Go works and, more importantly, of how Go primitives like goroutines should be used.

I encourage you to gain more experience in writing idiomatic, high-performance Go code before you trust yourself with an opinion about whether the language is up to a specific task you have to accomplish. I'm not saying Go is definitely the right choice for your application: maybe it's not.

networkimprov commented 5 years ago

Preemptive goroutine scheduling is in progress, probably landing in 1.14, see #24543.

Also, channels are not the fastest possible way to synchronize goroutines. The sync and sync/atomic packages are more efficient, if harder to get right. "Sharing memory by communicating" only takes you so far :-)
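
For instance, a shared counter that would otherwise be funneled through a channel can be a single atomic add; a minimal sketch (names illustrative):

package main

import (
    "fmt"
    "sync"
    "sync/atomic"
)

func main() {
    var persisted int64 // shared counter, no channel needed
    var wg sync.WaitGroup
    for i := 0; i < 8; i++ {
        wg.Add(1)
        go func() {
            defer wg.Done()
            for j := 0; j < 1000; j++ {
                atomic.AddInt64(&persisted, 1) // lock-free increment
            }
        }()
    }
    wg.Wait()
    fmt.Println(atomic.LoadInt64(&persisted)) // 8000
}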

And if I may, I'd encourage everyone to be more gentle in tone.

StephanVerbeeck commented 5 years ago

Why are you spawning 10000 goroutines? In a hypothetical C implementation of BurnCPU, would you spawn 10000 OS threads? No, obviously... that would just completely thrash the OS scheduler without actually helping you burn CPU in any way. But in your Go code, you're doing exactly that.

The constructed benchmark is equivalent in scale and load to the target application, which maintains a distributed AR scene with a persistent shape count in the range of 1 to 100 million objects, each having its own local methods that can use goroutines.

The abilities of C++ on both CPU and GPU are in that performance range. The purpose of a benchmark is also to see how the system scales up, so why would we do a small-scale test?

ALTree commented 5 years ago

1 to 100 million objects, each having its own local methods that can use goroutines.

It's hard to make sensible comments about the kind of high-level architecture an application of this complexity should have, but as a general observation: if those millions of objects could potentially fire up CPU-intensive tasks, each in their own goroutine, concurrently, I don't think this kind of architecture can or will work at all, for the reasons I stated above.

So basically, it appears that the main issue here is that your Go CPUBurner prototype models an architecture that is fundamentally unsuitable for the kind of application you'd like to build. My impression is that the fact that Go provides certain features (like goroutines and channels) at the language level has misled you into producing a design that is not suitable for your needs, and more importantly:

The abilities of C++ both on CPU and GPU are in that performance range.

the design of your Go prototype is fundamentally different from the one you would produce if you had to code this in C/C++, since I'm sure that if you were building this in C++ you wouldn't spawn a thread of computation every time one of the millions of live objects needs to do something, and you wouldn't expect the application to keep working while thousands of these threads are all executing at any given moment.

While it's true that spawning goroutines to perform atomic tasks and communicating over channels is often the idiomatic way to write Go code, this is not always so. Sometimes the kind of application you need to build forces you to design an architecture closer to the one you'd design when writing C/C++ code.

randall77 commented 5 years ago

I'm not surprised that your GUI response is terrible. You have 10000 goroutines, and only one of them is the GUI goroutine. The runtime has no way to know which of those 10000 is your latency-critical one. The scheduler tries to be at least somewhat fair to all the goroutines, which means your GUI one is only getting 1/10000 of the machine. A fix for this kind of application would require goroutine priorities. We don't have that at the moment, but you can simulate it with runtime.LockOSThread and the OS priority mechanisms, as sketched below.
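
A hedged sketch of that workaround (the function name and event-loop callback are illustrative; the actual priority call is platform-specific and only indicated as a comment):

package main

import "runtime"

// runGUI pins the latency-critical GUI goroutine to its own OS thread
// so the OS scheduler, rather than the Go scheduler, arbitrates it.
func runGUI(eventLoop func()) {
    runtime.LockOSThread() // this goroutine now owns the current thread
    defer runtime.UnlockOSThread()
    // Optionally raise the thread's priority through the OS, e.g. the
    // Win32 SetThreadPriority call on Windows (platform-specific, not shown).
    eventLoop()
}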

reading and writing from channels (according to the documentation) invokes co-operative multitasking, so adding "runtime.Gosched()" at that point should have no effect, but clearly it does have an effect, so the documentation is WRONG. Doing IO is not sufficient to avoid having to add "runtime.Gosched()" in loops.

You're not reading and writing channels in your loop. The only operation in the loop is a waitgroup add and a goroutine spawn. The operations inside the goroutine don't count.

Nevertheless, your use of runtime.Gosched isn't really necessary. It is only required in tight loops that contain no other calls. But your usage does have the side effect of lowering the priority of the spawning loop in BurnCPU, which helps limit the pool of runnable goroutines (by running the ones you spawned and getting them to a blocking channel op before spawning more) and hence gives the GUI goroutine a larger fraction of the CPU.

adding "runtime.Gosched()" at that point should have no effect

I'm not sure where you are getting that from. Gosched is not required, but I'm not sure where you're getting the assertion that it should have no effect at all. It certainly can, and currently acts as a signal that other goroutines should execute in preference to the calling one (as described in the godoc for Gosched).

observed values of memstats percentage of CPU spent on GC and cumulative duration of GC stops do not match (indicated 73% CPU usage during 200 sec while cumulative time showed 0.47 sec)

The GC is concurrent with program execution. The GC stops are very short and most of the GC work is done while the application is also running. That may explain the numbers you're seeing.

usage of a map relative to a fixed array with modulus index (simulating the map) caused serious performance degradation in excess of 18x (unexpected, since the map has no locking)

Maps are definitely slower than arrays. 18x does not sound unreasonable. Does that mean maps are really slow, or arrays are really fast? Array access can be as little as one instruction. Map access is at least a function call, and is hundreds of instructions even on the fast path.
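
That is easy to sanity-check with a stdlib benchmark; a minimal sketch (sizes arbitrary; save in a _test.go file and run "go test -bench ."):

package main

import "testing"

const size = 1 << 16

var (
    arr  [size]int
    m    = make(map[int]int, size)
    sink int // keeps the compiler from optimizing the loads away
)

func init() {
    for i := 0; i < size; i++ {
        arr[i] = i
        m[i] = i
    }
}

func BenchmarkArrayIndex(b *testing.B) {
    for i := 0; i < b.N; i++ {
        sink = arr[i&(size-1)] // one masked load
    }
}

func BenchmarkMapIndex(b *testing.B) {
    for i := 0; i < b.N; i++ {
        sink = m[i&(size-1)] // hash, probe, and a function call
    }
}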

Documented advances in reducing the delay time of GC "stop the world" are "misleading". The advance is due to the way the stops are measured (making GC calls intersect with idle time of the application). However, functions that do not actively WORK with this logic (by calling "runtime.Gosched()" all the time) seem to block the start of GC cleanup WHILE BLOCKING ALL OTHER GOROUTINES. So this documented method of measuring GC performance is misleading (to put it politely).

I'm not sure how this relates to the benchmark provided. How are you determining that GC cleanup is blocking all other goroutines? It shouldn't, unless there's a bug somewhere. If there's a bug, we'd need to understand how you're reaching that conclusion. Ideally, with a self-contained example that we can run.

The results now look better and the blocking of the interactive part of the program is gone. So this is good, BUT there is no way of knowing how many of the runtime.NumGoroutine() goroutines are blocked (e.g. waiting for channel input), so this code is a potential DEADLOCK. To make this solution work properly, "runtime.NumGoroutine()" would have to be "runtime.NumUnblockedGoroutine()" (returning the number of goroutines that are running plus the number that are pending to run, so not counting goroutines that are waiting for an IO/timer event).

Something like this would be really nice. We'd love to be able to stop the "spawning goroutine" when we have enough ready goroutines to saturate the CPUs, and only resume it when some of the spawned goroutines block or finish. Unfortunately, there's no easy way to look into the future behavior of goroutines; maybe the goroutine that was spawning is actually done spawning and will next do something latency-sensitive. There might be heuristics we could apply, but it's tricky.

StephanVerbeeck commented 5 years ago

So this is good, BUT there is no way of knowing how many of the runtime.NumGoroutine() goroutines are blocked (e.g. waiting for channel input), so this code is a potential DEADLOCK. To make this solution work properly, "runtime.NumGoroutine()" would have to be "runtime.NumUnblockedGoroutine()" (returning the number of goroutines that are running plus the number that are pending to run, so not counting goroutines that are waiting for an IO/timer event).

Something like this would be really nice. We'd love to be able to stop the "spawning goroutine" when we have enough ready goroutines to saturate the CPUs, and only resume it when some of the spawned goroutines block or finish. Unfortunately, there's no easy way to look into the future behavior of goroutines; maybe the goroutine that was spawning is actually done spawning and will next do something latency-sensitive. There might be heuristics we could apply, but it's tricky.

Agreed, but that also applies to "runtime.NumGoroutine()", as the number of goroutines can increase or decrease between the moment we test it and the moment we launch our own goroutine. From what I measured today, having slightly more active goroutines than worker threads has little impact. Performance peaks as long as the number of active goroutines fluctuates close to the number of worker threads.

Without throttling logic such as ...

func ThrottleGO() {
    maxGoroutine := runtime.NumCPU() * 2
    nanoSec := time.Duration(0)
    for runtime.NumGoroutine() > maxGoroutine {
        time.Sleep(nanoSec)
        if nanoSec < 1e8 { // 100msec
            nanoSec += 1e5 // 0.1msec
        }
    }
}

... the amount of memory used for the same workload fluctuates wildly. With ...

outer := int(1e3)
inner := int(1e7)

...the amount of used memory (privately allocated virtual pages) is 12 GB, while with...

outer := int(1e7)
inner := int(1e3)

...the amount of used memory (privately allocated virtual pages) is 260 MB, even though in theory the amount of CPU or memory used should not change.

StephanVerbeeck commented 5 years ago

Finished the optimisation by making my own version of the goroutine launch logic. Sharing the code here; average speedup (tested with various combinations of load) is >300x.

func burnCPU() {
    type aha struct{ a, b int }
    com := make(chan *aha, 100)
    throttle := Throttle{}
    for i := 1; i <= 10; i++ {
        go func() {
            oho := map[int]*aha{}
            for {
                c := <-com
                if c == nil {
                    break
                }
                oho[c.a+c.b] = c
            }
        }()
    }
    outer := int(1e5)
    inner := int(1e5)
    throttle.Expect(outer)
    for a := 1; a <= outer; a++ {
        a := a // copy the loop variable here; copying inside the closure would be too late (pre-Go 1.22)
        throttle.Go(func() {
            for b := 1; b <= inner; b++ {
                b += 1234
                c := new(aha) // allocate some memory to cause GC
                c.a = 10 * (a)
                c.b = 10 * (b - 1234)
                b -= 1234
                if b%100 == 0 {
                    com <- c
                }
            }
        })
    }
    routines := runtime.NumGoroutine()
    _ = routines
    throttle.Wait()
    close(com)
    routines = runtime.NumGoroutine()
    memStats := runtime.MemStats{}
    runtime.ReadMemStats(&memStats)
    return // <- set debug break here and examine memStats
}

Throttle.go

package main

// Alternative to sync.WaitGroup that adds automatic memory and CPU throttling

import (
    "runtime"
    "sync/atomic"
)

var (
    throttleNextGo Barrier
)

// Throttle limits CPU and memory usage by capping the number of concurrent goroutines
type Throttle struct {
    expected, started, done, max uint32
}

// Expect sets the number of goroutines that will be launched (only required when calling Wait() before all of them have started)
func (tg *Throttle) Expect(expected int) {
    tg.expected = uint32(expected)
}

// Go starts action as a goroutine, blocking first until a worker slot is free
func (tg *Throttle) Go(action func()) {
    if tg.max == 0 {
        tg.max = uint32(runtime.NumCPU())
    }

    for tg.started-tg.done >= tg.max { // unsynchronized read; good enough as a throttling heuristic
        throttleNextGo.Wait(100)
    }

    doit := func() {
        defer func() {
            atomic.AddUint32(&tg.done, 1)
            throttleNextGo.Open()
        }()
        action()
    }
    atomic.AddUint32(&tg.started, 1)
    go doit()
}

// Wait till the last goroutine is done (must have started at least 1 goroutine)
func (tg *Throttle) Wait() {
    for tg.started > tg.done || tg.max == 0 || (tg.expected != 0 && tg.done < tg.expected) { // all routines done after starting at least one
        throttleNextGo.Wait(100)
    }
    *tg = Throttle{} // reset to make next Wait() work in case same throttle is used again
}

Barrier.go

package main

import (
    "golang.org/x/sys/windows"
)

/*
Barrier implements the ability to synchronize with other threads by waiting for the barrier until another thread opens it.
With oneByOne set, each Open releases only a single waiting thread; otherwise all currently waiting threads are released.
NOTE: The barrier does not remain open when nobody is waiting for it. Only threads that are already waiting when the barrier opens can pass through!
*/
type Barrier struct {
    handle   windows.Handle
    oneByOne bool
}

// Open wakes up threads that are waiting for the barrier
func (b *Barrier) Open() {
    if b.handle == windows.Handle(0) {
        b.handle, _ = windows.CreateEvent(nil, 0, 0, nil)
    }
    if b.oneByOne {
        windows.SetEvent(b.handle) // Release a single waiting thread
    } else {
        windows.PulseEvent(b.handle) // Release all waiting threads
    }
}

// Wait will suspend the current thread until the barrier is opened by another thread; returns false on timeout
func (b *Barrier) Wait(msec int) bool {
    if b.handle == windows.Handle(0) {
        b.handle, _ = windows.CreateEvent(nil, 0, 0, nil) // create lazily, as in Open, so a Wait before the first Open blocks instead of failing immediately
    }
    stat, _ := windows.WaitForSingleObject(b.handle, uint32(msec))
    return stat == windows.WAIT_OBJECT_0
}

// Cleanup will release used system resources
func (b *Barrier) Cleanup() {
    if b.handle != windows.Handle(0) {
        windows.CloseHandle(b.handle)
        b.handle = windows.Handle(0)
    }
}
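
For reference, roughly the same "pulse" semantics (wake everyone currently waiting; later arrivals block until the next Open) can be sketched portably with sync.Cond from the standard library. This is an illustration rather than a drop-in replacement, notably because sync.Cond offers no timeout, unlike WaitForSingleObject:

package main

import "sync"

// CondBarrier sketches Barrier's pulse mode: Open wakes every goroutine
// currently blocked in Wait; goroutines arriving later wait for the next Open.
type CondBarrier struct {
    mu  sync.Mutex
    c   *sync.Cond
    gen uint64 // incremented on every Open
}

func NewCondBarrier() *CondBarrier {
    b := &CondBarrier{}
    b.c = sync.NewCond(&b.mu)
    return b
}

// Open releases all goroutines currently blocked in Wait.
func (b *CondBarrier) Open() {
    b.mu.Lock()
    b.gen++
    b.mu.Unlock()
    b.c.Broadcast()
}

// Wait blocks until the next Open.
func (b *CondBarrier) Wait() {
    b.mu.Lock()
    g := b.gen
    for b.gen == g {
        b.c.Wait()
    }
    b.mu.Unlock()
}
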
networkimprov commented 5 years ago

What is the key to the latter code's success? Using the Windows API for scheduling?

Have you used the stdlib testing.B benchmark feature? Could it apply here? I'd find it easier to evaluate the slow and fast samples if they were compared one after the other. Also it would help to benchmark just the scheduling mechanism, as any workload to be scheduled takes the same time in either case.

Why do you need func doit (vs go func(){...}()), and why does it have a deferred call after a single stmt?

StephanVerbeeck commented 5 years ago

What is the key to the latter code's success?

Avoiding the situation where the L1/L2/L3 cache of the processor goes unused due to an extreme number of context switches per second. The main logic of the solution is that the start of the next goroutine is delayed until one of the processor cores becomes available.

So if you need to start 1000 goroutines and have an 8-core processor, the first 8 start immediately and the 9th must wait until one of the first 8 is done; then, every time one of the goroutines finishes, the next one starts. If the application goroutines have idle time in them, throttle.max should be manually set to a value higher than the number of processor cores, and the optimal value then depends on what the started goroutines actually do. E.g. if 50% of their time is not computational, you could set "throttle.max = uint32(runtime.NumCPU() * 2)".

Using the Windows API for scheduling?

Barrier.go only exists because neither channels nor any of the sync objects worked for this purpose. It seems Go only has locking OS features but does not implement OS signalling of any kind. That the solution shown here uses the Windows API (and is not multiplatform) is due to the fact that the existing native comparison code was ported from Windows C++ to Go.

Have you used the stdlib testing.B benchmark feature?

No, I used a stopwatch for timing and the Windows task manager to look at the amount of private allocated virtual memory.

Could it apply here?

Yes, but if you have severe software timing troubles, measuring time durations in software is not guaranteed to be accurate.

I'd find it easier to evaluate the slow and fast samples if they were compared one after the other. Also it would help to benchmark just the scheduling mechanism, as any workload to be scheduled takes the same time in either case.

Yes deeper investigation is required.

Why do you need func doit (vs go func(){...}()), and why does it have a deferred call after a single stmt?

The defer is needed to make sure that the interlocked increment of the "done" counter also happens in case there was an error in the execution of the passed action. That is cleanup code that must always be executed, and hence there is one extra level of call on the stack.

networkimprov commented 5 years ago

Did you try using a buffered channel as a semaphore, like:

sem := make(chan struct{}, runtime.NumCPU())
for i := 0; i < runtime.NumCPU(); i++ {
    sem <- struct{}{} // load the semaphore
}

Then let each goroutine get/put the channel on entry/exit. That will limit the number of running goroutines.
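
Spelled out, that pattern looks like this (a minimal sketch; the squaring stands in for real work):

package main

import (
    "runtime"
    "sync"
)

func main() {
    max := runtime.NumCPU()
    sem := make(chan struct{}, max)
    for i := 0; i < max; i++ {
        sem <- struct{}{} // load the semaphore
    }

    var wg sync.WaitGroup
    for a := 1; a <= 1000; a++ {
        wg.Add(1)
        go func(a int) {
            defer wg.Done()
            <-sem // entry: take a slot; at most max goroutines proceed
            defer func() { sem <- struct{}{} }() // exit: return the slot
            _ = a * a // CPU-bound work stands here
        }(a)
    }
    wg.Wait()
}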

creker commented 5 years ago

So if you need to start 1000 goroutines and have an 8-core processor, the first 8 start immediately and the 9th must wait until one of the first 8 is done; then, every time one of the goroutines finishes, the next one starts.

So, basically, you implemented a goroutine pool. There are countless libraries that let you do exactly that. As was said earlier, the scheduler tries to be fair and gives execution time to every goroutine. Just as you wouldn't start 1000 CPU-intensive OS threads in, say, C++, you shouldn't start 1000 CPU-intensive goroutines. You would use a thread/goroutine pool. Go is not doing any magic here.

The general rule is pretty simple. For IO-intensive work where you mostly wait on IO, you can spawn thousands of goroutines. For CPU-intensive work, you spawn as many goroutines as you have CPU cores/threads.
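
That rule is what a classic worker pool expresses; a minimal sketch:

package main

import (
    "runtime"
    "sync"
)

func main() {
    jobs := make(chan int)
    var wg sync.WaitGroup

    // one worker per CPU thread for CPU-bound work
    for w := 0; w < runtime.NumCPU(); w++ {
        wg.Add(1)
        go func() {
            defer wg.Done()
            for j := range jobs {
                _ = j * j // CPU-bound work stands here
            }
        }()
    }

    for j := 0; j < 1000000; j++ {
        jobs <- j
    }
    close(jobs) // workers drain the channel and exit
    wg.Wait()
}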

StephanVerbeeck commented 5 years ago

The lack of insight in the responses here (including the number of thumbs-up for imaginary scenarios that didn't work) is really embarrassing. Please try your suggestions before you make them.

creker commented 5 years ago

What's there to try? Your case is a pretty simple one: a bunch of CPU work. The solution is also pretty simple and well known: a thread pool. Unless your code is doing something else you didn't mention, it's just a thread pool. You said it yourself.

mvdan commented 5 years ago

@StephanVerbeeck once again, please keep it civil. The people in this thread are trying to figure out what's going on and help.

networkimprov commented 5 years ago

I regularly use buffered channels as semaphores; they work fine.

If you can show that a buffered channel is vastly less efficient than your barrier implementation in the WinAPI, benchmark both with code that only exercises Open/Wait() or channel get/put, then post your benchmark code to a new issue proposing a new sync type.
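
For the channel side, such a micro-benchmark can be as small as this sketch, which times only the cross-goroutine signal round trip (a Barrier variant would be analogous):

package main

import "testing"

// BenchmarkChanPingPong measures one goroutine waking another through
// unbuffered channels: one round trip per iteration, no workload.
func BenchmarkChanPingPong(b *testing.B) {
    ping, pong := make(chan struct{}), make(chan struct{})
    go func() {
        for range ping {
            pong <- struct{}{}
        }
    }()
    b.ResetTimer()
    for i := 0; i < b.N; i++ {
        ping <- struct{}{} // wake the other goroutine
        <-pong             // wait for it to signal back
    }
    close(ping)
}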

Also you might want to look around for third-party semaphore implementations. There's probably a handful doing native API calls like yours.

StephanVerbeeck commented 5 years ago

This thread is about goroutines not doing as they advertise. Goroutines were supposed to be lightweight alternatives to threads. I have proven that they only SLOW DOWN code and are only usable in combination with your own thread-pooling code written in Go. So, all you smart people, can anybody tell me why I should develop in Go if I need to write all the same low-level logic as in C++? What is the point of having Go if all this low-level stuff still needs to be taken care of by the programmer? Instead of anybody trying out the benchmark, I see just a list of attacks and personal insults. I have seen nothing like this in my 40+ years as a lead developer. !!! Somebody needs to get to work on this issue, and urgently !!!

mvdan commented 5 years ago

@StephanVerbeeck no one is attacking or insulting you. If anything, you are the only one in this thread being aggressive.

I really don't like temporarily locking heated threads, but that will be the only option left if you don't follow https://golang.org/conduct. In particular, there's no need to write in all caps, give instructions with exclamation marks, or refer to others as "all you smart people".

networkimprov commented 5 years ago

To run a lot of CPU-intensive tasks on a few cores, you can either try to juggle them all, at the cost of context switching, or prioritize them, at the cost of importing or writing scheduling logic.

The Go runtime cannot read your mind that you need prioritization; after all, there are endless ways to prioritize things. I regret that you got a different impression (although I'm not to blame for any Go docs :-) But it's not hard to write (or find on GitHub) cross-platform scheduling logic in Go.

@mvdan, I think you can close this issue now.

networkimprov commented 5 years ago

Possibly related: #32398