golang / go

The Go programming language
https://go.dev

time: excessive CPU usage when using Ticker and Sleep #27707

Open lni opened 5 years ago

lni commented 5 years ago

What version of Go are you using (go version)?

go1.11 linux/amd64

Does this issue reproduce with the latest release?

Yes

What operating system and processor architecture are you using (go env)?

GOARCH="amd64" GOBIN="" GOCACHE="/home/lni/.cache/go-build" GOEXE="" GOFLAGS="" GOHOSTARCH="amd64" GOHOSTOS="linux" GOOS="linux" GOPATH="/home/lni/golang_ws" GOPROXY="" GORACE="" GOROOT="/usr/local/go" GOTMPDIR="" GOTOOLDIR="/usr/local/go/pkg/tool/linux_amd64" GCCGO="gccgo" CC="gcc" CXX="g++" CGO_ENABLED="1" GOMOD="" CGO_CFLAGS="-g -O2" CGO_CPPFLAGS="" CGO_CXXFLAGS="-g -O2" CGO_FFLAGS="-g -O2" CGO_LDFLAGS="-g -O2" PKG_CONFIG="pkg-config" GOGCCFLAGS="-fPIC -m64 -pthread -fmessage-length=0 -fdebug-prefix-map=/tmp/go-build440058908=/tmp/go-build -gno-record-gcc-switches"

What did you do?

I need to call a function roughly every millisecond. There is no hard real-time requirement; as long as it is called roughly every millisecond, everything is fine. However, I realised that both time.Ticker and time.Sleep() cause excessive CPU overhead, probably due to runtime scheduling.

The following Go program uses 20-25% CPU as reported by top on Linux.

package main

import (
  "time"
)

func main() {
  ticker := time.NewTicker(time.Millisecond)
  defer ticker.Stop()
  for {
    select {
    case <-ticker.C:
    }
  }
}

A for loop with range over the channel causes similar overhead:

package main

import (
  "time"
)

func main() {
  ticker := time.NewTicker(time.Millisecond)
  defer ticker.Stop()
  for range ticker.C {
  }
}

while the following program shows 10-15% CPU in top:

package main

import (
  "time"
)

func main() {
  for {
    time.Sleep(time.Millisecond)
  }
}

To work around the issue, I had to move the ticker/sleep part to C and let the C code call the Go function that needs to be invoked every millisecond. This ugly cgo-based hack reduced the %CPU reported by top to 2-3%. Please see the proof-of-concept code below.

ticker.h

#ifndef TEST_TICKER_H
#define TEST_TICKER_H

void cticker();

#endif // TEST_TICKER_H

ticker.c

#include <unistd.h>
#include "ticker.h"
#include "_cgo_export.h"

void cticker()
{
  for(int i = 0; i < 30000; i++)
  {
    usleep(1000);
    Gotask();
  }
}

ticker.go

package main

/*
#include "ticker.h"
*/
import "C"
import (
  "log"
)

var (
  counter uint64 = 0
)

//export Gotask
func Gotask() {
  counter++
}

func main() {
  C.cticker()
  log.Printf("Gotask called %d times", counter)
}

What did you expect to see?

Much lower CPU overhead when using time.Ticker or time.Sleep()

What did you see instead?

20-25% CPU in top

agnivade commented 5 years ago

I am unable to reproduce this. Both programs show 5-6% CPU for me, and I have 4 CPUs. When you say 20-25%, how many CPUs do you have?

lni commented 5 years ago

As described, all reported figures are the per-process %CPU values reported by top.

I tried two servers, one with dual E5-2690 v2 CPUs (2 × 10 cores with HT) and another with a single E5-2696 v4 (1 × 22 cores with HT). Other people have confirmed high CPU usage on their macOS machines as well; see below.

https://groups.google.com/forum/#!topic/golang-nuts/_XEuq69kUOY

djadala commented 5 years ago

Hi, same here. Linux 4.9.0-8-amd64 #1 SMP Debian 4.9.110-3+deb9u4 (2018-08-21) x86_64 GNU/Linux, 4x AMD Phenom(tm) II X4 965 Processor.

top:

Tasks: 362 total,   1 running, 361 sleeping,   0 stopped,   0 zombie
%Cpu(s):  0.8 us,  0.4 sy,  0.0 ni, 98.8 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
KiB Mem :  8142260 total,   607488 free,  2363780 used,  5170992 buff/cache
KiB Swap:  4194300 total,  4144460 free,    49840 used.  5385516 avail Mem 

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND                                                   
 1689 djadala   20   0    2476    708    520 S  22.8  0.0   0:01.90 ticker1                                                   

go version: go version go1.10.3 linux/amd64

ticker1.go:

package main

import (
    "time"
)

func main() {
    ticker := time.NewTicker(time.Millisecond)
    defer ticker.Stop()
    for range ticker.C {
    }
}

bcmills commented 5 years ago

Dup of #25471?

mbyio commented 5 years ago

Percentage of CPU use is not very specific. Can you describe how it is actually affecting you? Is it preventing the program from working correctly somehow? Go is not necessarily specifically optimized for minimal CPU usage.

bcmills commented 5 years ago

Can you describe how it is actually affecting you? Is it preventing the program from working correctly somehow?

On a laptop or other mobile device, percentage of CPU is more-or-less equivalent to battery lifetime. We should not waste users' battery capacity needlessly.

(Using a lot more CPU than expected in a real program is a problem in and of itself: it doesn't need extra justification.)

choleraehyq commented 5 years ago

@bcmills I'd like to implement time.Ticker with timerfd to solve this issue, but the timerfd_* APIs were only merged in kernel 2.6.24, and our minimum kernel requirement is 2.6.23. When can we drop support of kernel 2.6.23?
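
(For illustration, a timerfd-based ticker lets the kernel drive the periodic wakeups while a goroutine blocks reading the descriptor. A minimal sketch, assuming the TimerfdCreate/TimerfdSettime wrappers from golang.org/x/sys/unix; the blocking read still occupies an OS thread, so this only sketches the idea, not a runtime design:)

package main

import (
    "log"
    "time"

    "golang.org/x/sys/unix"
)

func main() {
    // The kernel fires the timer; no Go runtime timer machinery is involved.
    fd, err := unix.TimerfdCreate(unix.CLOCK_MONOTONIC, 0)
    if err != nil {
        log.Fatal(err)
    }
    defer unix.Close(fd)

    // Arm the timer to fire every millisecond.
    interval := unix.Timespec{Nsec: int64(time.Millisecond)}
    spec := unix.ItimerSpec{Interval: interval, Value: interval}
    if err := unix.TimerfdSettime(fd, 0, &spec, nil); err != nil {
        log.Fatal(err)
    }

    buf := make([]byte, 8)
    for i := 0; i < 1000; i++ {
        // Each read blocks until the next expiration; the 8 bytes hold
        // the number of expirations since the last read.
        if _, err := unix.Read(fd, buf); err != nil {
            log.Fatal(err)
        }
    }
}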

bcmills commented 5 years ago

When can we drop support of kernel 2.6.23?

For that, you should probably file a proposal.

ianlancetaylor commented 5 years ago

@choleraehyq Can you sketch out how using the timerfd functions will help? It's not immediately obvious to me. Thanks.

gmox commented 5 years ago

I can confirm this behavior. It looks as if NewTicker is leaving futex locks open, and that drives up CPU/memory usage. Using After does not result in increased CPU/memory usage.
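
(The measurement code was not posted; a time.After-based loop of the sort described might look like the following. Note that each iteration allocates a fresh one-shot timer, so it is not obviously cheaper than a Ticker:)

package main

import "time"

func main() {
    for {
        // time.After creates a new timer on every iteration.
        <-time.After(time.Millisecond)
    }
}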

djadala commented 5 years ago

Hi,

Using After does not result in increased CPU/memory usage

Please include the code you use to measure. Here is my code; sleep and timer use 10% CPU, ticker uses 20%:

package main

import (
    "flag"
    "fmt"
    "time"
)

func ticker() {

    ticker := time.NewTicker(time.Millisecond)
    defer ticker.Stop()
    for range ticker.C {
    }
}

func timer() {
    t := time.NewTimer(time.Millisecond)
    for range t.C {
        t.Reset(time.Millisecond)
    }
}

func sleep() {
    for {
        time.Sleep(time.Millisecond)
    }
}

func main() {
    t := flag.String("t", "timer", "use timer")
    flag.Parse()
    switch *t {
    case "timer":
        timer()
    case "ticker":
        ticker()
    case "sleep":
        sleep()
    default:
        fmt.Println("use  timer, ticker or sleep")
    }
}

I think 10% is also high CPU usage.

lni commented 5 years ago

I tried djadala's program above on an Amazon EC2 a1.medium node with one ARM Cortex-A72 core, running Ubuntu 18.04 LTS with a 4.15.0-1028-aws kernel and Go 1.11.2 for arm64. Interestingly, the CPU usage is only around 3% for ticker, timer, and sleep.

djadala commented 5 years ago

For reference, I included syscall and cgo variants; cgo uses 2% CPU, and syscall uses 12%:

package main

import (
    "flag"
    "fmt"
    "time"

    "syscall"
)

// #include <unistd.h>
import "C"

func usleep() {
    for {
        C.usleep(1000)
    }
}

func sysns() {
    for {
        v := syscall.Timespec{
            Sec:  0,
            Nsec: 1000, // nanoseconds: this requests a 1 µs sleep, not 1 ms
        }
        syscall.Nanosleep(&v, &v)
    }
}

func ticker() {

    ticker := time.NewTicker(time.Millisecond)
    defer ticker.Stop()
    for range ticker.C {
    }
}

func timer() {
    t := time.NewTimer(time.Millisecond)
    for range t.C {
        t.Reset(time.Millisecond)
    }
}

func sleep() {
    for {
        time.Sleep(time.Millisecond)
    }
}

func main() {
    t := flag.String("t", "timer", "use timer")
    flag.Parse()
    switch *t {
    case "timer":
        timer()
    case "ticker":
        ticker()
    case "sleep":
        sleep()
    case "cgo":
        usleep()
    case "sys":
        sysns()
    default:
        fmt.Println("use  timer, ticker, sys, cgo or sleep")
    }
}
robaho commented 5 years ago

I did some testing on this, comparing Go, C, and Java. The Go version uses time.Sleep(), the C version uses usleep(), and the Java version uses LockSupport.parkNanos(). Interestingly, Java and Go use the same "timeout on a semaphore" approach in their implementations. The following shows CPU utilization (on OSX) for each sleep interval:

System   1 us   100 us   1 ms
Go       130%   23.0%    4.6%
Java      67%    6.0%    1.5%
C         42%    0.6%    0.0%

I believe the performance issue in Go is due to multiple threads being involved (the timerproc thread and the user goroutine), causing a lot of overhead.

The simplest solution, although it doesn't work for channel select timeouts, would be to have short-duration time.Sleep() calls just invoke usleep(). There are probably very few goroutines using timers of this duration, so the overhead of dedicating and blocking a thread shouldn't be that great. I would expect the resulting runtime to be superior to Java and on par with C.

In fact, it is doubtful that a sleep request for less than 10 us should sleep at all - just call runtime.Gosched(), since the overhead of a context switch is probably longer than the requested sleep.
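
(A user-level approximation of this suggestion on Linux might look like the hypothetical shortSleep helper below; the 10 µs and 1 ms cutoffs are assumed, and this sketches the idea rather than an actual runtime change:)

package main

import (
    "runtime"
    "syscall"
    "time"
)

// shortSleep yields for tiny waits, parks the OS thread in nanosleep for
// short waits, and falls back to the normal timer path otherwise.
func shortSleep(d time.Duration) {
    switch {
    case d < 10*time.Microsecond:
        // A context switch likely costs more than the requested sleep.
        runtime.Gosched()
    case d < time.Millisecond: // assumed cutoff
        ts := syscall.NsecToTimespec(int64(d))
        syscall.Nanosleep(&ts, nil)
    default:
        time.Sleep(d)
    }
}

func main() {
    for i := 0; i < 10000; i++ {
        shortSleep(100 * time.Microsecond)
    }
}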

robaho commented 5 years ago

I did some more testing by writing a CGo function USleep that calls C's usleep directly.

System   1 us   100 us   1 ms
CGo       50%    5.3%    1.3%

So this appears to be a viable alternative for polling with a short interval and a low polling-routine count.

ls0f commented 5 years ago

Same issue here. I hope the Go team can reduce the CPU cost of tickers on Linux.

ianlancetaylor commented 5 years ago

@ls0f Please see https://golang.org/wiki/NoPlusOne . Thanks.

networkimprov commented 5 years ago

@ianlancetaylor maybe this could be assigned to someone?

ianlancetaylor commented 5 years ago

I can't tell anybody to do anything. I can assign it to someone but that doesn't mean anything will happen.

Rereading this I would like to double-check that there is still something to do here. @dvyukov sped up the timer code quite a bit in the 1.12 release while working on #25729. To what extent is this still a problem?

dvyukov commented 5 years ago

We still handle timers on a separate thread, so there are roughly four excess thread context switches per timer. Timers need to be part of the scheduler/netpoll, i.e. call epoll_wait with the next timer's timeout.
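
(A toy illustration of that direction, using the epoll_wait timeout as the timer, with golang.org/x/sys/unix; this is not the runtime's code, and epoll_wait's millisecond timeout granularity is a real limitation:)

package main

import (
    "log"
    "time"

    "golang.org/x/sys/unix"
)

func main() {
    epfd, err := unix.EpollCreate1(0)
    if err != nil {
        log.Fatal(err)
    }
    defer unix.Close(epfd)

    deadline := time.Now().Add(time.Millisecond)
    events := make([]unix.EpollEvent, 8)
    for fired := 0; fired < 1000; {
        // One blocking call serves both I/O readiness and the next timer
        // deadline, so no separate timer thread is needed.
        msec := int(time.Until(deadline) / time.Millisecond)
        if msec < 0 {
            msec = 0
        }
        n, err := unix.EpollWait(epfd, events, msec)
        if err != nil && err != unix.EINTR {
            log.Fatal(err)
        }
        if n == 0 { // timed out: the "timer" fired
            deadline = deadline.Add(time.Millisecond)
            fired++
        }
    }
}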

ianlancetaylor commented 5 years ago

I wonder if it would make sense to say that if we call time.Sleep for a time <= 10 milliseconds, and the number of active goroutines is <= GOMAXPROCS, then we should just sleep rather than going through the timer thread. That might handle microbenchmarks like this, including client programs, without compromising servers. (I chose 10 milliseconds based on sysmon).

robaho commented 5 years ago

That was my proposal - I think it is a good one :) but I think 10 ms might be too high. Although I could see reasonable per platform defaults with an environment variable to override for special cases.

ChrisHines commented 5 years ago

This issue sounds similar to what I described in #28808. In that issue a lot of the unexpected CPU load was due to the work stealing code in the scheduler after goroutines go to sleep.

It would be nice if we could reduce that CPU load when there are short lived timers pending.

I wonder if it would make sense to say that if we call time.Sleep for a time <= 10 milliseconds, and the number of active goroutines is <= GOMAXPROCS, then we should just sleep rather than going through the timer thread.

What do you mean by "active goroutines <= GOMAXPROCS"? Does active mean "existing" or does it mean "in a runnable state"?

dvyukov commented 5 years ago

Am I reading this right that we want to both complicate the runtime and not improve real programs, and to mask problems of real programs in microbenchmarks so that it becomes harder to analyze the performance of real programs? :) And in the end it may actually negatively affect real programs. Consider a server that does not utilize all CPUs at 100% (which they usually do): we eat several threads for sleeping, then there is a spike of load, but the threads are wasted on sleeping.

robaho commented 5 years ago

No, you are reading it wrong. For short-duration timers it is more efficient to call usleep than to use the multi-thread park/notify mechanism currently employed.

It is not a hack to make microbenchmarks perform better.

ChrisHines commented 5 years ago

Would usleep act like a syscall and consume an OS thread (an M in scheduler parlance) for each short sleep? If so, it seems like that would not behave nicely if there are many goroutines calling short sleeps concurrently as in the program described in #28808.

robaho commented 5 years ago

I tried to think of a use case where that would happen and couldn't... I think there would be at most a handful - think high-frequency audio or I/O timers.

dvyukov commented 5 years ago

No you are reading it wrong. For short duration timers it is more efficient to call usleep rather than using the multi thread park notify currently used. It is not a hack to make micro benchmarks perform better.

But we are losing a whole P in this case and not doing useful work on it. I believe it will be faster in a synthetic benchmark, but we are wasting resources.

ChrisHines commented 5 years ago

Our use case is described in general terms in the issue I linked. It is real and we're doing it at a large scale, meaning, ~3500 timers averaging a 2.4ms period on a 56 CPU box (and many of those boxes). At present we have to work around behaviors of the Go runtime in various ways to keep CPU load under control.

Losing a whole P as @dvyukov points out is a non-starter for us.

robaho commented 5 years ago

You don’t do it that way. You have multiple timer threads based on NumProcs and call your handlers directly.

It’s custom code, but you certainly don’t have a general case. You have 62 timers per core, each at 500 events per sec, so 312k events per second... at 312k events per second on a core you are not doing much; 3 µs is not even a system call.

ChrisHines commented 5 years ago

@robaho Yes, that strategy is one we've been pursuing, but it's not as simple as it sounds. Taken to conclusion it amounts to implementing a custom scheduler on top of the Go scheduler. I don't think this issue discussion is an appropriate place to design such a thing, so let's just leave that line of thought at that.

In the interest of staying focused on the Go runtime here, I think it is fair to say that it would be nice if the Go scheduler could handle some of these use cases better. Then projects like ours wouldn't have to write a custom scheduler to work around the existing behavior.

robaho commented 5 years ago

I only meant that there are always design trade offs. I think the general case (low latency timer) can be improved in Go, and should be, but I don’t think it would solve your particular problem.

ChrisHines commented 5 years ago

Fair enough, @robaho . Getting back to your usleep suggestion, if it did block the P and a program had enough goroutines doing short sleeps to block all the available Ps then the whole program would be unable to make progress on other non-sleeping goroutines during that time. I don't think we want that to happen.

robaho commented 5 years ago

I am pretty sure it doesn’t work that way. The runtime will always create a new thread if all others are blocked, regardless of GOMAXPROCS.

robaho commented 5 years ago

One other thing: I think you would quickly reach a steady state of Ps anyway. The intervals are so short that, as long as the goroutine makes a function call, it would be descheduled anyway, meaning there would not be many goroutines in the sleep at any given moment.

But I think the most appropriate implementation would be to use usleep (when below an interval threshold) up to some number of blocking threads, and then switch to the current multiplexing mode.

dvyukov commented 5 years ago

I am pretty sure it doesn’t work that way.

But then why/how would it be faster than the current code? The thread switches required to pass the P across threads seem effectively the same as what we have now.

robaho commented 5 years ago

It would be better because often very few of these timers are needed, and the handling code can be local. I think the CPU overhead shown in the earlier posts explains why it would be beneficial in this case.

dvyukov commented 5 years ago

I fail to understand the difference. We already do a sleep. I see how it would be different if we took a P, but you say we need to allow the runtime to continue running Go code on the P. And then it does not seem different from what we have now.

robaho commented 5 years ago

I think if you review the earlier posts with the timings, you will see the overhead difference between using usleep and the current runtime method for small-interval sleeps. I'm not sure how else to explain it.

dvyukov commented 5 years ago

Okay, those are pretty synthetic conditions, where the sysmon thread decides not to retake the P and cools down as a result because there is no other work to do. Exactly what predicate do you propose to use? How do we detect the "few timers" scenario and ensure that we don't do any harm (there is lots of potential for it)? And this is merely a stop-gap before the proper solution that removes the thread switches, right?

robaho commented 5 years ago

The predicate can be trivial:

if sleeptime < threshold && atomic_inc(sleepers) < nprocs/2 {
    usleep(sleeptime)
    atomic_dec(sleepers)
} else {
    // do what we already do ...
}

I don’t think you need to overcomplicate it. And it is not synthetic: the current timer technique is not efficient for small-interval polling timers compared to what is available on other platforms.
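
(A Go rendering of that predicate; the names sleepers and maybeDirectSleep and the threshold value are illustrative, and nanosleep stands in for whatever direct OS sleep the runtime would actually use:)

package main

import (
    "runtime"
    "sync/atomic"
    "syscall"
    "time"
)

var sleepers int64

func maybeDirectSleep(d time.Duration) {
    const threshold = time.Millisecond // assumed cutoff
    if d < threshold {
        if atomic.AddInt64(&sleepers, 1) <= int64(runtime.GOMAXPROCS(0))/2 {
            // Park the OS thread directly for the short sleep.
            ts := syscall.NsecToTimespec(int64(d))
            syscall.Nanosleep(&ts, nil)
            atomic.AddInt64(&sleepers, -1)
            return
        }
        // Cap reached: undo the increment and fall back.
        atomic.AddInt64(&sleepers, -1)
    }
    time.Sleep(d) // the normal timer path
}

func main() {
    for i := 0; i < 1000; i++ {
        maybeDirectSleep(500 * time.Microsecond)
    }
}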

networkimprov commented 5 years ago

And there's a new runtime pkg API to support a configurable threshold, e.g. your "environment variable to override for special cases".

robaho commented 5 years ago

Are all environment variables the runtime uses today also available programmatically? I don’t think that is the case. I’m not even certain it needs to be configurable, as the runtime can determine the context-switch latency and sleep resolution and calculate the threshold from those. The environment variable is optional, as there is no real downside to it, and it can already be configured in a way via GOMAXPROCS.

I still find it pretty funny, the level of argument over a four-line change in the face of hard numbers and reasoning.

There may be better long-term ways to handle this, but that shouldn’t prevent immediate fixes.

Regardless, it’s not that big of a deal, since users can just use cgo to call usleep today.

networkimprov commented 5 years ago

Ah yes, a new env variable would suffice. I think it's advisable to include that.

IMO, "hard numbers" result from empirical study and all we have is simple benchmarks.

The issue tracker sees a ton of traffic; discerning which of it is soundly reasoned can be a challenge :-)

networkimprov commented 5 years ago

Sounds like we have a consensus? Can we tag this NeedsFix?

ianlancetaylor commented 5 years ago

I'm not sure we have a consensus here.

networkimprov commented 5 years ago

Could we try the usleep() approach as an experiment during this cycle, and see if any issues appear?

gopherbot commented 5 years ago

Change https://golang.org/cl/169975 mentions this issue: runtime: add new runtimer function

gopherbot commented 5 years ago

Change https://golang.org/cl/169973 mentions this issue: runtime: implement new movetimers function

gopherbot commented 5 years ago

Change https://golang.org/cl/169972 mentions this issue: runtime: add new cleantimers function