runtime: >200ms GC latency in test programs (calling runtime.GC?)

genez commented 8 years ago

What version of Go are you using (go version)? go version go1.6.2 windows/amd64
What operating system and processor architecture are you using (go env)? set GOARCH=amd64 set GOBIN= set GOEXE=.exe set GOHOSTARCH=amd64 set GOHOSTOS=windows set GOOS=windows set GOPATH=C:\dev\Go set GORACE= set GOROOT=C:\Go set GOTOOLDIR=C:\Go\pkg\tool\windows_amd64 set GO15VENDOREXPERIMENT=1 set CC=gcc set GOGCCFLAGS=-m64 -mthreads -fmessage-length=0 set CXX=g++ set CGO_ENABLED=1
What did you do? Hello, I am writing a POC software that should replace a currently running C application. The server part loads a SQLITE db3 file in memory (using maps) and then waits for text commands on UDP socket.

The client reads a text file with commands, sends them to the server and waits for a response. The client takes measurement on server reply time.

My tests were taken on the same machine (localhost) to avoid network delays

While running the client, response time jumps from 3 milliseconds to more than 200 milliseconds and in the same moment the server displays this line (I am running it with SET GODEBUG=gctrace=1)

gc 14 @18.400s 2%: 0.50+380+1.0 ms clock, 2.0+294/380/266+4.0 ms cpu, 344->344->182 MB, 359 MB goal, 4 P

I think that 200ms STW is a bit too much and would like to know how to fix this beahvior I have a dropbox folder with server and client sources, and corresponding SQLITE DB and command text file (https://www.dropbox.com/sh/e2timsyjifvp61i/AAAmDoKSP7NANjDlsYcGcBvGa?dl=0)

the client must be executed with 2 parameters:

serverip:port (4420)
sendername (L1)

My concern is not the throughput but the latency Our application drives an automated machine that must have the response back in 10ms

genez commented 8 years ago

Maybe related to this issue: https://github.com/golang/go/issues/9477

genez commented 8 years ago

Just changing the key of the maps from string to int drop the GC time to 5 milliseconds

RLH commented 8 years ago

First kudo's for taking end to end latencies instead of just focusing on GC latencies. Did the end to end latency also drop when you changed the map keys?. Could you send up the complete before and after gctrace? How many goroutines does the application have and what is the nature of the goroutines work.

When reading the gctrace line you sent. The 2% is the overall time spend in GC. the 0.50 and the 1.0ms are the STW times. The 380ms is the time spent doing the concurrent mark. During the concurrent mark we have seen the GC starve the mutators (your application) in certain situations and fixed several things in this area for 1.7. You might want to try tip to see if this will be a problem going forward.

genez commented 8 years ago

Sorry but the test with int keys was not real I was failing to convert to int and then adding every item at key 0

I tried with [20]byte keys but I have the same issue

C:\dev\ETSGO\etsgo>etsgo gc 1 @0.046s 2%: 1.1+1.8+0.75 ms clock, 2.3+0/1.8/0.80+1.5 ms cpu, 4->4->2 MB, 5 MB goal, 4 P gc 2 @0.066s 4%: 0+3.0+1.2 ms clock, 0+1.0/1.0/1.0+4.8 ms cpu, 4->4->2 MB, 5 MB goal, 4 P gc 3 @0.085s 5%: 0+13+1.0 ms clock, 0+3.0/2.0/1.0+4.0 ms cpu, 5->7->6 MB, 6 MB goal, 4 P gc 4 @0.131s 4%: 0+3.0+0 ms clock, 0+0/0/3.0+0 ms cpu, 7->7->4 MB, 9 MB goal, 4 P gc 5 @0.171s 3%: 0+7.0+0 ms clock, 0+4.0/3.0/9.0+0 ms cpu, 10->10->9 MB, 11 MB goal, 4 P gc 6 @0.179s 5%: 0+7.0+1.0 ms clock, 0+3.0/7.0/0+4.3 ms cpu, 13->13->13 MB, 18 MB goal, 4 P gc 7 @0.304s 5%: 0+19+0 ms clock, 0+19/9.0/10+0 ms cpu, 27->27->18 MB, 28 MB goal, 4 P gc 8 @0.326s 6%: 0+11+0.50 ms clock, 0+7.5/11/9.0+2.0 ms cpu, 26->26->26 MB, 37 MB goal, 4 P gc 9 @0.586s 5%: 0+28+0 ms clock, 0+28/15/14+0 ms cpu, 54->54->37 MB, 55 MB goal, 4 P gc 10 @0.698s 6%: 0+66+1.0 ms clock, 0+0/57/9.0+4.0 ms cpu, 55->56->51 MB, 75 MB goal, 4 P gc 11 @1.191s 5%: 0+65+1.0 ms clock, 0+65/34/31+4.0 ms cpu, 104->104->75 MB, 105 MB goal, 4 P gc 12 @1.576s 4%: 0+111+1.0 ms clock, 0+0/4.5/109+4.0 ms cpu, 117->118->72 MB, 150 MB goal, 4 P gc 13 @2.396s 4%: 0+124+0.50 ms clock, 0+124/57/66+2.0 ms cpu, 173->173->150 MB, 174 MB goal, 4 P gc 14 @2.532s 6%: 0+178+0.54 ms clock, 0+62/178/0+2.1 ms cpu, 212->214->213 MB, 300 MB goal, 4 P gc 15 @4.635s 6%: 0.49+316+1.0 ms clock, 1.9+316/172/143+4.0 ms cpu, 437->437->300 MB, 438 MB goal, 4 P gc 16 @4.978s 8%: 0+553+0.47 ms clock, 0+115/553/0+1.9 ms cpu, 424->430->426 MB, 601 MB goal, 4 P Ready. Loaded 1002662 serialsgc 17 @6.215s 14%: 0+0+303 ms clock, 0+115/553/0+1215 ms cpu, 452->452->283 MB, 452 MB goal, 4 P (forced) GC took 304.8374ms GC forced gc 18 @126.525s 0%: 0+525+0.53 ms clock, 0+0/525/3.0+2.1 ms cpu, 283->283->283 MB, 566 MB goal, 4 P scvg0: inuse: 335, idle: 227, sys: 563, released: 0, consumed: 563 (MB)

genez commented 8 years ago

I am going to try with Go tip tomorrow and let you know the results

genez commented 8 years ago

tested with Go tip

go version devel +2168f2a Thu May 26 16:27:24 2016 +0000 windows/amd64

same results (client side): min: 0.451400 | avg: 0.873292 | max: 218.594800

I am attaching gctrace file

thanks gctrace.txt

ianlancetaylor commented 8 years ago

CC @aclements

RLH commented 8 years ago

By client side I'm assuming we are talking about end to end, which is what matters. I'm also assuming the trace is from the server and the numbers are in ms.

From the gctrace I notice this line

gc 17 @5.964s 10%: 0+0+235 ms clock, 0+275/166/108+942 ms cpu, 456->456->283 MB, 456 MB goal, 4 P (forced)

indicating a 235ms STW pause. The "(forced)" annotation means that GC 17 is the result of a call to runtime.GC() which does a STW collection instead of a concurrent GC. Removing the runtime.GC() call from server code will eliminate that STW pause which may be the problem. If the client pauses could be matched with the server GC pauses that would help confirm it is a GC issue. It says "Ready" just before the call to runtime.GC(), what does that mean?

From the trace it also looks like the code spends 5 seconds populating the in-(go)heap database. This can be deduced by looking at the heap sizes which reach 456MB at GC 17 and then the server goes into a steady state where no GC activity happens between 5.96 seconds and 34 .93 seconds.

Let us know if removing the runtime.GC() call helps and if the client pauses are coordinated with the server GC times. Finally if it is easy to calculate 95, 99 and 99.9 percentiles on the client that would also be interesting.

genez commented 8 years ago

Let me clarify a bit more what's happening at client side:

testclient is run with GOGC=off just to make sure that the delay is not related to client GC
server is started with GOGC=100 and GODEBUG=gctrace=1.
when server starts up, it reads 1 million "objects" from a SQLITE db into 5 distinct maps (map[[20]byte]FullCode) One map is "global" with all codes, the other maps are a subset (distinct by Type)

objects are structured extactly like this:

type FullCode struct { Type int Code string Status int Children map[[20]byte]FullCode }

after startup, my server now performs a manual GC

func timeGC() time.Duration { start := time.Now() runtime.GC() return time.Since(start) }

At this stage, testclient program starts and begins sending UDP requests to the server. after a few seconds I can see that cilent detects RTT > 200ms and exactly at the same time the server prints out a GC trace line.

The trace line I am talking about is this

gc 18 @126.525s 0%: 0+525+0.53 ms clock, 0+0/525/3.0+2.1 ms cpu, 283->283->283 MB, 566 MB goal, 4 P

I can record a video and upload it to youtube to better explain the situation if it's ok for you

aclements commented 8 years ago

This is almost certainly #10345 (which is a generalization of #9477), given the map with 1 million objects. @genez, would you be able to test a small patch to the runtime? There's an easy way to confirm or deny the connection with #10345 if we can tweak the runtime.

dlsniper commented 8 years ago

Hi @RLH and @aclements I've tested this out for a bit yesterday as it came out on slack. The issue doesn't happen with 200k or 300k items in the maps. As soon as this gets bigger then during the test there will be a pause at some moment when the GC runs automatically. With lower count of items, the pause varies from 1-2 ms to 25 ms but it's within ok limits. For me it looks like #9477 .

Since the maps in this case are allocated once and the elements are not deleted then would it could we maybe get a way to track if the state is changed between GC runs so that this doesn't need to do anything during the STW phase? Something like: the GC computes the hash of the map while running the the parallel phase and in the STW phase it compares the hash from the previous one with the current one and if they are different they then it doesn't do anything for it?

genez commented 8 years ago

@aclements yes I can test the patch, just let me know how to apply it

Thanks

aclements commented 8 years ago

@dlsniper, just to be clear on the terminology, this issue isn't about the STW phase, it's about mutator availability. According to the gctrace, the STW phases are all sub-millisecond. If this is #9477, the problem is that the process of scanning a single object, even during concurrent marking, is currently non-interruptible on the particular thread it's running on. Other threads will continue to make progress, though they might also get caught up scanning large single objects. Of course, this is still a GC-caused latency problem; it's just not a problem in the STW phase. :)

It's not safe to exempt the map from GC. You would also have to exempt everything reachable from that map.

@genez, great. I'll cook up a patch.

RLH commented 8 years ago

The gctrace line

gc 18 @126.525s 0%: 0+525+0.53 ms clock, 0+0/525/3.0+2.1 ms cpu, 283->283->283 MB, 566 MB goal, 4 P

says that the concurrent mark phase takes 525 + 0.52 ms (wall) clock and uses 0/525/3 + 2.1 ms CPU clock. Since we have 4 P (cpus) then it looks like the concurrent mark phase is using exactly 1/4 of the CPUs and the STW mark termination phase is using all of the CPUs but for only .52 (wall) ms or .52*4 (2.1) ms CPU time. This would seem to leave the expected 75% of the total CPU for the mutator during the concurrent mark phase. All this matches up with what we expect to see.

How many requests are in flight when the GC is being run and are all of them delayed? How many are in flight when the GC is not running? Are requests handled FIFO that might result in a convoy of requests? Even if a requests gets backed up behind a long mark involving a map other goroutines should make progress since there is 75% CPU availability. If other goroutines / requests are making progress then perhaps this isn't a global mutator availability problem but rather a problem with the GC conspiring with the scheduler and probability to starve one of the goroutines.

aclements commented 8 years ago

@genez, the test patch is https://go-review.googlesource.com/c/23512/. To pull it in to your local Go clone (which I assume you have, since you reported the issue still happens on tip), run git fetch https://go.googlesource.com/go refs/changes/12/23512/1 && git checkout FETCH_HEAD. Run it with GODEBUG=gctrace=1 and it should print "long scanobject" messages with some useful details when things go wrong.

genez commented 8 years ago

I applied the patch and recompiled Go tip

Response time measured from the client application (after 1 million requests) has changed dramatically: min: 0.918100 | avg: 1.007302 | max: 73.195100

average and minimum are now a bit higher than with previous runtime but the max is less than 30% a big improvement!!

please check attached execution log with gctrace enabled gctrace_with_patch.txt

regarding the questions from @RLH

The logic is strictly sequential: request-response no parallelism at the moment the server application does handle the incoming request in a goroutine but there's only 1 active request at a time if you only start 1 client

I hope this answers all of your questions

Please remember that in my dropbox shared folder you can find server and client source code + resources file needed to run the applications Maybe they can be of some value for you

aclements commented 8 years ago

@genez, I have another patch for you to try if you don't mind: https://go-review.googlesource.com/23540 (It's probably easier for you to try, since I don't have your code set up; if this doesn't help I'll look into locally reproducing this with the code you posted.) Like before, use git fetch https://go.googlesource.com/go refs/changes/40/23540/2 && git checkout FETCH_HEAD to pull it into your local Go clone.

genez commented 8 years ago

Just tried latest patch: "HEAD is now at 6632631... runtime: bound scanobject to ~100 µs"

client side max jumped after a few seconds from 2ms to 52ms: (two susbequent lines of client log) min: 0.464200 | avg: 0.867400 | max: 2.005700 min: 0.464200 | avg: 0.870495 | max: 52.144300

server printed this line: gc 21 @28.678s 1%: 1.0+108+0 ms clock, 4.0+0/53/54+0 ms cpu, 344->344->187 MB, 374 MB goal, 4 P

You can find attached gctrace file gctrace_with_patch_2.txt

Regarding my source, I was only suggesting to look at the code. I don't mind testing your patches, as this is a very important task for me

Thank you

genez commented 8 years ago

Sorry guys. I'm afraid I was testing against a modified server source with a simplified FullCode struct it didn't contain one member, the Code string, it was commented out for one of my tests.

I have just re-run the test with patch 2 with the correct server source (almost identical to original source, just with manual GC collection and loaded items count)

client still shows >200ms latency min: 0.463700 | avg: 0.879956 | max: 225.606500

server shows gc 18 @38.937s 1%: 1.0+447+2.0 ms clock, 4.0+0/232/221+8.0 ms cpu, 494->496->284 MB, 566 MB goal, 4 P

I am terribly sorry for the mistake I am going to make a copy of the sources just to be sure not to do the same error again

This shows a big difference in timing if I include a string member in my struct, or not. Maybe this can help you with your investigations

please check updated gctrace log gctrace_with_patch_2_correct_source.txt

genez commented 8 years ago

I added a short video demonstrating the issue to my dropbox shared folder that contain sources and resrouce files:

https://www.dropbox.com/sh/e2timsyjifvp61i/AAAmDoKSP7NANjDlsYcGcBvGa?dl=0

(server is slowed down by the screen capture software) Maybe it can be of some help

aclements commented 8 years ago

@genez, thanks for trying the second patch. Could you try another run with GOMAXPROCS=3? You might also need patch 2.

It's possible this is a scheduler issue where the one P that's running your one goroutine becomes a dedicated mark worker when GC starts and your goroutine winds up stuck on the run queue for the whole GC cycle. GOMAXPROCS=3 will eliminate that dedicated mark worker.

genez commented 8 years ago

Thank you @aclements Just tested with GOMAXPROCS=3

Client side: no effects I hope the gctrace file will show you something useful gctrace_GOMAXPROC_3.txt

aclements commented 8 years ago

Interesting. So, I think I've reproduced the problem in a small(ish) test case. The trace of that test case does confirm my hypothesis that the garbage collector is taking over the one P that's running user code and for some reason the user goroutine isn't getting moved or stolen. I thought GOMAXPROCS=3 would mitigate this, but it turns out it just changes its nature: the fractional worker only gets 75% of a P, but that's still enough to starve the user goroutine on that P for long stretches. The trace also confirms another thing the gctraces are showing: we're often not spinning up idle mark workers on the idle Ps.

genez commented 8 years ago

ok please let me know if I can help you in any way

gianpaolog commented 8 years ago

@aclements I am very interested in the solution of this issue, do you think it could be solved as of 1.7?

aclements commented 8 years ago

@genez, what happens if you run with GOMAXPROCS=1?

@gianpaolog, unfortunately the issue involved is quite subtle and despite understanding it (I think), I'm not actually sure how to fix it, which suggests any fix would be too risky for this point in the 1.7 development cycle. However, from what I understand, it affects applications with very few goroutines the most (e.g., one active goroutine at a time) and I believe running with GOMAXPROCS=1 should work around the issue without significantly affecting performance (assuming the application doesn't go through phases where there are many active goroutines).

genez commented 8 years ago

@aclements: I have just fetched gotip sources from scratch and re-applied your two patches as per your instructions. running the server with GOMAXPROCS=1 dramatically improves max latency, that now tops to 43ms after 3 millions requests (with a mean value of 0.63ms). unfortunately still not enough for my requirements of 10ms max latency.

regarding your assumptions, you're right: I will eventually end up having more less 10 active clients in a real world scenario, so we are talking about a very limited number of goroutines. also, throughput is not my main concern (but latency is)

gianpaolog commented 8 years ago

Waiting for a 1.7.1 could work for me, but I would like to understand if you think it is feasible to solve the bug and provide a latency of a few ms (the expectation from the talks is actually around 3-5ms, in the scenario of less than 2GB of memory managed by GC).

aclements commented 8 years ago

@genez, thanks for trying that out. In general, it's more useful to know high percentile latency rather than absolute max latency. Unless you're using a system that's hard real time from top to bottom, including the OS kernel, there are a lot of things that can cause occasional high latency. Do you know what your, say, 99%ile or 99.9%ile latency is?

@gianpaolog, based on @genez's report, it sounds like setting GOMAXPROCS=1 is a viable workaround for this issue. We generally only do point releases for crash bugs or serious performance issues that can't be worked around, so this wouldn't qualify.

This particular issue is at an unfortunate cross-roads: our sub-10ms design goal has been for global STW time; so far we've focused much less on the impact to latency of individual goroutines where other goroutines can continue to execute in parallel. However, if an application only has one goroutine, then impacting a single goroutine's latency might as well be a global stop, which is what we're seeing in this issue. With 1.7, we've focused on rooting out the last sources of STW latency, so now that that should no longer be a primary source of latency, we're going to turn out attention to other sources of latency for the 1.8 cycle, including scheduler-induced latency like in this issue.

genez commented 8 years ago

@aclements this is only a test, the real software is going to run in a soft-realtime environment (including a PREEMPT_RT patched kernel, no paging, no disk access, and other well known best practices for soft realtime systems)

Anyway I would not focus on this, as I implemented the very same server in other languages (including C/C++, Python and Java) and then tested them in the same conditions (with the same Go client) For example Java scored a max latency of 37ms after 3 million requests (but with a considerably higher mean value of 1.50ms) Python does a better job with a 17ms max latency From the beginning I was expecting Go to outperform the other implementations and score something near to the C/C++ implementation (5ms max latency)

I am now going to improve the test client to plot percentiles and will let you know the results

gianpaolog commented 8 years ago

@aclements in my tests, I am experiencing peaks in latency (about 200ms) on average in 1 sample out of 10,000. The peaks totally disappear if I use GOGC=off. Since this service is working on an industrial production line, the line stops every few minutes due to this latency. Any chance to have this issue fixed for 1.8?

dr2chase commented 8 years ago

fyi, Austin's herding cats this week at a conference, so he may not get back to you immediately.

Do you understand the distinction Austin is making between GC stop-the-world latency, and the possibility of a concurrent GC "accidentally" stealing too much of a worker goroutine's forward progress and thus causing whole-system latency?

gianpaolog commented 8 years ago

Yes: not a STW issue, but Austin was speaking about a concurrency issue that could be solved, anyhow.

gianpaolog commented 8 years ago

In the end of the day, users are experiencing high latency peaks due to GC behaviour, and this is a big limitation for a number of applications. In my case, I would not be able to replace the existing C service with Go.

RLH commented 8 years ago

This is certainly on the list for 1.8. To get us to focus on it early and often all we need is a simple example that we can run locally.

On Tue, Jun 21, 2016 at 10:39 AM, Gianpaolo Gasparini < notifications@github.com> wrote:

@aclements https://github.com/aclements in my tests, I am experiencing peaks in latency (about 200ms) on average in 1 sample out of 10,000. The peaks totally disappear if I use GOGC=off. Since this service is working on an industrial production line, the line stops every few minutes due to this latency. Any chance to have this issue fixed for 1.8?

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/golang/go/issues/15847#issuecomment-227460627, or mute the thread https://github.com/notifications/unsubscribe/AA7Wn7oemnpU3BjLjWaRqkLskIGUJOGxks5qN_eggaJpZM4Inih6 .

gianpaolog commented 8 years ago

@RLH here you are: https://www.dropbox.com/sh/693jdhz2uef061y/AAAwV3hTVFvYfkk1Im5pvnb_a?dl=0

Instructions: Once unzipped L1.7z (contains a text file with sample data): 1) run ets3 (the server listening on UDP port 4420) 2) run "./client 127.0.0.1:4420 L1"

L1 file contains about 1 million of fake messages to be sent to the server. The average answer takes around 0.3ms, but every 8-10k samples, you'll see a major latency.

RLH commented 8 years ago

Thanks, I was able to reproduce the issue locally.

On Tue, Jun 21, 2016 at 4:42 PM, Gianpaolo Gasparini < notifications@github.com> wrote:

@RLH https://github.com/RLH here you are: https://www.dropbox.com/sh/693jdhz2uef061y/AAAwV3hTVFvYfkk1Im5pvnb_a?dl=0

Instructions: Once unzipped L1.7z (contains a text file with sample data): 1) run ets3 (the server listening on UDP port 4420) 2) run "./client 127.0.0.1:4420 L1"

L1 file contains about 1 million of fake messages to be sent to the server. The average answer takes around 0.3ms, but every 8-10k samples, you'll see a major latency.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/golang/go/issues/15847#issuecomment-227565210, or mute the thread https://github.com/notifications/unsubscribe/AA7Wn93dMCWFDlBayLBA7tzYGwZ2kXpeks5qOEyZgaJpZM4Inih6 .

gianpaolog commented 8 years ago

Hi @RLH , just to let you know that we repeated the test on the same machine with .Net Core. We had exactly 2 peaks in the beginning of the test (loading something?) around 12ms, while the rest of the million samples were stable around 0.3ms. Please don't tell me I must use the old C#... ;-)

genez commented 8 years ago

just an update, I am following the issue #16293 and I was thinking that the change done to the GC could be good for this issue aswell. I am now re-running my tests with up-to-date master go source and absolute maximum latency has dropped from 200ms to 21ms In 1 million request/reply cycles I only had 5 occurrences over 20ms and 21 occurrences between 10ms and 20ms. this is a great improvement!

aclements commented 8 years ago

@genez, that's good to hear! Note that the fix for issue #16293 is the same as the patch you tried earlier (CL 23540) and reported didn't fix the issue. However, it's possible that combined with some other fixes on master helped.

I'm still planning to fix the problem with idle mark workers running when there is actual work to do, which was one of the root causes we worked out in this issue.

quentinmit commented 8 years ago

@aclements: Should this issue be retitled to be specifically about idle mark workers and moved out of Go1.8Early?

aclements commented 8 years ago

Closing, since @genez's original application seems to be doing much better. We'll track the remaining issue with idle mark workers running when there's real work to do at #16528.

golang / go

runtime: >200ms GC latency in test programs (calling runtime.GC?) #15847