ma6174 / blog

Go GC issues #5

Open ma6174 opened 10 years ago

ma6174 commented 10 years ago




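At this point it looked as if the analysis had hit a dead end. Just then, service B logged a 500 whose error message was too many open files. This usually means service B itself has run out of file descriptors; once the count passes 6000, the error above becomes likely. The "files" here are not necessarily real files. They may be TCP connections, because on Linux everything is a file. Let's start by monitoring service B's file descriptor count. You can observe the count for one process with lsof -n -p pid | wc -l, where pid is the process ID. A shell script that polls it in a loop: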

# Print a timestamp and the lsof line count for the service every 0.5 s.
while true; do
        printf '%s\t' "$(date)"
        lsof -n -p "$(pidof service_name)" | wc -l
        sleep 0.5
done


Tue Jul 22 19:42:51 CST 2014    2715
Tue Jul 22 19:42:52 CST 2014    2713
Tue Jul 22 19:42:53 CST 2014    2711
Tue Jul 22 19:42:53 CST 2014    2711
Tue Jul 22 19:42:54 CST 2014    2708
Tue Jul 22 19:42:55 CST 2014    10177
Tue Jul 22 19:42:59 CST 2014    3103
Tue Jul 22 19:43:00 CST 2014    2722
Tue Jul 22 19:43:00 CST 2014    2719

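From the output above it is easy to see that at 19:42:55 the file descriptor count jumped from 2708 to 10177 and then quickly recovered. As noted earlier, once the count exceeds 6000 the too many open files error may appear, and here it climbed to roughly 10,000. I kept watching how often the fd (file descriptor) count spiked and found that during peak network hours it happened about once every 2 to 3 minutes. Is the fd spike what causes the slow requests? Going back through service A's logs, slow requests did indeed show up whenever the fd count spiked. If so, can we conclude that service B had too many requests at that moment, so new connections from service A could not be established and the requests became slow? That is only a guess, and a bigger question still needs an answer: why does service B's fd count spike so suddenly?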
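The same measurement can also be taken from Go. The following is a minimal sketch, not the tooling used in this post: it assumes a Linux host and counts the entries under /proc/<pid>/fd, and the helper name countFDs is made up for illustration. The numbers will not match lsof exactly, because lsof also lists entries such as the working directory and memory-mapped files, but a spike from about 2700 to about 10000 would show up the same way.

```go
package main

import (
	"fmt"
	"os"
	"time"
)

// countFDs returns the number of entries in dir. On Linux, pointing it at
// /proc/<pid>/fd gives the number of descriptors that process has open.
func countFDs(dir string) (int, error) {
	f, err := os.Open(dir)
	if err != nil {
		return 0, err
	}
	defer f.Close()
	// Readdirnames(-1) returns every name in the directory in one slice.
	names, err := f.Readdirnames(-1)
	if err != nil {
		return 0, err
	}
	return len(names), nil
}

func main() {
	// Assumption: the PID is passed as the first argument. The default
	// "self" makes the program inspect its own descriptors.
	pid := "self"
	if len(os.Args) > 1 {
		pid = os.Args[1]
	}
	dir := "/proc/" + pid + "/fd"
	// Take three samples, 0.5 s apart, in the same layout as the shell loop.
	for i := 0; i < 3; i++ {
		n, err := countFDs(dir)
		if err != nil {
			fmt.Fprintln(os.Stderr, "count failed:", err)
			return
		}
		fmt.Printf("%s\t%d\n", time.Now().Format(time.UnixDate), n)
		time.Sleep(500 * time.Millisecond)
	}
}
```

Reading /proc directly avoids starting an lsof process for every sample, which matters when sampling a busy machine twice a second.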



# Print a timestamp, then issue one request, five times per second.
while true; do
        date
        curl -o /dev/null > /dev/null 2>&1
        sleep 0.2
done


Tue Jul 22 18:26:09 CST 2014
Tue Jul 22 18:26:09 CST 2014
Tue Jul 22 18:26:09 CST 2014
Tue Jul 22 18:26:09 CST 2014
Tue Jul 22 18:26:14 CST 2014
Tue Jul 22 18:26:14 CST 2014
Tue Jul 22 18:26:14 CST 2014
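One way to read probe output like the above is to look for gaps between consecutive timestamps: with a 0.2 s sleep, several lines per second are expected, so a jump from 18:26:09 to 18:26:14 means a request hung for about five seconds. The sketch below assumes the timestamps are in the default `date` layout; the function name findStalls and the one-second threshold are illustrative choices, not part of the original analysis.

```go
package main

import (
	"fmt"
	"time"
)

// findStalls parses timestamps in the default `date` layout and returns every
// gap between consecutive lines that is longer than threshold.
func findStalls(lines []string, threshold time.Duration) ([]time.Duration, error) {
	var stalls []time.Duration
	var prev time.Time
	for i, line := range lines {
		// time.UnixDate is "Mon Jan _2 15:04:05 MST 2006".
		t, err := time.Parse(time.UnixDate, line)
		if err != nil {
			return nil, err
		}
		if i > 0 {
			if gap := t.Sub(prev); gap > threshold {
				stalls = append(stalls, gap)
			}
		}
		prev = t
	}
	return stalls, nil
}

func main() {
	// The probe output shown above, copied verbatim.
	lines := []string{
		"Tue Jul 22 18:26:09 CST 2014",
		"Tue Jul 22 18:26:09 CST 2014",
		"Tue Jul 22 18:26:09 CST 2014",
		"Tue Jul 22 18:26:09 CST 2014",
		"Tue Jul 22 18:26:14 CST 2014",
		"Tue Jul 22 18:26:14 CST 2014",
		"Tue Jul 22 18:26:14 CST 2014",
	}
	stalls, err := findStalls(lines, time.Second)
	if err != nil {
		fmt.Println("parse error:", err)
		return
	}
	fmt.Println("stalls longer than 1s:", stalls)
}
```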



Tue Jul 22 19:44:52 CST 2014    2708        │Tue Jul 22 19:44:57 CST 2014
Tue Jul 22 19:44:52 CST 2014    2710        │Tue Jul 22 19:44:57 CST 2014
Tue Jul 22 19:44:53 CST 2014    2707        │Tue Jul 22 19:44:58 CST 2014
Tue Jul 22 19:44:54 CST 2014    2709        │Tue Jul 22 19:44:58 CST 2014
Tue Jul 22 19:44:54 CST 2014    2729        │Tue Jul 22 19:44:58 CST 2014
Tue Jul 22 19:44:55 CST 2014    2707        │Tue Jul 22 19:44:58 CST 2014
Tue Jul 22 19:44:56 CST 2014    2709        │Tue Jul 22 19:45:03 CST 2014
Tue Jul 22 19:44:56 CST 2014    2714        │Tue Jul 22 19:45:03 CST 2014
Tue Jul 22 19:44:57 CST 2014    2718        │Tue Jul 22 19:45:03 CST 2014
Tue Jul 22 19:44:58 CST 2014    2710        │Tue Jul 22 19:45:03 CST 2014
Tue Jul 22 19:44:58 CST 2014    2713        │Tue Jul 22 19:45:04 CST 2014
Tue Jul 22 19:44:59 CST 2014    2713        │Tue Jul 22 19:45:04 CST 2014
Tue Jul 22 19:45:02 CST 2014    10219       │Tue Jul 22 19:45:04 CST 2014
Tue Jul 22 19:45:03 CST 2014    2877        │Tue Jul 22 19:45:04 CST 2014
Tue Jul 22 19:45:04 CST 2014    2721        │Tue Jul 22 19:45:04 CST 2014
Tue Jul 22 19:45:05 CST 2014    2710        │Tue Jul 22 19:45:05 CST 2014
Tue Jul 22 19:45:05 CST 2014    2710        │Tue Jul 22 19:45:05 CST 2014
Tue Jul 22 19:45:06 CST 2014    2714        │Tue Jul 22 19:45:05 CST 2014
Tue Jul 22 19:45:07 CST 2014    2716        │Tue Jul 22 19:45:05 CST 2014



  1. It refuses to accept requests
  2. No log output scrolls by at all



GOGCTRACE=1 /path/to/your/program


GODEBUG=gctrace=1 /path/to/your/program


gc1(1): 0+0+0 ms, 0 -> 0 MB 16 -> 18 (19-1) objects, 0(0) handoff, 0(0) steal, 0/0/0 yields
gc2(1): 0+0+0 ms, 0 -> 0 MB 29 -> 29 (30-1) objects, 0(0) handoff, 0(0) steal, 0/0/0 yields
gc3(1): 0+0+0 ms, 0 -> 0 MB 972 -> 747 (973-226) objects, 0(0) handoff, 0(0) steal, 0/0/0 yields
gc4(1): 0+0+0 ms, 0 -> 0 MB 1248 -> 904 (1474-570) objects, 0(0) handoff, 0(0) steal, 0/0/0 yields

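If you want to know what each field means, see the source code. What we care about here is when GC happens, how often it happens, and how long one collection takes. How effective a collection was can be judged from how much memory was freed, or from how many objects were reclaimed.

After turning on the GC log, we look at the logs again and find entries like this: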

2014/07/25 11:03:28 app log...
gc183(8): 18+4043+495 ms, 32426 -> 16950 MB 205909094 -> 3045275 (23371853344-23368808069) objects, 60(1982) handoff, 71(105762) steal, 564/267/314 yields
2014/07/25 11:03:33 app log...


Fri Jul 25 11:03:26 CST 2014    2463      │Fri Jul 25 11:03:27 CST 2014
Fri Jul 25 11:03:26 CST 2014    2461      │Fri Jul 25 11:03:27 CST 2014
Fri Jul 25 11:03:27 CST 2014    2460      │Fri Jul 25 11:03:27 CST 2014
Fri Jul 25 11:03:28 CST 2014    2459      │Fri Jul 25 11:03:28 CST 2014
Fri Jul 25 11:03:29 CST 2014    2462      │Fri Jul 25 11:03:28 CST 2014
Fri Jul 25 11:03:31 CST 2014    2462      │Fri Jul 25 11:03:28 CST 2014
Fri Jul 25 11:03:31 CST 2014    9738      │Fri Jul 25 11:03:28 CST 2014
Fri Jul 25 11:03:34 CST 2014    2501      │Fri Jul 25 11:03:33 CST 2014
Fri Jul 25 11:03:35 CST 2014    2500      │Fri Jul 25 11:03:33 CST 2014
Fri Jul 25 11:03:36 CST 2014    2492      │Fri Jul 25 11:03:34 CST 2014
Fri Jul 25 11:03:36 CST 2014    2493      │Fri Jul 25 11:03:34 CST 2014
Fri Jul 25 11:03:37 CST 2014    2494      │Fri Jul 25 11:03:34 CST 2014
Fri Jul 25 11:03:38 CST 2014    2490      │Fri Jul 25 11:03:35 CST 2014
Fri Jul 25 11:03:38 CST 2014    2490      │Fri Jul 25 11:03:35 CST 2014
Fri Jul 25 11:03:39 CST 2014    2490      │Fri Jul 25 11:03:35 CST 2014
Fri Jul 25 11:03:40 CST 2014    2489      │Fri Jul 25 11:03:35 CST 2014
Fri Jul 25 11:03:40 CST 2014    2493      │Fri Jul 25 11:03:36 CST 2014
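To make lines like gc183 easier to compare, the durations and heap sizes can be pulled out with a small parser. This is a sketch that only understands the old "a+b+c ms, X -> Y MB" layout shown above; the type and function names are invented here. It treats the sum of the three durations as an approximation of how long the collection took, which is an assumption, but for gc183 the sum of about 4.5 seconds is consistent with the application log going silent between 11:03:28 and 11:03:33.

```go
package main

import (
	"fmt"
	"regexp"
	"strconv"
)

// gcLine holds the fields this sketch extracts from one old-style trace line.
type gcLine struct {
	Seq          int    // GC cycle number, e.g. 183
	PhasesMS     [3]int // the three "a+b+c ms" durations
	HeapBeforeMB int
	HeapAfterMB  int
}

// TotalMS sums the three reported durations.
func (g gcLine) TotalMS() int {
	return g.PhasesMS[0] + g.PhasesMS[1] + g.PhasesMS[2]
}

// gcRe matches only the prefix of the line; the object and scheduler
// counters that follow are ignored.
var gcRe = regexp.MustCompile(`^gc(\d+)\(\d+\): (\d+)\+(\d+)\+(\d+) ms, (\d+) -> (\d+) MB`)

// parseGCLine reports false for lines that are not in the expected layout.
func parseGCLine(line string) (gcLine, bool) {
	m := gcRe.FindStringSubmatch(line)
	if m == nil {
		return gcLine{}, false
	}
	var v [6]int
	for i := range v {
		n, err := strconv.Atoi(m[i+1])
		if err != nil {
			return gcLine{}, false
		}
		v[i] = n
	}
	return gcLine{
		Seq:          v[0],
		PhasesMS:     [3]int{v[1], v[2], v[3]},
		HeapBeforeMB: v[4],
		HeapAfterMB:  v[5],
	}, true
}

func main() {
	line := "gc183(8): 18+4043+495 ms, 32426 -> 16950 MB 205909094 -> 3045275 (23371853344-23368808069) objects, 60(1982) handoff, 71(105762) steal, 564/267/314 yields"
	g, ok := parseGCLine(line)
	if !ok {
		fmt.Println("unrecognized trace line")
		return
	}
	fmt.Printf("gc%d: %d ms total, heap %d -> %d MB\n",
		g.Seq, g.TotalMS(), g.HeapBeforeMB, g.HeapAfterMB)
}
```

Feeding the whole log through parseGCLine and keeping only lines where TotalMS exceeds, say, 1000 gives a list of timestamps to line up against the fd spikes.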





Hello everyone, 

Our business suffered from an annoying problem. We are developing an 
iMessage-like service in Go, the server can serves hundreds of 
thousands of concurrent TCP connection per process, and it's robust 
(be running for about a month), which is awesome. However, the process 
consumes 16GB memory quickly, since there are so many connections, 
there are also a lot of goroutines and buffered memories used. I 
extend the memory limit to 64GB by changing runtime/malloc.h and 
runtime/malloc.goc. It works, but brings a big problem too - The 
garbage collecting process is then extremely slow, it stops the world 
for about 10 seconds every 2 minutes, and brings me some problems 
which are very hard to trace, for example, when stoping the world, 
messages delivered may be lost. This is a disaster, since our service 
is a real-time service which requires delivering messages as fast as 
possible and there should be no stops and message lost at all. 

I'm planning to split the "big server process" to many "small 
processes" to avoid this problem (smaller memory footprint results to 
smaller time stop), and waiting for Go's new GC implementation. 

Or any suggestions for me to improve our service currently? I don't 
know when Go's new latency-free garbage collection will occur. 




Hello everyone, 

Thanks for all your help, I updated our Go version to: 

go version devel +852ee39cc8c4 Mon Nov 19 06:53:58 2012 +1100 

and rebuilt our servers, now GC duration reduced to 1~2 seconds, it's 
a big improvement! 
Thank contributors on the new GC! 



gc43(8): 44+21+49805+104 us, 3925 -> 7850 MB, 50135750 (1781280527-1731144777) objects, 549097/394298/0 sweeps, 117(9232) handoff, 69(2683) steal, 969/439/2061 yields

Looking at the Go 1.3 release notes, the GC was indeed optimized:

Changes to the garbage collector

For a while now, the garbage collector has been precise when examining values in the heap; the Go 1.3 release adds equivalent precision to values on the stack. This means that a non-pointer Go value such as an integer will never be mistaken for a pointer and prevent unused memory from being reclaimed.

Starting with Go 1.3, the runtime assumes that values with pointer type contain pointers and other values do not. This assumption is fundamental to the precise behavior of both stack expansion and garbage collection. Programs that use package unsafe to store integers in pointer-typed values are illegal and will crash if the runtime detects the behavior. Programs that use package unsafe to store pointers in integer-typed values are also illegal but more difficult to diagnose during execution. Because the pointers are hidden from the runtime, a stack expansion or garbage collection may reclaim the memory they point at, creating dangling pointers.

Updating: Code that uses unsafe.Pointer to convert an integer-typed value held in memory into a pointer is illegal and must be rewritten. Such code can be identified by go vet.


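  1. Deploy more instances to share the load


  2. Introduce an object pool


  3. Change the GC concurrency (untested)


  4. Automatically retry on another machine when a request times out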
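For the object pool idea, one possible shape is sketched below. It uses sync.Pool, which was added to the standard library in Go 1.3; whether the service used sync.Pool or a hand-written pool is not stated here, so treat this purely as an illustration. The render function and the greeting it builds are made up. The point is that buffers are reused across requests instead of being allocated each time, which leaves less garbage for the collector to scan.

```go
package main

import (
	"bytes"
	"fmt"
	"sync"
)

// bufPool hands out reusable buffers. New is called only when the pool has
// nothing to offer.
var bufPool = sync.Pool{
	New: func() interface{} { return new(bytes.Buffer) },
}

// render builds a short string using a pooled buffer (hypothetical workload).
func render(name string) string {
	buf := bufPool.Get().(*bytes.Buffer)
	// A pooled buffer may still hold data from its previous user.
	buf.Reset()
	defer bufPool.Put(buf)

	buf.WriteString("hello, ")
	buf.WriteString(name)
	// String copies the bytes, so the result stays valid after Put.
	return buf.String()
}

func main() {
	for _, name := range []string{"service A", "service B"} {
		fmt.Println(render(name))
	}
}
```

Note that sync.Pool may drop its contents during a garbage collection, so it reduces allocation pressure but is not a fixed-size cache.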


zhwei commented 10 years ago

Issues just got a redesign, and this looks even more like a blog now...

ma6174 commented 10 years ago

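@zhwei The main page changed a bit: the title is on the left and bigger, the comments are on the right, and the big empty gap in the middle doesn't look great..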

ryancheung commented 9 years ago

Great article. As a Go beginner, I learned a lot~

andyxning commented 8 years ago


defp commented 8 years ago

Go 1.5 can keep STW pauses within 10 ms

hzj629206 commented 5 years ago

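> Go 1.5 can keep STW pauses within 10 ms

10 ms is still too long. A simple request should take less than 1 ms, yet heavy GC activity adds a dozen or so milliseconds of latency.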