grafana / metrictank

metrics2.0 based, multi-tenant timeseries store for Graphite and friends.
GNU Affero General Public License v3.0

MT cpu spike due to GC -> requests can easily take >5s to get answered #172

Open Dieterbe opened 8 years ago

Dieterbe commented 8 years ago

I'm going to look into techniques to lower GC CPU overhead. We currently reference a lot of data through pointers, and I suspect we can reduce GC work quite a bit by being smarter about this.
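
To illustrate the general idea (a minimal hypothetical sketch, not metrictank code): the Go GC has to trace every pointer it finds on the heap, so holding N points as separate pointer-referenced objects gives the collector N extra things to scan, while a pointer-free layout of the same data is skipped entirely during marking.

```go
// Hypothetical illustration only. Point, pointerHeavy and pointerFree are made
// up for this sketch; they are not metrictank types.
package main

type Point struct {
	Ts  uint32
	Val float64
}

// pointerHeavy: one heap object per point, each of which the GC must trace.
func pointerHeavy(n int) []*Point {
	out := make([]*Point, n)
	for i := range out {
		out[i] = &Point{Ts: uint32(i), Val: float64(i)}
	}
	return out
}

// pointerFree: a single allocation with no interior pointers; the GC can skip
// its contents during marking regardless of how many points are stored.
func pointerFree(n int) []Point {
	out := make([]Point, n)
	for i := range out {
		out[i] = Point{Ts: uint32(i), Val: float64(i)}
	}
	return out
}

func main() {
	_ = pointerHeavy(1000000)
	_ = pointerFree(1000000)
}
```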

Dieterbe commented 8 years ago

Test with https://gist.github.com/Dieterbe/bda3f2af50c56146e98580a03c2b6eaa applied to raintank-docker to automatically apply a realistic workload. Results:

sys: https://snapshot.raintank.io/dashboard/snapshot/1zc8flsQTV4pyjOv6fIm5BXH3eird4kF
MT: https://snapshot.raintank.io/dashboard/snapshot/hvtuSiLV0CDtJy31zWdDKQ2ZQWQOW1VI (the GC/latency-spike correlation is visible on the duration chart)

vegeta:

cat attack.out | vegeta report 2>&1 | egrep -v 'connection reset|timed out|timeout'
Requests      [total, rate]            60000, 200.00
Duration      [total, attack, wait]    5m26.855960507s, 4m59.994999853s, 26.860960654s
Latencies     [mean, 50, 95, 99, max]  17.006800796s, 13.661618151s, 43.008771756s, 53.01141618s, 2m7.401387905s
Bytes In      [total, mean]            1756638977, 29277.32
Bytes Out     [total, mean]            0, 0.00
Success       [ratio]                  68.44%
Status Codes  [code:count]             200:41066  0:18934  
Error Set:
root@benchmark:/opt/raintank/raintank-tsdb-benchmark# cat attack.out | vegeta report 2>&1 | egrep -c 'connection reset|timed out|timeout'
2163

new sys: https://snapshot.raintank.io/dashboard/snapshot/i9TIko5tB522Wh8RQVgR7BG6BZmjmFna
new MT: https://snapshot.raintank.io/dashboard/snapshot/wFketZbpnZUZjxg1QbIZmNNJgSoJdnUn

vegeta:

cat vegeta-after
root@benchmark:/opt/raintank/raintank-tsdb-benchmark# cat attack.out | vegeta report 2>&1 | egrep -v 'connection reset|timed out|timeout'
Requests      [total, rate]            60000, 200.00
Duration      [total, attack, wait]    5m42.394862196s, 4m59.994999882s, 42.399862314s
Latencies     [mean, 50, 95, 99, max]  17.811976008s, 14.105108138s, 43.010377182s, 53.013462631s, 1m14.607911294s
Bytes In      [total, mean]            1677464219, 27957.74
Bytes Out     [total, mean]            0, 0.00
Success       [ratio]                  66.55%
Status Codes  [code:count]             200:39932  0:20068  
Error Set:
root@benchmark:/opt/raintank/raintank-tsdb-benchmark# cat attack.out | vegeta report 2>&1 | egrep -c 'connection reset|timed out|timeout'
2253

=> My test was probably using too many req/s or something similar; it seemed graphite-api itself had trouble keeping up. However, we can still draw the conclusion we need: no discernible change, with similar latency spikes at GC runs.
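
One way to observe that correlation from inside the process (a hedged sketch, not metrictank's actual instrumentation): periodically log recent GC pauses via runtime/debug and line them up against the request-duration chart.

```go
// Sketch only: log GC pause stats on a fixed interval so they can be compared
// against request latency graphs. Not metrictank code.
package main

import (
	"log"
	"runtime/debug"
	"time"
)

func main() {
	var stats debug.GCStats
	for range time.Tick(10 * time.Second) {
		debug.ReadGCStats(&stats)
		if len(stats.Pause) > 0 {
			// stats.Pause[0] is the most recent stop-the-world pause.
			log.Printf("GCs=%d lastPause=%v totalPause=%v",
				stats.NumGC, stats.Pause[0], stats.PauseTotal)
		}
	}
}
```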

Dieterbe commented 8 years ago

Confirmed again using the latest Go master, which includes Austin's fix.

Dieterbe commented 7 years ago

Latest master has GC changes that should help.

Dieterbe commented 7 years ago

A fix for https://github.com/golang/go/issues/16293 was merged in Go (https://github.com/golang/go/commit/cf4f1d07a189125a8774a923a3259126599e942b). It has shown good results for large maps (see also https://github.com/spion/hashtable-latencies/issues/13) and will likely fix our issue as well; we just need to test it. The only problem is that it's in git master, and there most likely won't be a 1.7.x release containing it, so we have to build Go from git master and/or wait for 1.8.
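
A rough way to compare the two runtimes for this specific case (a hypothetical micro-benchmark, not part of metrictank): keep a large map live, force a collection, and read the stop-the-world pause out of runtime.MemStats, building the same program with Go 1.7 and with Go from master.

```go
// Hypothetical micro-benchmark: keeps a large map live and reports the most
// recent stop-the-world GC pause. Build with Go 1.7 and with Go from master
// to compare pause times. Sizes are arbitrary.
package main

import (
	"fmt"
	"runtime"
	"time"
)

func main() {
	// A big map gives the collector many buckets to work through.
	m := make(map[int64][]byte)
	for i := int64(0); i < 5000000; i++ {
		m[i] = make([]byte, 16)
	}

	start := time.Now()
	runtime.GC() // force a full collection
	fmt.Printf("forced GC took %v (wall clock)\n", time.Since(start))

	var ms runtime.MemStats
	runtime.ReadMemStats(&ms)
	// PauseNs is a circular buffer; (NumGC+255)%256 indexes the most recent pause.
	fmt.Printf("last STW pause: %v\n", time.Duration(ms.PauseNs[(ms.NumGC+255)%256]))

	runtime.KeepAlive(m)
}
```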

dgryski commented 7 years ago

Is it reasonable to cherry-pick that fix onto 1.7.1?

Dieterbe commented 7 years ago

I'll just run a benchmark in raintank-docker. Now is an especially good time, also because of https://groups.google.com/forum/m/#!topic/golang-dev/Ab1sFeoZg_8.

Dieterbe commented 7 years ago

https://github.com/golang/go/issues/14812