grafana / metrictank

metrics2.0 based, multi-tenant timeseries store for Graphite and friends.
GNU Affero General Public License v3.0

MT cpu spike due to GC -> requests can easily take >5s to get answered #172

Open Dieterbe opened 8 years ago

Dieterbe commented 8 years ago

i'm going to look into techniques to lower GC cpu overhead. we currently reference a lot of data through pointers, i suspect we may be able to lower GC quite a bit by being smarter about this.
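One way to picture the pointer idea above is the following minimal Go sketch. It is not metrictank code: `Point`, `buildPointers`, `buildValues`, and `timeGC` are hypothetical names made up for illustration. It contrasts a slice of pointers, where the collector has to trace every element, with a slice of plain values, whose backing array contains no pointers and does not need to be scanned.

```go
package main

import (
	"fmt"
	"runtime"
	"time"
)

// Point is a hypothetical datapoint type (an assumption, not metrictank's struct).
type Point struct {
	Ts  uint32
	Val float64
}

// buildPointers allocates every point separately, so the GC has n pointers to trace.
func buildPointers(n int) []*Point {
	out := make([]*Point, n)
	for i := range out {
		out[i] = &Point{Ts: uint32(i), Val: float64(i)}
	}
	return out
}

// buildValues stores the points inline in a single pointer-free backing array.
func buildValues(n int) []Point {
	out := make([]Point, n)
	for i := range out {
		out[i] = Point{Ts: uint32(i), Val: float64(i)}
	}
	return out
}

// timeGC forces a full collection and reports how long the call took.
func timeGC() time.Duration {
	start := time.Now()
	runtime.GC()
	return time.Since(start)
}

func main() {
	const n = 1 << 20

	ptrs := buildPointers(n)
	fmt.Println("forced GC with pointer slice live:", timeGC())
	runtime.KeepAlive(ptrs) // keep the pointer slice reachable during the GC above

	ptrs = nil
	runtime.GC() // drop the pointer-heavy data before the second measurement

	vals := buildValues(n)
	fmt.Println("forced GC with value slice live:  ", timeGC())
	runtime.KeepAlive(vals)
}
```

The absolute timings depend on the machine and Go version, so the sketch only suggests which way the difference should go. It does not say how large the effect would be for the real workload.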

Dieterbe commented 8 years ago

test with applied to raintank-docker to auto-apply a realistic workload results in:
sys, MT charts (correlation between GC and latency spikes visible on the duration chart)


cat attack.out | vegeta report 2>&1 | egrep -v 'connection reset|timed out|timeout'
Requests      [total, rate]            60000, 200.00
Duration      [total, attack, wait]    5m26.855960507s, 4m59.994999853s, 26.860960654s
Latencies     [mean, 50, 95, 99, max]  17.006800796s, 13.661618151s, 43.008771756s, 53.01141618s, 2m7.401387905s
Bytes In      [total, mean]            1756638977, 29277.32
Bytes Out     [total, mean]            0, 0.00
Success       [ratio]                  68.44%
Status Codes  [code:count]             200:41066  0:18934  
Error Set:
root@benchmark:/opt/raintank/raintank-tsdb-benchmark# cat attack.out | vegeta report 2>&1 | egrep -c 'connection reset|timed out|timeout'

new sys, new MT charts


cat vegeta-after
root@benchmark:/opt/raintank/raintank-tsdb-benchmark# cat attack.out | vegeta report 2>&1 | egrep -v 'connection reset|timed out|timeout'
Requests      [total, rate]            60000, 200.00
Duration      [total, attack, wait]    5m42.394862196s, 4m59.994999882s, 42.399862314s
Latencies     [mean, 50, 95, 99, max]  17.811976008s, 14.105108138s, 43.010377182s, 53.013462631s, 1m14.607911294s
Bytes In      [total, mean]            1677464219, 27957.74
Bytes Out     [total, mean]            0, 0.00
Success       [ratio]                  66.55%
Status Codes  [code:count]             200:39932  0:20068  
Error Set:
root@benchmark:/opt/raintank/raintank-tsdb-benchmark# cat attack.out | vegeta report 2>&1 | egrep -c 'connection reset|timed out|timeout'

=> my test was probably using too many req/s or something; it seemed graphite-api itself had trouble keeping up. however, we can still tell what we need to tell:
=> no discernible change. similar latency spikes at GC runs

Dieterbe commented 8 years ago

confirmed again using latest golang master, which includes Austin's fix.

Dieterbe commented 7 years ago

latest master has GC changes that should help

Dieterbe commented 7 years ago

a fix was merged in Go for : , this has shown good results for large maps (see also ). It will likely fix our issue as well, we just need to test it. Only problem is it's in git master, and there most likely won't be a 1.7.x release for it, so we have to use go from git master and/or wait for 1.8

dgryski commented 7 years ago

Is it reasonable to cherry-pick that fix onto 1.7.1 ?

Dieterbe commented 7 years ago

i'll just run a bench in raintank-docker. now is especially a good time because of!topic/golang-dev/Ab1sFeoZg_8 also

Dieterbe commented 7 years ago