grafana / metrictank

metrics2.0 based, multi-tenant timeseries store for Graphite and friends.

new version blows up when opening for requests #180

Closed Dieterbe closed 8 years ago

Dieterbe commented 8 years ago

it's the first time seeing this in prod. it happened when deploying f040a9e5741ca16c84ebc763314f7b4391c98900 with agg-settings = 10min:6h:2:38d:true,2h:6h:2:120d:true. not sure whether it's because of these changes or because the load pattern has changed (increased) due to recent extra signups:

i've seen this twice:

MT dash: https://grafana-monitor.raintank.io/dashboard/snapshot/IWAPH7sho06f0n0TgBmdWsuQykwZfVmM

sys dash: https://grafana-monitor.raintank.io/dashboard/snapshot/Fv7Rb3G0yR5enyZF7oS2kF2iyMMaBPIW

will try to repro in QA. hypothesis: maybe an extensive GC workload is maxing out cpu and blocking progress, but mem usage shouldn't be that high i think. gctrace will help.

Dieterbe commented 8 years ago

no luck yet. in raintank-docker i'm trying to fill up a bunch of data (up to 3 days back) and then suddenly launch a similar request workload as seen in prod:

+./fake_metrics_to_nsq -keys-per-org 100 -orgs 10 -statsd-addr statsdaemon:8125 -nsqd-tcp-address nsqd:4150 -offset 3d -speedup 400 -stop-at-now &> /var/log/raintank/fake.log
+./fake_metrics_to_nsq -keys-per-org 100 -orgs 10 -statsd-addr statsdaemon:8125 -nsqd-tcp-address nsqd:4150 &> /var/log/raintank/fake2.log &

followed by:

$ cat repro-prod-issue.sh
#!/bin/sh
./inspect-es -es-index metric -format vegeta-mt -from 3d -es-addr elasticsearch:9200 > mc.txt
./inspect-es -es-index metric -format vegeta-mt -from 9min -es-addr elasticsearch:9200 > m.txt
ls -alh mc.txt m.txt

cat mc.txt | sed 's#18763#6063#' | ./vegeta attack -duration 100s -rate 500 > mc.out &
cat m.txt | sed 's#18763#6063#'  | ./vegeta attack -duration 100s -rate 5000 > m.out &
echo "launched!"
sleep 110

but MT handles this just fine and memory only goes up by a gig or so. maybe that's because i'm working with a smaller amount of metrics here (1k vs 500k in prod). it would take too long to on-demand backload a similar volume on my laptop, and since there's no caching going on (at least not in MT), requesting the same data multiple times should work just as well. notably in prod:

maybe i can do a qa run with similar-to-prod metric volume.

now i'm trying something else: i suspect that maybe the index for the new graphite endpoint introduces too many new pointers and keeps the GC busier, so i'm using basically a steady instream like so:

+./fake_metrics_to_nsq -keys-per-org 100 -orgs 100 -statsd-addr statsdaemon:8125 -nsqd-tcp-address nsqd:4150 -offset 3h -speedup 10 -stop-at-now &> /var/log/raintank/fake.log
+./fake_metrics_to_nsq -keys-per-org 100 -orgs 100 -statsd-addr statsdaemon:8125 -nsqd-tcp-address nsqd:4150 &> /var/log/raintank/fake2.log &

and just running metric-tank with +GODEBUG=gctrace=1 ./metric_tank and a patched GC like so:

diff --git a/src/runtime/mgc.go b/src/runtime/mgc.go
index 94301c6..b0d3528 100644
--- a/src/runtime/mgc.go
+++ b/src/runtime/mgc.go
@@ -1297,6 +1297,7 @@ func gcMarkTermination() {
                if work.mode != gcBackgroundMode {
                        print(" (forced)")
                }
+               print(" sw ", gcController.scanWork, " bgscancred ", gcController.bgScanCredit)
                print("\n")
                printunlock()
        }

latest master: sys dash: https://snapshot.raintank.io/dashboard/snapshot/D1A8qzn673G2v01syp32bxlZ516cH8qb MT dash: https://snapshot.raintank.io/dashboard/snapshot/j5hPeum00o8Itpa8234TYC8cV2bfKgkz

dd69883 (pre graphite endpoint): MT dash: https://snapshot.raintank.io/dashboard/snapshot/o8J1vnIUnjSNV11wmo2hw2uc10XHh7zA sys dash: https://snapshot.raintank.io/dashboard/snapshot/ta3trxmlqTPL1I0l1vaY4Z7LE82dOQP3

unfortunately all cpu/mem looks identical. i also compared the GC data using gcvis (patched for the 2 new params) and everything looks identical, also the timings :(

Dieterbe commented 8 years ago

a85f3af (previous stable in prod, aka pkg-0.1.0-1457484036ubuntu1) MT: https://snapshot.raintank.io/dashboard/snapshot/KU1EmofIUNT6EusRvPrbNkvwvIIcwF5v sys: https://snapshot.raintank.io/dashboard/snapshot/1jDols6042Vx2CLPotc5QQqt2NRvpa1T looks the same again. for the record, attached the gc view, which also looks the same.

Dieterbe commented 8 years ago

been doing more testing using a new mt4 instance that's not in the pool and that has the "broken" MT installed, then used gor to replicate prod traffic to mt4:

dieter@metric-tank-2-prod:~$ sudo ./gor --input-raw :18763 --output-http metric-tank-4-prod:18763

also ran loops like

while true ;do let i=i+1; echo $i; curl 'http://localhost:18763/debug/pprof/profile?seconds=1' > mt4-cpu-$i; done
while true ;do let i=i+1; echo $i; curl http://localhost:18763/debug/pprof/heap > mt4-heap-$i; done

to get profiles right before the crash. unfortunately those don't seem to show anything interesting.

~ ❯❯❯ go tool pprof --inuse_space mt4-bin mt4-heap-2119
Entering interactive mode (type "help" for commands)
(pprof) top50 -cum
2800.22MB of 2823.27MB total (99.18%)
Dropped 195 nodes (cum <= 14.12MB)
      flat  flat%   sum%        cum   cum%
         0     0%     0%  2823.27MB   100%  runtime.goexit
         0     0%     0%  1844.66MB 65.34%  github.com/nsqio/go-nsq.(*Consumer).handlerLoop
         0     0%     0%  1844.66MB 65.34%  main.(*Handler).HandleMessage
      27MB  0.96%  0.96%  1093.60MB 38.74%  main.(*AggMetrics).GetOrCreate
  760.09MB 26.92% 27.88%  1066.60MB 37.78%  main.NewAggMetric
         0     0% 27.88%   968.06MB 34.29%  main.main
         0     0% 27.88%   968.06MB 34.29%  runtime.main
  104.01MB  3.68% 31.56%   900.58MB 31.90%  main.NewAggregator
         0     0% 31.56%   709.56MB 25.13%  main.(*AggMetric).Add
         0     0% 31.56%   586.57MB 20.78%  main.(*DefCache).Backfill
         0     0% 31.56%   586.57MB 20.78%  main.NewDefCache
         0     0% 31.56%   530.07MB 18.78%  github.com/raintank/raintank-metric/metricdef.(*DefsEs).GetMetrics
   85.51MB  3.03% 34.59%   530.07MB 18.78%  github.com/raintank/raintank-metric/schema.MetricDefinitionFromJSON
  177.51MB  6.29% 40.88%   516.03MB 18.28%  main.NewChunk
         0     0% 40.88%   476.52MB 16.88%  main.(*AggMetric).addAggregators
         0     0% 40.88%   476.52MB 16.88%  main.(*Aggregator).Add
         0     0% 40.88%   476.52MB 16.88%  main.(*Aggregator).flush
         0     0% 40.88%   444.56MB 15.75%  encoding/json.Unmarshal
         0     0% 40.88%   439.56MB 15.57%  encoding/json.(*decodeState).object
         0     0% 40.88%   439.56MB 15.57%  encoding/json.(*decodeState).unmarshal
         0     0% 40.88%   439.56MB 15.57%  encoding/json.(*decodeState).value
  381.48MB 13.51% 54.39%   381.48MB 13.51%  main.NewCassandraStore
  338.52MB 11.99% 66.38%   338.52MB 11.99%  github.com/dgryski/go-tsz.New
  203.51MB  7.21% 73.59%   203.51MB  7.21%  fmt.Sprintf
      26MB  0.92% 74.51%   191.53MB  6.78%  github.com/dgryski/go-tsz.(*Series).Push
         0     0% 74.51%   191.53MB  6.78%  main.(*Chunk).Push
         0     0% 74.51%   181.55MB  6.43%  reflect.Value.SetMapIndex
  181.55MB  6.43% 80.94%   181.55MB  6.43%  reflect.mapassign
  167.53MB  5.93% 86.87%   167.53MB  5.93%  github.com/dgryski/go-tsz.(*bstream).writeBits
         0     0% 86.87%   158.51MB  5.61%  encoding/json.(*decodeState).literal
  158.51MB  5.61% 92.49%   158.51MB  5.61%  encoding/json.(*decodeState).literalStore
         0     0% 92.49%       86MB  3.05%  encoding/json.(*decodeState).array
   56.50MB  2.00% 94.49%    56.50MB  2.00%  main.(*DefCache).Backfill.func1
         0     0% 94.49%    48.50MB  1.72%  reflect.MakeSlice
   48.50MB  1.72% 96.21%    48.50MB  1.72%  reflect.unsafe_NewArray
         0     0% 96.21%    35.50MB  1.26%  reflect.MakeMap
   35.50MB  1.26% 97.47%    35.50MB  1.26%  reflect.makemap
         0     0% 97.47%       33MB  1.17%  github.com/raintank/raintank-metric/msg.(*MetricData).DecodeMetricData
    1.50MB 0.053% 97.52%       33MB  1.17%  github.com/raintank/raintank-metric/schema.(*MetricDataArray).UnmarshalMsg
    1.50MB 0.053% 97.57%    31.50MB  1.12%  github.com/raintank/raintank-metric/schema.(*MetricData).UnmarshalMsg
      30MB  1.06% 98.63%       30MB  1.06%  github.com/tinylib/msgp/msgp.ReadStringBytes
         0     0% 98.63%    15.50MB  0.55%  reflect.Value.Convert
   15.50MB  0.55% 99.18%    15.50MB  0.55%  reflect.cvtBytesString
(pprof) 
~ ❯❯❯ go tool pprof mt4-bin mt4-cpu-60
Entering interactive mode (type "help" for commands)
(pprof) top50 -cum
3.52s of 3.54s total (99.44%)
Dropped 6 nodes (cum <= 0.02s)
      flat  flat%   sum%        cum   cum%
         0     0%     0%      3.54s   100%  runtime.goexit
         0     0%     0%      2.85s 80.51%  main.Get
         0     0%     0%      2.85s 80.51%  main.get.func1
     0.26s  7.34%  7.34%      2.85s 80.51%  main.graphiteRaintankJSON
         0     0%  7.34%      2.85s 80.51%  net/http.(*ServeMux).ServeHTTP
         0     0%  7.34%      2.85s 80.51%  net/http.(*conn).serve
         0     0%  7.34%      2.85s 80.51%  net/http.HandlerFunc.ServeHTTP
         0     0%  7.34%      2.85s 80.51%  net/http.serverHandler.ServeHTTP
     0.06s  1.69%  9.04%      1.47s 41.53%  strconv.AppendFloat
     0.11s  3.11% 12.15%      1.41s 39.83%  strconv.genericFtoa
     0.19s  5.37% 17.51%      1.30s 36.72%  strconv.bigFtoa
         0     0% 17.51%      1.29s 36.44%  runtime.growslice
     1.06s 29.94% 47.46%      1.06s 29.94%  runtime.memmove
         0     0% 47.46%      0.76s 21.47%  runtime.growslice_n
         0     0% 47.46%      0.69s 19.49%  runtime.gcBgMarkWorker
         0     0% 47.46%      0.69s 19.49%  runtime.gcDrain
     0.41s 11.58% 59.04%      0.68s 19.21%  runtime.scanobject
     0.08s  2.26% 61.30%      0.68s 19.21%  strconv.formatDigits
     0.19s  5.37% 66.67%      0.60s 16.95%  strconv.fmtF
     0.32s  9.04% 75.71%      0.32s  9.04%  runtime.duffzero
     0.29s  8.19% 83.90%      0.29s  8.19%  runtime.memclr
     0.07s  1.98% 85.88%      0.21s  5.93%  strconv.AppendUint
     0.13s  3.67% 89.55%      0.18s  5.08%  runtime.greyobject
     0.09s  2.54% 92.09%      0.14s  3.95%  strconv.formatBits
     0.09s  2.54% 94.63%      0.09s  2.54%  runtime.heapBitsForObject
     0.05s  1.41% 96.05%      0.07s  1.98%  strconv.(*decimal).Assign
         0     0% 96.05%      0.05s  1.41%  runtime.heapBits.setMarked
     0.05s  1.41% 97.46%      0.05s  1.41%  runtime/internal/atomic.Or8
     0.05s  1.41% 98.87%      0.05s  1.41%  strconv.(*decimal).Round
         0     0% 98.87%      0.03s  0.85%  runtime.mallocgc
         0     0% 98.87%      0.03s  0.85%  runtime.rawmem
         0     0% 98.87%      0.03s  0.85%  runtime.systemstack
         0     0% 98.87%      0.02s  0.56%  runtime.heapBits.initSpan
         0     0% 98.87%      0.02s  0.56%  runtime.largeAlloc
         0     0% 98.87%      0.02s  0.56%  runtime.mallocgc.func3
     0.02s  0.56% 99.44%      0.02s  0.56%  strconv.trim

another experiment, bisecting locally compiled MT instances:

sudo service metric_tank stop; sudo cp metric_tank /usr/sbin/metric_tank ; sudo service metric_tank start; ps aux | grep metric_tank

Dieterbe commented 8 years ago

ok so we now know 6e6cb96 is the culprit, or somehow triggers a bug elsewhere.

8d18f34313097fbda66bfd99a321267790a34df5 ("disable colors in metric_tank's logs") -> instant crash: https://snapshot.raintank.io/dashboard/snapshot/i8CzC9yqctFBkHHWpkLN7NvnsbTTeCGf

same build but with 6e6cb96 reverted: works fine again: https://snapshot.raintank.io/dashboard/snapshot/qeM4pfKB8ret4vwVuA7F7b032u2vve4u

I also tried some patches, all visible here https://gist.github.com/Dieterbe/e9414e91564f185ea7c7 specifically:

the focus on fix() was due to #171 and also because that's what got changed in 6e6cb96.

Dieterbe commented 8 years ago

got another mem profile, this time with

+       runtime.MemProfileRate = 1
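
for context, a minimal sketch (not metric_tank's actual main()) of what that setting does: runtime.MemProfileRate = 1 makes the heap profiler record every single allocation instead of sampling, so it has to be set before the allocations you care about happen. the listener below is hypothetical, it just reuses the port the curl loops above scraped profiles from.

package main

import (
	"net/http"
	_ "net/http/pprof" // registers the /debug/pprof handlers the curl loops hit
	"runtime"
)

func main() {
	// record every allocation in the heap profile instead of the default
	// sampling rate; must be set before the allocations of interest happen
	runtime.MemProfileRate = 1

	// hypothetical pprof listener, reusing the port from the profiling loops
	go http.ListenAndServe(":18763", nil)

	select {} // stand-in for the rest of the program
}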

the resulting profile shows:

(pprof) top50 -cum
25753346.64kB of 26036249.83kB total (98.91%)
Dropped 633 nodes (cum <= 130181.25kB)
      flat  flat%   sum%        cum   cum%
         0     0%     0% 26035269.52kB   100%  runtime.goexit
         0     0%     0% 20136622.62kB 77.34%  main.getTargets.func1
         0     0%     0% 20136621.12kB 77.34%  main.getTarget
20134415.44kB 77.33% 77.33% 20134415.44kB 77.33%  main.fix
    3.72kB 1.4e-05% 77.33% 3419776.31kB 13.13%  net/http.(*conn).serve
         0     0% 77.33% 3416172.34kB 13.12%  net/http.(*ServeMux).ServeHTTP
         0     0% 77.33% 3416172.34kB 13.12%  net/http.HandlerFunc.ServeHTTP
         0     0% 77.33% 3416172.34kB 13.12%  net/http.serverHandler.ServeHTTP
   48.81kB 0.00019% 77.33% 3416147.31kB 13.12%  main.Get
         0     0% 77.33% 3416147.31kB 13.12%  main.get.func1
   88.34kB 0.00034% 77.33% 3411948.09kB 13.10%  main.graphiteRaintankJSON
         0     0% 77.33% 3411709.25kB 13.10%  strconv.AppendFloat
         0     0% 77.33% 3411709.25kB 13.10%  strconv.bigFtoa
3411709.25kB 13.10% 90.44% 3411709.25kB 13.10%  strconv.fmtF
         0     0% 90.44% 3411709.25kB 13.10%  strconv.formatDigits
         0     0% 90.44% 3411709.25kB 13.10%  strconv.genericFtoa
         0     0% 90.44% 1467987.61kB  5.64%  github.com/nsqio/go-nsq.(*Consumer).handlerLoop
         0     0% 90.44% 1467982.89kB  5.64%  main.(*Handler).HandleMessage
27464.53kB  0.11% 90.54% 1144988.02kB  4.40%  main.(*AggMetrics).GetOrCreate
800744.55kB  3.08% 93.62% 1117522.61kB  4.29%  main.NewAggMetric
    0.33kB 1.3e-06% 93.62% 1004349.55kB  3.86%  main.main
         0     0% 93.62% 1004349.55kB  3.86%  runtime.main
105592.69kB  0.41% 94.02% 950334.19kB  3.65%  main.NewAggregator
    0.16kB 6e-07% 94.02% 613466.55kB  2.36%  main.NewDefCache
         0     0% 94.02% 613466.39kB  2.36%  main.(*DefCache).Backfill
         0     0% 94.02% 556115.67kB  2.14%  github.com/raintank/raintank-metric/metricdef.(*DefsEs).GetMetrics
97334.16kB  0.37% 94.40% 556110.27kB  2.14%  github.com/raintank/raintank-metric/schema.MetricDefinitionFromJSON
         0     0% 94.40% 458785.92kB  1.76%  encoding/json.Unmarshal
         0     0% 94.40% 454360.55kB  1.75%  encoding/json.(*decodeState).object
         0     0% 94.40% 454360.55kB  1.75%  encoding/json.(*decodeState).unmarshal
         0     0% 94.40% 454360.55kB  1.75%  encoding/json.(*decodeState).value
390641.62kB  1.50% 95.90% 390749.15kB  1.50%  main.NewCassandraStore
    0.22kB 8.4e-07% 95.90% 293175.67kB  1.13%  main.(*AggMetric).Add
211988.97kB  0.81% 96.71% 212170.38kB  0.81%  fmt.Sprintf
70345.72kB  0.27% 96.98% 211037.25kB  0.81%  main.NewChunk
         0     0% 96.98% 194668.31kB  0.75%  reflect.Value.SetMapIndex
194668.31kB  0.75% 97.73% 194668.31kB  0.75%  reflect.mapassign
         0     0% 97.73% 167608.80kB  0.64%  encoding/json.(*decodeState).literal
167608.39kB  0.64% 98.37% 167608.80kB  0.64%  encoding/json.(*decodeState).literalStore
         0     0% 98.37% 147612.88kB  0.57%  main.(*AggMetric).addAggregators
         0     0% 98.37% 147612.88kB  0.57%  main.(*Aggregator).Add
         0     0% 98.37% 147612.88kB  0.57%  main.(*Aggregator).flush
140691.44kB  0.54% 98.91% 140691.53kB  0.54%  github.com/dgryski/go-tsz.New
(pprof) list main.fix
Total: 24.83GB
ROUTINE ======================== main.fix in /home/dieter/go/src/github.com/raintank/raintank-metric/metric_tank/dataprocessor.go
   19.20GB    19.20GB (flat, cum) 77.33% of Total
         .          .     51:       start = from + interval - remain
         .          .     52:   }
         .          .     53:
         .          .     54:   // last point should be the last value that divides by interval lower than to (because to is always exclusive)
         .          .     55:   lastPoint := (to - 1) - ((to - 1) % interval)
   19.20GB    19.20GB     56:   out := make([]schema.Point, (lastPoint-start)/interval+1)
         .          .     57:
         .          .     58:   // t is the ts we're looking to fill.
         .          .     59:   // i iterates in
         .          .     60:   // o iterates out
         .          .     61:   for t, i, o := start, 0, -1; t <= lastPoint; t += interval {
(pprof)

Dieterbe commented 8 years ago

really weird that that line is the culprit since it's very similar to the original line.

i see two possibilities: 1) stack escape analysis treats this slice allocation differently than before

2) somewhere further down the read path the slice leaks and is not reclaimed. but the leak would have to be caused somehow by the perf improvement, which seems a bit unlikely. also i don't see where it could leak: divide() drops references to it, consolidate() passes it on, and ultimately after graphiteJSON/graphiteRaintankJSON we should have no more references to this slice.

Dieterbe commented 8 years ago

$ b -gcflags='-m' 2>&1 | grep dataprocessor.go:56
./dataprocessor.go:56: make([]schema.Point, (lastPoint - start) / interval + 1) escapes to heap
./dataprocessor.go:56: make([]schema.Point, (lastPoint - start) / interval + 1) escapes to heap

with patch reverted:

$ b -gcflags='-m' 2>&1 | grep dataprocessor.go:39
./dataprocessor.go:39: make([]schema.Point, 0, len(in)) escapes to heap
./dataprocessor.go:39: make([]schema.Point, 0, len(in)) escapes to heap

Dieterbe commented 8 years ago

aha! I think i found the bug. fix() is getting incoming requests such as from 1458966244 to 1458966274 (i.e. a 30s span) while interval is 60s, so len becomes 71582788. see http://play.golang.org/p/VSE-srJYOv
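
for posterity, a small standalone version of that arithmetic (same idea as the playground link above): the start/lastPoint computation mirrors the fix() excerpt from the profile, and it assumes uint32 timestamps, which is what makes the subtraction wrap around instead of going negative.

package main

import "fmt"

func main() {
	// values from the request above: a 30s span with a 60s interval
	var from, to, interval uint32 = 1458966244, 1458966274, 60

	// first ts >= from that divides by interval
	start := from
	if remain := from % interval; remain != 0 {
		start = from + interval - remain
	}
	// last ts < to that divides by interval (to is exclusive)
	lastPoint := (to - 1) - ((to - 1) % interval)

	// the window contains no aligned point, so lastPoint < start and the
	// uint32 subtraction wraps around to ~2^32
	fmt.Println("start", start, "lastPoint", lastPoint, "len", (lastPoint-start)/interval+1)
	// prints: start 1458966300 lastPoint 1458966240 len 71582788
}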

Dieterbe commented 8 years ago

the input slices seem to be empty in these cases btw:

+fmt.Println(">fix", in, "f", from, "t", to, "int", interval, " len", (lastPoint-start)/interval+1)

>fix [{0 1458967095} {0 1458967105} {0 1458967115}] f 1458967095 t 1458967125 int 10  len 3
>fix [] f 1458967095 t 1458967125 int 10  len 3
>fix [] f 1458967095 t 1458967125 int 60  len 71582788
>fix [{0 1458967095} {0 1458967105} {0 1458967115}] f 1458967095 t 1458967125 int 10  len 3
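
a sketch of the kind of guard that would avoid the runaway allocation (not necessarily the actual fix; Point stands in for schema.Point and uint32 timestamps are assumed as above): when the aligned window is empty, return 0 points instead of letting the unsigned subtraction wrap.

package main

type Point struct {
	Val float64
	Ts  uint32
}

// alignedCount returns how many interval-aligned timestamps fall in [from, to).
func alignedCount(from, to, interval uint32) uint32 {
	start := from
	if remain := from % interval; remain != 0 {
		start = from + interval - remain
	}
	lastPoint := (to - 1) - ((to - 1) % interval)
	if lastPoint < start {
		// e.g. a 30s span with a 60s interval: without this check the uint32
		// subtraction below wraps around and we'd allocate ~70M points
		return 0
	}
	return (lastPoint-start)/interval + 1
}

func main() {
	out := make([]Point, alignedCount(1458966244, 1458966274, 60))
	println(len(out)) // 0 instead of 71582788
}
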
Dieterbe commented 8 years ago

a similar issue was #164.