leikahing opened this issue 7 years ago
My first suspicion is that carbonserver is returning series with mixed density, as your request is right on a rollup boundary. This would frequently break the logic in the internal aggregateSeries func, which expects all series to be of consistent length.

If you make it `from=-8d`, does the problem persist?
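To illustrate the failure mode (this is a toy sketch, not carbonapi's actual aggregateSeries code), here is what a length-assuming, element-wise aggregation does when handed series of mixed density:

```go
package main

import "fmt"

// toySum is a deliberately naive element-wise sum that, like an aggregator
// assuming every input series has the same length, indexes each series by
// the length of the first one. Illustration only, not carbonapi code.
func toySum(series [][]float64) []float64 {
	out := make([]float64, len(series[0]))
	for _, s := range series {
		for i := range out {
			out[i] += s[i] // panics with "index out of range" when len(s) < len(out)
		}
	}
	return out
}

func main() {
	// One series still at its raw resolution, one already rolled up:
	// mixed densities give different lengths for the same time window.
	raw := []float64{1, 1, 1, 1, 1}
	rolledUp := []float64{5} // shorter, lower-resolution series
	fmt.Println(toySum([][]float64{raw, rolledUp}))
}
```

Running this panics with the same kind of `runtime error: index out of range` reported later in this issue.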
@nnuss - I did suspect that it might be related to the aggregation boundary. If I do `-8d` it works consistently and returns the data every time. Same if I do anything up to the aggregation window (like `-6d` or `-5d`), or if I keep going out further like `-30d`.

Attempting to query data at the rollup periods (`-1d`, `-7d`, `-1y`), I can consistently reproduce the problem of either getting data back or getting back `[]`.
That strongly suggests to me that carbonserver is racing with the whisper writes and/or time skew on the backend(s).
@Civil is there any go-carbon / carbonserver tuning to help here?
@birryree how far beyond `-7d` is required to get a consistent view? `-7d1s`? `-7d1min`?
We would like carbonapi to resolve this but it requires more meta information to be available during render: retention bands, aggregation function, and x-files factor for each series.
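As a rough sketch of the kind of per-series metadata that would have to travel with each render response, the types below are hypothetical (they are not an existing carbonzipper or carbonserver structure):

```go
package meta

// RetentionBand describes one archive of a whisper file.
type RetentionBand struct {
	SecondsPerPoint int // step of this archive, e.g. 60
	Points          int // number of points retained at this step
}

// SeriesMeta is the extra information carbonapi would need per series in
// order to merge mixed-resolution data safely. Field names are illustrative.
type SeriesMeta struct {
	Retentions          []RetentionBand // full retention schedule
	AggregationFunction string          // "average", "sum", "max", ...
	XFilesFactor        float32         // fraction of non-null points required to aggregate
}
```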
First of all, try using go-carbon's carbonserver. It has had support for that since 0.9.0 and it's also covered by some tests, so it will be easier to fix the error there (if it isn't already fixed there). Also, you can run it not as a separate instance, but as part of your main go-carbon.
@nnuss If I modify the first query seen in the stack trace log entry above and use `from=-7d1s`, it consistently returns results. I wrote a shell script to hit the API hundreds of times, and every time a non-empty data list was returned. If I modify my script to use `from=-7d`, the results are back to alternating between `[]` and non-empty again.
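For reference, a rough Go equivalent of that kind of repeated-query check (the URL and target below are placeholders; substitute your own carbonapi endpoint and metric, and switch `from=-7d` to `from=-7d1s` to compare the two behaviours):

```go
package main

import (
	"fmt"
	"io"
	"net/http"
)

func main() {
	// Placeholder endpoint; point this at your own carbonapi render URL.
	const url = "http://localhost:8080/render/?format=json&from=-7d&target=some.metric.*"

	empty := 0
	for i := 0; i < 100; i++ {
		resp, err := http.Get(url)
		if err != nil {
			fmt.Println("request failed:", err)
			continue
		}
		body, _ := io.ReadAll(resp.Body)
		resp.Body.Close()
		if string(body) == "[]" { // empty JSON list, i.e. no data returned
			empty++
		}
	}
	fmt.Printf("empty responses: %d/100\n", empty)
}
```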
I checked the backend servers - every server involved runs `ntpd` and is synchronized to the same server pool. Checking their dates, they all report the same time.
@Civil I was originally trying to run go-carbon's carbonserver when I set this stack up, but could never get it running - `netstat` didn't show anything listening on the port I configured it to run on. I don't see anything in the logs, even with `debug` logging.
I'm running the latest 0.9.0 release via .deb package - my configuration for go-carbon follows:
```toml
[common]
user = "carbon"
logfile = "/var/log/go-carbon/go-carbon.log"
log-level = "debug"
graph-prefix = "carbon.agents.eu-west-1.graphite-a"
metric-endpoint = "tcp://carbonrelay.internal:2003"
max-cpu = 3
metric-interval = "1m0s"

[whisper]
data-dir = "/mnt/data/whisper/"
schemas-file = "/etc/go-carbon/storage-schemas.conf"
aggregation-file = "/etc/go-carbon/storage-aggregation.conf"
workers = 8
max-updates-per-second = 0
sparse-create = false
enabled = true

[cache]
max-size = 1000000
write-strategy = "max"

[udp]
listen = ":2003"
enabled = true
log-incomplete = false
buffer-size = 0

[tcp]
listen = ":2003"
enabled = true
buffer-size = 0

[pickle]
listen = ":2004"
max-message-size = 67108864
enabled = true
buffer-size = 0

[carbonlink]
listen = "127.0.0.1:7002"
enabled = true
read-timeout = "30s"

[carbonserver]
listen = "127.0.0.1:8080"
enabled = true
buckets = 10
max-globs = 1000
metrics-as-counters = false
read-timeout = "5m0s"
write-timeout = "5m0s"
scan-frequency = "10m0s"

[dump]
enabled = false
path = ""
restore-per-second = 0

[pprof]
listen = "localhost:7007"
enabled = false
```
Interesting. Actually, if carbonserver support is enabled, it should print at least "[carbonserver] carbonserver support is still experimental, use at your own risk" in the logs. If it doesn't, there may be a problem with your deb package and what you've got is not 0.9. It might also be worth building it manually and seeing whether the carbonserver module works then.
@Civil As it turns out, when I execute `sudo service go-carbon stop` or `sudo service go-carbon restart`, it wasn't actually stopping/restarting the go-carbon daemon, so it was never re-reading my configuration and starting the carbonserver. I had to do a manual `kill` and bring the service back up for it to recognize my configuration changes and start carbonserver.
I'm running the current go-carbon off the head of `master`, and I've seen a slight improvement in some queries. I can now do this hardware scan of thousands of whisper files (with some success):
format=json&from=-7d&maxDataPoints=157&target=timeShift(sumSeries(consolidateBy(orp.{prod}.{ap-northeast-1,ap-northeast-1b,eu-west,eu-west-1,us-east,us-east-1}.hw.*.hostup,+'sum')),+'120s')&until=now
However, I'm still running into the issue of queries against aggregation boundaries sometimes returning `[]`.
I just spun this up today to test it, based off the 0.9.0 go-carbon and the latest master of zipper/api. sumSeries seems to break. I've attached the debug output from carbonzipper and the debug line from carbonserver1; I'm not sure how to interpret this.

These queries are for 3 hours, and the data most likely has gaps due to some UDP loss. I wonder if it's the nulls causing the issue? The best feature in the world, `keepLastValue`, does not seem to solve the issue the way it normally does with graphite!
My first aggregation block is at the 6 hour window. So it breaks on raw data for me.
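On the null question, this is roughly the forward-fill that graphite's keepLastValue performs (a minimal sketch ignoring its optional limit argument; illustrative only, not carbonapi's implementation):

```go
package fill

import "math"

// keepLastValue replaces NaN gaps with the most recent non-NaN value.
func keepLastValue(values []float64) []float64 {
	out := make([]float64, len(values))
	last := math.NaN()
	for i, v := range values {
		if math.IsNaN(v) {
			out[i] = last // stays NaN until the first real value is seen
		} else {
			out[i] = v
			last = v
		}
	}
	return out
}
```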
Well, that's interesting - I just realized one of the files has an old aggregator setting (as shown in the debug logs above). Legacy graphite handles this just fine. However, it looks like the problem occurs in carbonapi when the aggregations between two retrieved points don't match. I went in and whisper-resized the collectd tree, and now carbonapi seems consistent for me. But that doesn't explain the other issues listed here, unless in some cases carbonserver is not sending the same number of datapoints?
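A minimal sketch (not carbonapi's code) of what merging series with mismatched steps requires: consolidate the finer series down to the coarser step first, then combine point by point.

```go
package consolidate

// consolidateSum collapses every `factor` consecutive points into one by
// summing them, e.g. turning a 60s-step series into 300s buckets so it can
// be combined with a series that has already been rolled up to 300s.
func consolidateSum(values []float64, factor int) []float64 {
	out := make([]float64, 0, (len(values)+factor-1)/factor)
	for i := 0; i < len(values); i += factor {
		var sum float64
		for j := i; j < i+factor && j < len(values); j++ {
			sum += values[j]
		}
		out = append(out, sum)
	}
	return out
}
```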
This is occurring in the latest `master`, commit 9a17fbb.

I have been noticing some problems, with a number of queries resulting in `runtime error: index out of range` messages in carbonapi and carbonzipper. carbonzipper is configured to connect to a carbonserver on a separate server (a go-carbon box sitting on the same network).

carbonapi daemon parameters:

carbonzipper configuration:

carbonserver daemon parameters:

When I run queries for time periods like the last 15 minutes, the last 30 minutes, or the last 6 hours, things run fine.
When I expand my query out to last 7 days, that's when I start noticing this behavior.
I will get the following panic on certain queries in the `carbonapi` log file:

Retrying the query multiple times sometimes succeeds, but most of the time it fails and returns an empty list.
Another query I'm seeing this with is:
Here, the wildcard `*` actually expands out to 5000+ unique subfolders (server metrics for every server the service has or has ever had running). At the 7-day mark and beyond, I can't get data for that query at all.

All whisper files in question have the following storage schema set on them:
I have another server running an old version of Graphite (0.9.12), where all my metrics are mirrored, that is serving up these queries just fine, so I'm wondering where my configuration might be falling short.