grafana / carbon-relay-ng

Fast carbon relay+aggregator with admin interfaces for making changes online - production ready
Other
467 stars 151 forks source link

panic: runtime error: slice bounds out of range #197

Closed guillaumeautran closed 7 years ago

guillaumeautran commented 7 years ago

Carbon relay abort with a panic message.

panic: runtime error: slice bounds out of range

goroutine 26 [running]:
regexp.(*Regexp).expand(0xc421207e00, 0xc4214cbbf0, 0x0, 0x8, 0xc4214cbbde, 0x2, 0xc4212e4870, 0x4b, 0x46, 0x0, ...)
    /usr/local/go/src/regexp/regexp.go:828 +0x3de
regexp.(*Regexp).Expand(0xc421207e00, 0x0, 0x0, 0x0, 0xc4211fae98, 0x7, 0x8, 0xc4212e4870, 0x4b, 0x46, ...)
    /usr/local/go/src/regexp/regexp.go:793 +0x10f
github.com/graphite-ng/carbon-relay-ng/aggregator.(*Aggregator).run(0xc42014aa80)
    /home/ubuntu/go/src/github.com/graphite-ng/carbon-relay-ng/aggregator/aggregator.go:248 +0x39f
created by github.com/graphite-ng/carbon-relay-ng/aggregator.New
    /home/ubuntu/go/src/github.com/graphite-ng/carbon-relay-ng/aggregator/aggregator.go:180 +0x462
guillaumeautran commented 7 years ago

My config looks like that:

instance = "default"

max_procs = 2

listen_addr = "127.0.0.1:2003"
#pickle_addr = "0.0.0.0:2013"
admin_addr = "127.0.0.1:2004"
#http_addr = "0.0.0.0:8881"
spool_dir = "/var/spool/carbon-relay-ng"
#spool_dir = "spool"
pid_file = "/var/run/carbon-relay-ng.pid"
#one of critical error warning notice info debug
log_level = "notice"
# How long to keep track of invalid metrics seen
# Useful time units are "s", "m", "h"
bad_metrics_max_age = "24h"
# Metric name validation strictness for legacy metrics. Valid values are:
# strict - Block anything that can upset graphite: valid characters are [A-Za-z0-9_-.]; consecutive dots are not allowed
# medium - Valid characters are ASCII; no embedded NULLs
# none   - No validation is performed
validation_level_legacy = "medium"
# Metric validation for carbon2.0 (metrics2.0) metrics.
# Metrics that contain = or _is_ are assumed carbon2.0.
# Valid values are:
# medium - checks for unit and mtype tag, presence of another tag, and constency (use = or _is_, not both)
# none   - No validation is performed
validation_level_m20 = "medium"

# you can also validate that each series has increasing timestamps
validate_order = false

# put init commands here, in the same format as you'd use for the telnet interface
# here's some examples:
init = [
      'addBlack prefix service_is_carbon-relay-ng',
      'addBlack prefix collectd-raw_',

      'addRewriter collectd_ collectd-raw_ 1',
      'addRewriter statsd.gauge- gauge- 1',
      'addRewriter statsd.latency- latency- 1',
      'addRewriter statsd.derive- derive- 1',
      'addRewriter statsd.count- counter- 1',
      'addRewriter cpr_metrics_ metric- 1',

      'addAgg sum collectd-raw_(.*) .sum.$1 30 10',
      'addAgg stdev collectd-raw_(.*) .stdev.$1 30 10',
      'addAgg avg collectd-raw_(.*) .avg.$1 30 10',
      'addAgg max collectd-raw_(.*) .max.$1 30 10',
      'addAgg min collectd-raw_(.*) .min.$1 30 10',
      'addAgg last collectd-raw_(.*) .last.$1 30 10',
      'addAgg delta collectd-raw_(.*) .delta.$1 30 10',

      'addRoute sendAllMatch metrics-logstash regex=^(.stdev.|.sum.|.max.|.avg.|.min.|.last.|.delta.)  localhost:2005 spool=true pickle=false',
      'addRoute sendAllMatch black-carbon prefix=collectd-raw_  localhost:2003 spool=false pickle=false',

]

[instrumentation]
# in addition to serving internal metrics via expvar, you can send them to graphite/carbon
# IMPORTANT: setting this to "" will disable flushing, and metrics will pile up and lead to OOM
# see https://github.com/graphite-ng/carbon-relay-ng/issues/50
# so for now you MUST send them somewhere. sorry.
graphite_addr = "localhost:2003"
graphite_interval = 120000  # in ms (2mins)

And my input is standard collectd data coming in (a couple of thousands every 10 seconds). Carbon seems to panic when generating the aggregations.

Dieterbe commented 7 years ago

which version is this exactly? looking at the code for 0.8.8 there is no slice lookup on line 248

guillaumeautran commented 7 years ago

Running the release 0.8.8-1 version. I think there is something in the aggregator.go:L248 -> func (agg *Aggregator) run() function. A regex callback is made. The panic does not happen everytime though. So, it must be data dependent but it happens 5 times out of 10... I'll try to instrument the code and see what happens. Also need to do some reading on go debuggers.

func (agg *Aggregator) run() {
        interval := time.Duration(agg.Interval) * time.Second
    ticker := getAlignedTicker(interval)
        for {
                select {
                case fields := <-agg.In:
                        // note, we rely here on the fact that the packet has already been validated                                                                            
                    key := fields[0]

                        matches := agg.regex.FindSubmatchIndex(key)
                if len(matches) == 0 {
                                continue
                    }
                        value, _ := strconv.ParseFloat(string(fields[1]), 64)
                        t, _ := strconv.ParseUint(string(fields[2]), 10, 0)
                        ts := uint(t)

                        var dst []byte
here =>           outKey := string(agg.regex.Expand(dst, agg.outFmt, key, matches))
                        quantized := ts - (ts % agg.Interval)
            agg.AddOrCreate(outKey, quantized, value)
                case now := <-ticker.C:
                        thresh := now.Add(-time.Duration(agg.Wait) * time.Second)
                        agg.Flush(uint(thresh.Unix()))
                        ticker = getAlignedTicker(interval)
        case <-agg.snapReq:
        ...
Dieterbe commented 7 years ago

my mistake, the panic originates out of the regex package, i was looking in the wrong place. it has something to do with the number of matching groups in the regular expression I think. though your config looks sane (1 matching group, and you reference only 1)

I think the best way forward is if you could add a print line right before the offending line. ie right before outKey := string(agg.regex.Expand(dst, agg.outFmt, key, matches)). something like fmt.Printf("calling Expand.outFmt=%s, key=%s, matches=%v\n", agg.outFmt, key, matches)

guillaumeautran commented 7 years ago

Ok, we are getting somewhere:

calling Expand.outFmt=.delta.$1, key=collectd-raw_robot-ph-cpe22-14_.gauge-metric-battery_HEALTH_SAFETY_TIMER_EXPIRE.value, matches=[0 85 13 85]
calling Expand.outFmt=.max.$1, key=collectd-raw_robot-ph-cpe22-14_.gauge-metric-battery_HEALTH_GOOD.value, matches=[0 70 13 70]
calling Expand.outFmt=.max.$1, key=collectd-raw_robot-ph-cpe22-14_.gauge-metric-battery_HEALTH_SAFETY_TIMER_EXPIRE.value, matches=[0 85 13 85]
calling Expand.outFmt=.max.$1, key=collectd-raw_robot-ph-cpe22-14_.statsd.gauge-cpr_metrics_battery_HEALTH_UNSPEC_FAILURE.value, matches=[0 92 13 92]
panic: runtime error: slice bounds out of range

goroutine 19 [running]:
regexp.(*Regexp).expand(0xc421201d60, 0xc4209884e8, 0x0, 0x8, 0xc4209884e5, 0x2, 0xc421016120, 0x5c, 0x55, 0x0, ...)
    /usr/local/go/src/regexp/regexp.go:828 +0x3de
regexp.(*Regexp).Expand(0xc421201d60, 0x0, 0x0, 0x0, 0xc4211f2e98, 0x7, 0x8, 0xc421016120, 0x5c, 0x55, ...)
    /usr/local/go/src/regexp/regexp.go:793 +0x10f
github.com/graphite-ng/carbon-relay-ng/aggregator.(*Aggregator).run(0xc4200b4b40)
    /home/gautran/go_code/src/github.com/graphite-ng/carbon-relay-ng/aggregator/aggregator.go:249 +0x502
created by github.com/graphite-ng/carbon-relay-ng/aggregator.New
    /home/gautran/go_code/src/github.com/graphite-ng/carbon-relay-ng/aggregator/aggregator.go:180 +0x462

Don't mind the line numbers as this is debug code.

See how the last metric: collectd-raw_robot-ph-cpe22-14_.statsd.gauge-cpr_metrics_battery_HEALTH_UNSPEC_FAILURE.value has only partially be re-written by the rewriter rule?

Is my configuration bogus or is something else wrong?

guillaumeautran commented 7 years ago

Yeap, I am seeing a consistent pattern of metrics not being fully re-written and causing the panic in the aggregator.

I'll need a bit of help to understand how the chain goes here and how the rewriting rules get applied.

Dieterbe commented 7 years ago

See how the last metric:

hmm i don't see it. i recommend you either join our slack http://slack.raintank.io/ and join the carbon-relay-ng channel, or ping me on irc on freenode (Dieterbe) and we can figure this out faster.