Issue by guillaumeautran, closed 7 years ago
My config looks like this:
instance = "default"
max_procs = 2
listen_addr = "127.0.0.1:2003"
#pickle_addr = "0.0.0.0:2013"
admin_addr = "127.0.0.1:2004"
#http_addr = "0.0.0.0:8881"
spool_dir = "/var/spool/carbon-relay-ng"
#spool_dir = "spool"
pid_file = "/var/run/carbon-relay-ng.pid"
#one of critical error warning notice info debug
log_level = "notice"
# How long to keep track of invalid metrics seen
# Useful time units are "s", "m", "h"
bad_metrics_max_age = "24h"
# Metric name validation strictness for legacy metrics. Valid values are:
# strict - Block anything that can upset graphite: valid characters are [A-Za-z0-9_-.]; consecutive dots are not allowed
# medium - Valid characters are ASCII; no embedded NULLs
# none - No validation is performed
validation_level_legacy = "medium"
# Metric validation for carbon2.0 (metrics2.0) metrics.
# Metrics that contain = or _is_ are assumed carbon2.0.
# Valid values are:
# medium - checks for unit and mtype tag, presence of another tag, and consistency (use = or _is_, not both)
# none - No validation is performed
validation_level_m20 = "medium"
# you can also validate that each series has increasing timestamps
validate_order = false
# put init commands here, in the same format as you'd use for the telnet interface
# here's some examples:
init = [
'addBlack prefix service_is_carbon-relay-ng',
'addBlack prefix collectd-raw_',
'addRewriter collectd_ collectd-raw_ 1',
'addRewriter statsd.gauge- gauge- 1',
'addRewriter statsd.latency- latency- 1',
'addRewriter statsd.derive- derive- 1',
'addRewriter statsd.count- counter- 1',
'addRewriter cpr_metrics_ metric- 1',
'addAgg sum collectd-raw_(.*) .sum.$1 30 10',
'addAgg stdev collectd-raw_(.*) .stdev.$1 30 10',
'addAgg avg collectd-raw_(.*) .avg.$1 30 10',
'addAgg max collectd-raw_(.*) .max.$1 30 10',
'addAgg min collectd-raw_(.*) .min.$1 30 10',
'addAgg last collectd-raw_(.*) .last.$1 30 10',
'addAgg delta collectd-raw_(.*) .delta.$1 30 10',
'addRoute sendAllMatch metrics-logstash regex=^(.stdev.|.sum.|.max.|.avg.|.min.|.last.|.delta.) localhost:2005 spool=true pickle=false',
'addRoute sendAllMatch black-carbon prefix=collectd-raw_ localhost:2003 spool=false pickle=false',
]
[instrumentation]
# in addition to serving internal metrics via expvar, you can send them to graphite/carbon
# IMPORTANT: setting this to "" will disable flushing, and metrics will pile up and lead to OOM
# see https://github.com/graphite-ng/carbon-relay-ng/issues/50
# so for now you MUST send them somewhere. sorry.
graphite_addr = "localhost:2003"
graphite_interval = 120000 # in ms (2mins)
My input is standard collectd data (a couple of thousand metrics every 10 seconds). carbon-relay-ng seems to panic when generating the aggregations.
Which version is this exactly? Looking at the code for 0.8.8, there is no slice lookup on line 248.
I'm running the 0.8.8-1 release.
I think there is something in aggregator.go:L248, in the
func (agg *Aggregator) run()
function, where a regex callback is made. The panic does not happen every time, so it must be data-dependent, but it happens 5 times out of 10...
I'll try to instrument the code and see what happens. I also need to do some reading on Go debuggers.
func (agg *Aggregator) run() {
	interval := time.Duration(agg.Interval) * time.Second
	ticker := getAlignedTicker(interval)
	for {
		select {
		case fields := <-agg.In:
			// note, we rely here on the fact that the packet has already been validated
			key := fields[0]
			matches := agg.regex.FindSubmatchIndex(key)
			if len(matches) == 0 {
				continue
			}
			value, _ := strconv.ParseFloat(string(fields[1]), 64)
			t, _ := strconv.ParseUint(string(fields[2]), 10, 0)
			ts := uint(t)
			var dst []byte
			outKey := string(agg.regex.Expand(dst, agg.outFmt, key, matches)) // <= here
			quantized := ts - (ts % agg.Interval)
			agg.AddOrCreate(outKey, quantized, value)
		case now := <-ticker.C:
			thresh := now.Add(-time.Duration(agg.Wait) * time.Second)
			agg.Flush(uint(thresh.Unix()))
			ticker = getAlignedTicker(interval)
		case <-agg.snapReq:
			...
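For reference, the quantization step in the loop above buckets each incoming timestamp to the start of its aggregation interval. A standalone sketch (the interval and timestamps here are made up for illustration):

```go
package main

import "fmt"

func main() {
	// Same arithmetic as 'quantized := ts - (ts % agg.Interval)' in the aggregator.
	interval := uint(30)
	for _, ts := range []uint{100, 119, 120, 121} {
		quantized := ts - (ts % interval)
		fmt.Printf("ts=%d -> bucket=%d\n", ts, quantized)
	}
	// ts=100 -> bucket=90
	// ts=119 -> bucket=90
	// ts=120 -> bucket=120
	// ts=121 -> bucket=120
}
```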
My mistake, the panic originates out of the regex package; I was looking in the wrong place. It has something to do with the number of matching groups in the regular expression, I think. Though your config looks sane (1 matching group, and you reference only 1).
I think the best way forward is if you could add a print line right before the offending line, i.e. right before
outKey := string(agg.regex.Expand(dst, agg.outFmt, key, matches))
Something like:
fmt.Printf("calling Expand.outFmt=%s, key=%s, matches=%v\n", agg.outFmt, key, matches)
Ok, we are getting somewhere:
calling Expand.outFmt=.delta.$1, key=collectd-raw_robot-ph-cpe22-14_.gauge-metric-battery_HEALTH_SAFETY_TIMER_EXPIRE.value, matches=[0 85 13 85]
calling Expand.outFmt=.max.$1, key=collectd-raw_robot-ph-cpe22-14_.gauge-metric-battery_HEALTH_GOOD.value, matches=[0 70 13 70]
calling Expand.outFmt=.max.$1, key=collectd-raw_robot-ph-cpe22-14_.gauge-metric-battery_HEALTH_SAFETY_TIMER_EXPIRE.value, matches=[0 85 13 85]
calling Expand.outFmt=.max.$1, key=collectd-raw_robot-ph-cpe22-14_.statsd.gauge-cpr_metrics_battery_HEALTH_UNSPEC_FAILURE.value, matches=[0 92 13 92]
panic: runtime error: slice bounds out of range
goroutine 19 [running]:
regexp.(*Regexp).expand(0xc421201d60, 0xc4209884e8, 0x0, 0x8, 0xc4209884e5, 0x2, 0xc421016120, 0x5c, 0x55, 0x0, ...)
/usr/local/go/src/regexp/regexp.go:828 +0x3de
regexp.(*Regexp).Expand(0xc421201d60, 0x0, 0x0, 0x0, 0xc4211f2e98, 0x7, 0x8, 0xc421016120, 0x5c, 0x55, ...)
/usr/local/go/src/regexp/regexp.go:793 +0x10f
github.com/graphite-ng/carbon-relay-ng/aggregator.(*Aggregator).run(0xc4200b4b40)
/home/gautran/go_code/src/github.com/graphite-ng/carbon-relay-ng/aggregator/aggregator.go:249 +0x502
created by github.com/graphite-ng/carbon-relay-ng/aggregator.New
/home/gautran/go_code/src/github.com/graphite-ng/carbon-relay-ng/aggregator/aggregator.go:180 +0x462
Don't mind the line numbers as this is debug code.
See how the last metric, collectd-raw_robot-ph-cpe22-14_.statsd.gauge-cpr_metrics_battery_HEALTH_UNSPEC_FAILURE.value, has only partially been re-written by the rewriter rules?
Is my configuration bogus, or is something else wrong?
Yep, I am seeing a consistent pattern of metrics not being fully re-written, which causes the panic in the aggregator.
I'll need a bit of help understanding how the chain works here and how the rewriting rules get applied.
> See how the last metric:
Hmm, I don't see it. I recommend you either join our Slack (http://slack.raintank.io/) and the carbon-relay-ng channel, or ping me on IRC on freenode (Dieterbe), and we can figure this out faster.
Carbon relay aborts with a panic message.