influxdata / telegraf

Agent for collecting, processing, aggregating, and writing metrics, logs, and other arbitrary data.
https://influxdata.com/telegraf
MIT License

[1.2.1] slice bounds out of range #2488

Closed rossmcdonald closed 7 years ago

rossmcdonald commented 7 years ago

Bug report

Telegraf v1.2.1 (git: release-1.2 3b6ffb344e5c03c1595d862282a6823ecb438cff) 

Relevant telegraf.conf:

[agent]
collection_jitter = "0s"
debug = true
flush_buffer_when_full = true
flush_interval = "30s"
flush_jitter = "30s"
hostname = "hostname"
interval = "10s"
metric_buffer_limit = 10000
round_interval = true
quiet = false

[inputs]

[inputs.netstat]

[inputs.processes]

[inputs.tcp_listener]
allowed_pending_messages = 10000
max_tcp_connections = 250
data_format = "influx"
service_address = ":8090"

[outputs]

[outputs.influxdb]
database = "telegraf"
precision = "s"
urls = ["https://mydbhost:8086"]

[cpu]
drop = ["cpu_time"]
percpu = true
totalcpu = true

[disk]

[io]

[mem]

[swap]

[system]

Steps to reproduce:

Seeing this panic fairly regularly:

panic: runtime error: slice bounds out of range

goroutine 254475 [running]:
panic(0xf2b720, 0xc4200100b0)
        /usr/local/go/src/runtime/panic.go:500 +0x1a1
github.com/influxdata/telegraf/metric.(*metric).Fields(0xc4205b2480, 0xc42034a3c0)
        /home/ubuntu/telegraf-build/src/github.com/influxdata/telegraf/metric/metric.go:271 +0x422
github.com/influxdata/telegraf/metric.(*metric).Point(0xc4205b2480, 0xc42092e9b0)
        /home/ubuntu/telegraf-build/src/github.com/influxdata/telegraf/metric/metric.go:141 +0x80
github.com/influxdata/telegraf/plugins/outputs/influxdb.(*InfluxDB).Write(0xc4201b8100, 0xc420257b00, 0x42, 0x42, 0xc4205bb658, 0x6b3f41)
        /home/ubuntu/telegraf-build/src/github.com/influxdata/telegraf/plugins/outputs/influxdb/influxdb.go:194 +0x154
github.com/influxdata/telegraf/internal/models.(*RunningOutput).write(0xc42010a000, 0xc420257b00, 0x42, 0x42, 0x42, 0x60f612)
        /home/ubuntu/telegraf-build/src/github.com/influxdata/telegraf/internal/models/running_output.go:173 +0xa1
github.com/influxdata/telegraf/internal/models.(*RunningOutput).Write(0xc42010a000, 0x1157720, 0xc4204649f0)
        /home/ubuntu/telegraf-build/src/github.com/influxdata/telegraf/internal/models/running_output.go:157 +0x49c
github.com/influxdata/telegraf/agent.(*Agent).flush.func1(0xc4204649f0, 0xc42010a000)
        /home/ubuntu/telegraf-build/src/github.com/influxdata/telegraf/agent/agent.go:232 +0x68
created by github.com/influxdata/telegraf/agent.(*Agent).flush
        /home/ubuntu/telegraf-build/src/github.com/influxdata/telegraf/agent/agent.go:237 +0xb4

Let me know what other information is needed!

sparrc commented 7 years ago

It seems that this must be coming from the tcp_listener input plugin. For some reason, some metrics parse successfully via tcp_listener but are actually invalid and can't be translated to an InfluxDB point.

I think it's likely that this is fixed in 1.3, because the function that is panicking doesn't exist anymore. Instead of panicking, it should raise an error when the metric gets written to InfluxDB, and the metric will then just be dropped.

It would be good to figure out what the problem metrics are, so we can then write a unit test that will catch them.
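
For illustration, the failure mode described in this comment can be sketched as follows. This is a minimal sketch and not Telegraf's actual code: `parseFieldUnsafe`, `parseFieldSafe`, `errInvalidField`, and `panics` are hypothetical names, and the "key=value" layout is an assumption made only for the example. It shows how an unchecked index produces "slice bounds out of range", how a bounds check turns that into an error the caller can use to drop the metric, and how a unit test could detect which inputs panic.

```go
package main

import (
	"bytes"
	"errors"
	"fmt"
)

// errInvalidField is a hypothetical sentinel error for malformed input.
var errInvalidField = errors.New("invalid field")

// parseFieldUnsafe is a hypothetical stand-in for a parser that trusts its
// input. If '=' is missing, i is -1 and b[:i] panics with
// "slice bounds out of range".
func parseFieldUnsafe(b []byte) (key, value string) {
	i := bytes.IndexByte(b, '=')
	return string(b[:i]), string(b[i+1:])
}

// parseFieldSafe checks the index first and returns an error instead of
// panicking, so the caller can drop the bad metric and keep running.
func parseFieldSafe(b []byte) (key, value string, err error) {
	i := bytes.IndexByte(b, '=')
	if i <= 0 || i == len(b)-1 {
		return "", "", fmt.Errorf("%w: %q", errInvalidField, b)
	}
	return string(b[:i]), string(b[i+1:]), nil
}

// panics reports whether calling f panics. A unit test could use it to pin
// down which inputs crash a parser.
func panics(f func()) (didPanic bool) {
	defer func() {
		if r := recover(); r != nil {
			didPanic = true
		}
	}()
	f()
	return false
}

func main() {
	// Made-up inputs: one valid field and two malformed ones.
	inputs := [][]byte{[]byte("usage=42"), []byte("usage"), []byte("=42")}
	for _, in := range inputs {
		crashed := panics(func() { parseFieldUnsafe(in) })
		_, _, err := parseFieldSafe(in)
		fmt.Printf("%q: unsafe panics=%v, safe err=%v\n", in, crashed, err)
	}
}
```

Once the real problem metrics are known, the same pattern (feed the input, assert that no panic occurs and that an error is returned) could serve as the regression test.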

miniskipper commented 7 years ago

My telegraf also crashes repeatedly using the following config:

[global_tags]
[agent]
  interval = "10s"
  round_interval = true
  metric_batch_size = 1000
  metric_buffer_limit = 10000
  collection_jitter = "0s"
  flush_interval = "10s"
  flush_jitter = "0s"
  precision = ""
  debug = false
  quiet = false
  logfile = ""
  hostname = ""
  omit_hostname = false
[[outputs.influxdb]]
  urls = ["http://HOST:8086"] # required
  database = "DB" # required
  retention_policy = ""
  write_consistency = "any"
  timeout = "5s"
  username = "USER"
  password = "PASS"
 [[inputs.logparser]]
   files = ["/var/log/httpd/*access_log"]
   from_beginning = false
   [inputs.logparser.grok]
     patterns = ["%{CUSTOM_LOG_FORMAT}"]
     measurement = "apache_access_log"
     custom_patterns = '''
     CUSTOM_LOG_FORMAT %{IPORHOST:clientip} %{USER:ident} %{USER:auth} \[%{HTTPDATE:timestamp}\] "(?:%{WORD:verb} %{NOTSPACE:request}(?: HTTP/%{NUMBER:httpversion})?|%{DATA:rawrequest})" %{NUMBER:response} (?:%{NUMBER:bytes:int}|-) %{NUMBER:responsetime:int} us %{QS:Referrer} %{QS:Agent}
     '''

log output:

panic: runtime error: slice bounds out of range

goroutine 32 [running]:
panic(0xf2b720, 0xc4200100d0)
        /usr/local/go/src/runtime/panic.go:500 +0x1a1
github.com/influxdata/telegraf/metric.(*metric).Fields(0xc420094800, 0xc4206279c0)
        /home/ubuntu/telegraf-build/src/github.com/influxdata/telegraf/metric/metric.go:279 +0x5dd
github.com/influxdata/telegraf/plugins/inputs/logparser.(*LogParserPlugin).parser(0xc42014a090)
        /home/ubuntu/telegraf-build/src/github.com/influxdata/telegraf/plugins/inputs/logparser/logparser.go:205 +0x250
created by github.com/influxdata/telegraf/plugins/inputs/logparser.(*LogParserPlugin).Start
        /home/ubuntu/telegraf-build/src/github.com/influxdata/telegraf/plugins/inputs/logparser/logparser.go:131 +0x62d

danielnelson commented 7 years ago

@miniskipper I think your issue is unrelated. Can you open it as a new issue and include some sample logs? Ideally, try to find a sample log line that reproduces the crash.

gaving commented 7 years ago

Similar issue here when loading in an apache access_log w/ logparser:-

Telegraf v1.2.1 (git: release-1.2 3b6ffb344e5c03c1595d862282a6823ecb438cff)

[[inputs.logparser]]
  ## file(s) to tail:
  files = ["/tmp/input.log"]
  from_beginning = true
  name_override = "test_metric"

  ## For parsing logstash-style "grok" patterns:
  [inputs.logparser.grok]
    patterns = ["%{COMMON_LOG_FORMAT}"]

[[outputs.file]]
  ## Files to write to, "stdout" is a specially handled file.
  files = ["stdout", "/tmp/output.log"]
  data_format = "influx"

[[outputs.influxdb]]

  ## The full HTTP or UDP endpoint URL for your InfluxDB instance.
  urls = ["http://influxdb:8086"] # required
  ## The target database for metrics (telegraf will create it if not exists).
  database = "telegraf" # required
  ## Write timeout (for the InfluxDB client), formatted as a string.
  timeout = "5s"

panic: runtime error: slice bounds out of range

goroutine 14 [running]:
panic(0xf2b720, 0xc42000c0b0)
        /usr/local/go/src/runtime/panic.go:500 +0x1a1
github.com/influxdata/telegraf/metric.(*metric).Fields(0xc4207e8680, 0xc42091e4e0)
        /home/ubuntu/telegraf-build/src/github.com/influxdata/telegraf/metric/metric.go:279 +0x5dd
github.com/influxdata/telegraf/plugins/inputs/logparser.(*LogParserPlugin).parser(0xc42007a120)
        /home/ubuntu/telegraf-build/src/github.com/influxdata/telegraf/plugins/inputs/logparser/logparser.go:205 +0x250
created by github.com/influxdata/telegraf/plugins/inputs/logparser.(*LogParserPlugin).Start
        /home/ubuntu/telegraf-build/src/github.com/influxdata/telegraf/plugins/inputs/logparser/logparser.go:131 +0x62d

danielnelson commented 7 years ago

Closing, fixed in 1.3

itdevon commented 7 years ago

Is there a workaround until 1.3 comes out? The influx service keeps crashing with the same runtime error.

danielnelson commented 7 years ago

1.3 is out :)