service crash with panic: slice bounds out of range

mattwwarren commented 7 years ago

Bug report

Relevant telegraf.conf:

Apologies if not all of this is relevant.

[global_tags]
  env = "prod"

[agent]
  interval = "10s"
  round_interval = false
  metric_batch_size = 2000
  metric_buffer_limit = 6000
  collection_jitter = "0s"
  flush_interval = "10s"
  flush_jitter = "3s"
  precision = ""
  debug = false
  quiet = false
  hostname = ""
  omit_hostname = false

[[outputs.influxdb]]
  urls = [
            "http://myinflux:8086",
              ]
  database = "api"
  retention_policy = ""
    namedrop = ["ops_*", "logparser*"]
    write_consistency = "any"
  timeout = "5s"

[[outputs.influxdb]]
  urls = [
            "http://myinflux:8086",
              ]
  database = "telegraf_prod"
  retention_policy = "telegraf_prod_default"
  namepass = ["ops_*", "logparser*"]
  write_consistency = "any"
  timeout = "5s"

[[inputs.logparser]]
  files = [
    "/var/log/httpd/*access.log",
    "/var/log/httpd/*combined.log",
    "/var/log/httpd/*forwarded.log",
  ]

  [inputs.logparser.grok]
  patterns = [
    "%{COMBINED_LOG_FORMAT_WITH_RESPONSE_TIME_US}"
  ]

### service input plugins ###
  custom_patterns = '''
  COMBINED_LOG_FORMAT_WITH_RESPONSE_TIME_US %{CLIENT:client_ip} %{NOTSPACE:ident} %{NOTSPACE:auth} \[%{HTTPDATE:ts:ts-httpd}\] "(?:%{WORD:verb:tag} %{NOTSPACE:request}(?: HTTP/%{NUMBER:http_version:float})?|%{DATA})" %{NUMBER:resp_code:int} (?:%{NUMBER:resp_bytes:int}|-) %{QS:referrer} %{QS:agent} %{NUMBER:response_time_us:int}
  '''

System info:

Telegraf v1.3.3 (git: release-1.3 46db92aad3e68af04f1732598ac89f6b9b11daf8) Amazon Linux latest

Steps to reproduce:

Set up a high volume application with apache logs running through telegraf
Wait?

Expected behavior:

Telegraf does not crash

Actual behavior:

After running for some number of hours, telegraf will crash with panic: runtime error: slice bounds out of range

Additional info:

goroutine 30 [running]:
github.com/influxdata/telegraf/metric.(*metric).Fields(0xc420789800, 0xc420e36fe0)
        /home/ubuntu/telegraf-build/src/github.com/influxdata/telegraf/metric/metric.go:318 +0x5c8
github.com/influxdata/telegraf/plugins/inputs/logparser.(*LogParserPlugin).parser(0xc420328000)
        /home/ubuntu/telegraf-build/src/github.com/influxdata/telegraf/plugins/inputs/logparser/logparser.go:224 +0x25f
created by github.com/influxdata/telegraf/plugins/inputs/logparser.(*LogParserPlugin).Start
        /home/ubuntu/telegraf-build/src/github.com/influxdata/telegraf/plugins/inputs/logparser/logparser.go:129 +0x56d

I glanced over the code but couldn't quite tell where the error bottomed out. I thought perhaps our custom log pattern was at the source of the issue but the only float field is the HTTP version and they are all 1.1

# cat /var/log/httpd/api.access.log | awk -F'HTTP/' '{print $2}' | awk -F'"' '{print $1}' | sort | uniq -c
 954368 1.1

mattwwarren commented 7 years ago

Also, this panic may have been referenced on #2488 but the request to open a new issue seems to have never been done.

danielnelson commented 7 years ago

Can you check if it panics again if you replay the same logs? You will probably need to use from_beginning = true.

danielnelson commented 7 years ago

I'm making a instrumented build that will improve the logging here if it panics, it will be a little slower than the 1.3.3 but otherwise the same. Would it be possible for you to use it until you get another crash?

mattwwarren commented 7 years ago

I am trying to replay the same log from a different server with debug logging. I will report back with the results of that.

Since this is crashing pretty regularly for us, I would be happy to run a custom build.

danielnelson commented 7 years ago

I added this code: https://github.com/influxdata/telegraf/commit/5a56291b0ec0e3e446b442419c909ce0c9cbd03e I think this should be enough to determine if the bug is in logparser or in the metrics code.

Here is an linux amd64 build: https://7429-33258973-gh.circle-artifacts.com/0/tmp/circle-artifacts.cD2V8ru/telegraf.gz

mattwwarren commented 7 years ago

Thank you. I will try to get that installed (although perhaps tomorrow)

Some additional good news, I was able to recreate the panic on a second machine with the same log and my pasted config.

danielnelson commented 7 years ago

Great, so long as we can reproduce it then it shouldn't be too hard to fix.

mattwwarren commented 7 years ago

I had just enough time to kick off a run before I left for the day.

panic: Recovered in metric.Fields(); m.fields: "agent=\"Mozilla/5.0 (Linux; Android 5.1; Bush Spira D2 5.5\\\\\" Smartphone Build/LMY47D) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/59.0.3071.125 Mobile Safari/537.36\",resp_code=204i,referrer=\"https://www.wanderu.com/en/tripsummary/GBNXP%2CGBCPIGBNXP%2CGBUBWGBNXP3%2C2017-07-13T01%3A40%3A00%2B01%3A00%2C2017-07-13T02%3A20%3A00%2B01%3A00%2C?query=anonymized\",response_time_us=1133i,client_ip=\"10.20.0.208\",auth=\"-\",ident=\"-\",request=\"/v2/auth.json\",http_version=1.1"

mattwwarren commented 7 years ago

Right on! Thanks for the quick turnaround. I look forward to 1.3.4!

danielnelson commented 7 years ago

I hope to have it out tomorrow, if you need a build of the 1.3 branch with the fix you can use https://7437-33258973-gh.circle-artifacts.com/0/tmp/circle-artifacts.XAvjpJX/telegraf.gz

Thanks for the help with this bug!

influxdata / telegraf