Closed mattwwarren closed 7 years ago
Also, this panic may have been referenced in #2488, but the request there to open a new issue seems never to have been followed up.
Can you check if it panics again if you replay the same logs? You will probably need to use from_beginning = true.
I'm making an instrumented build that will improve the logging here if it panics. It will be a little slower than 1.3.3 but otherwise the same. Would it be possible for you to use it until you get another crash?
I am trying to replay the same log from a different server with debug logging. I will report back with the results of that.
Since this is crashing pretty regularly for us, I would be happy to run a custom build.
I added this code: https://github.com/influxdata/telegraf/commit/5a56291b0ec0e3e446b442419c909ce0c9cbd03e. I think this should be enough to determine whether the bug is in logparser or in the metrics code.
Here is a Linux amd64 build: https://7429-33258973-gh.circle-artifacts.com/0/tmp/circle-artifacts.cD2V8ru/telegraf.gz
Thank you. I will try to get that installed (although perhaps tomorrow).
Some additional good news: I was able to recreate the panic on a second machine with the same log and my pasted config.
Great, as long as we can reproduce it, it shouldn't be too hard to fix.
I had just enough time to kick off a run before I left for the day.
panic: Recovered in metric.Fields(); m.fields: "agent=\"Mozilla/5.0 (Linux; Android 5.1; Bush Spira D2 5.5\\\\\" Smartphone Build/LMY47D) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/59.0.3071.125 Mobile Safari/537.36\",resp_code=204i,referrer=\"https://www.wanderu.com/en/tripsummary/GBNXP%2CGBCPIGBNXP%2CGBUBWGBNXP3%2C2017-07-13T01%3A40%3A00%2B01%3A00%2C2017-07-13T02%3A20%3A00%2B01%3A00%2C?query=anonymized\",response_time_us=1133i,client_ip=\"10.20.0.208\",auth=\"-\",ident=\"-\",request=\"/v2/auth.json\",http_version=1.1"
Right on! Thanks for the quick turnaround. I look forward to 1.3.4!
I hope to have it out tomorrow. If you need a build of the 1.3 branch with the fix, you can use https://7437-33258973-gh.circle-artifacts.com/0/tmp/circle-artifacts.XAvjpJX/telegraf.gz
Thanks for the help with this bug!
Bug report
Relevant telegraf.conf:
Apologies if not all of this is relevant.
System info:
Telegraf v1.3.3 (git: release-1.3 46db92aad3e68af04f1732598ac89f6b9b11daf8) Amazon Linux latest
Steps to reproduce:
Expected behavior:
Telegraf does not crash
Actual behavior:
After running for some number of hours, telegraf crashes with:
panic: runtime error: slice bounds out of range
Additional info:
I glanced over the code but couldn't quite tell where the error bottomed out. I thought perhaps our custom log pattern was the source of the issue, but the only float field is the HTTP version and the values are all 1.1.