newrelic / go-agent

New Relic Go Agent

internal/jsonx.AppendString causing panic in v3.20.0 #607

Closed: nicoroy2561 closed this issue 1 year ago

nicoroy2561 commented 1 year ago

Description

As per the title, my production app just ran into a panic on v3.20.0. Downgrading to v3.19.2 for the time being, as that version has worked fine so far.

Attaching panic details:

```
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x8c64b9]

goroutine 631981 [running]:
github.com/newrelic/go-agent/v3/internal/jsonx.AppendString(0xc000b5baf0?, {0x0, 0x3})
	/home/user/go/pkg/mod/github.com/newrelic/go-agent/v3@v3.20.0/internal/jsonx/encode.go:25 +0x59
github.com/newrelic/go-agent/v3/newrelic.(*jsonFieldsWriter).stringField(0xc000b5baf0, {0xfb4278?, 0xfb4278?}, {0x0, 0x3})
	/home/user/go/pkg/mod/github.com/newrelic/go-agent/v3@v3.20.0/newrelic/json_object_writer.go:34 +0x45
github.com/newrelic/go-agent/v3/newrelic.agentAttributesJSON(0xc00cb8cb88, 0xc005fe3680, 0x1)
	/home/user/go/pkg/mod/github.com/newrelic/go-agent/v3@v3.20.0/newrelic/attributes_from_internal.go:465 +0x177
github.com/newrelic/go-agent/v3/newrelic.(*txnEvent).WriteJSON(0xc00775f380, 0xc005fe3680)
	/home/user/go/pkg/mod/github.com/newrelic/go-agent/v3@v3.20.0/newrelic/txn_events.go:55 +0x30c
github.com/newrelic/go-agent/v3/newrelic.(*analyticsEvents).CollectorJSON(0xc00e4026f0, {0xc004165800, 0x7a})
	/home/user/go/pkg/mod/github.com/newrelic/go-agent/v3@v3.20.0/newrelic/analytics_events.go:118 +0x256
github.com/newrelic/go-agent/v3/newrelic.(*txnEvents).Data(0xc010437f20?, {0xc004165800?, 0x0?}, {0x0?, 0x1877698?, 0x18a4f40?})
	/home/user/go/pkg/mod/github.com/newrelic/go-agent/v3@v3.20.0/newrelic/txn_events.go:174 +0x25
github.com/newrelic/go-agent/v3/newrelic.(*app).doHarvest(0xc001e2d900, 0xc000b13fd0?, {0xd52c97?, 0xc009744cd2?, 0x19c0f60?}, 0xc001bb8a00)
	/home/user/go/pkg/mod/github.com/newrelic/go-agent/v3@v3.20.0/newrelic/internal_app.go:74 +0x15c
created by github.com/newrelic/go-agent/v3/newrelic.(*app).process
	/home/user/go/pkg/mod/github.com/newrelic/go-agent/v3@v3.20.0/newrelic/internal_app.go:228 +0x3ae
```
iamemilio commented 1 year ago

Thanks for reporting, we will fix this ASAP

iamemilio commented 1 year ago

@nicoroy2561 Would it be possible to provide me with a reproducer?

nicoroy2561 commented 1 year ago

> @nicoroy2561 Would it be possible to provide me with a reproducer?

Sorry, the stack trace I posted is all I got from the crash. It's not even long enough for me to tell where it originates from. It's also impossible to reproduce because it only happened in production; I never encountered it while testing.

iamemilio commented 1 year ago

This is going to be a tricky one. The library the panic triggered in hasn't changed in seven years, though, so I doubt the version matters. It must just be an edge case that flew under the radar for quite some time.

iamemilio commented 1 year ago

I was not able to reproduce this, no matter what I tried. I did get some nice additional coverage out of it though 😄. https://github.com/newrelic/go-agent/pull/610. I am totally stumped right now. How often does this error occur in your environment? If it happens again, can you try to capture any additional information that could give us a clue as to why this might be happening?

nicoroy2561 commented 1 year ago

Hi, sorry for the late response.

To answer your questions first: I got this error only once on my live server. If what you're saying is true and that library has not changed across versions, I'm afraid it might just randomly happen again.

I took some time to look properly at both the code causing the panic and the stack trace, and I noticed something that seems quite weird to me: `github.com/newrelic/go-agent/v3/internal/jsonx.AppendString(0xc000b5baf0?, {0x0, 0x3})`

The very first line of the stack trace is just... weird. Leaving aside the '?', which apparently is normal, the next values are the fields of the string's underlying header, and they show a length of 0x3 but a nil data pointer (0x0). That makes no sense: it shouldn't be possible to have a string whose length is bigger than its underlying byte array. And that's likely what's causing the issue, because I'd expect len() to look only at the length field and never check the actual pointer.
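A minimal sketch of that hypothesis (the header manipulation below is purely illustrative; it is not how the agent produced the value, which remains unknown): a string header with a nil data pointer and a non-zero length passes len() but faults on any read through the pointer, exactly like the panic in the trace.

```go
package main

import (
	"fmt"
	"unsafe"
)

func main() {
	// A Go string header is a (data pointer, length) pair. The runtime
	// normally guarantees the pointer is valid for `length` bytes; the
	// {0x0, 0x3} value in the stack trace violates that invariant.
	type stringHeader struct {
		data unsafe.Pointer
		len  int
	}

	var corrupted string
	hdr := (*stringHeader)(unsafe.Pointer(&corrupted))
	hdr.data = nil // nil data pointer...
	hdr.len = 3    // ...but a length of 3, as seen in the trace

	fmt.Println(len(corrupted)) // prints 3: len() reads only the length field
	fmt.Println(corrupted[0])   // faults: invalid memory address or nil pointer dereference
}
```

This doesn't explain how such a value got into the attribute map; it only shows that a {0x0, 0x3} header behaves exactly as described above.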

Since the problematic string originates from func agentAttributesJSON(a *attributes, buf *bytes.Buffer, d destinationSet), which takes the string from a.Agent, of type map[string]agentAttributeValue, it makes me think something somewhere is using unsafe or C code (cgo) to put such an erroneous value into that map.

nicoroy2561 commented 1 year ago

Perhaps a recover could be added to func (app *app) doHarvest(h *harvest, harvestStart time.Time, run *appRun) so that, when it's spawned by func (app *app) process(), it doesn't crash the whole application but instead skips harvesting data that one time.

I'd much rather get the stack trace and a message like "New Relic data harvest failed" than an app-wide panic, especially if the agent can then go on harvesting data later (skipping the problematic item?).
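A minimal, self-contained sketch of the recover pattern being suggested here; the function name, signature, simulated failure, and log message are all illustrative, not the agent's actual API:

```go
package main

import (
	"log"
	"time"
)

// doHarvest stands in for the agent's harvest routine; the name and
// signature here are hypothetical.
func doHarvest(harvestStart time.Time) {
	// A deferred recover turns a panic during payload serialization into a
	// logged error, so the caller's goroutine keeps running.
	defer func() {
		if r := recover(); r != nil {
			log.Printf("harvest at %v panicked, dropping this cycle's data: %v", harvestStart, r)
		}
	}()

	// Simulate a panic while building the harvest payload.
	var payload []byte
	_ = payload[0] // index out of range on a nil slice
}

func main() {
	doHarvest(time.Now())
	log.Println("application still running after the failed harvest")
}
```

With a guard like this, a bad attribute value would cost one harvest cycle's data instead of bringing down the whole process.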

iamemilio commented 1 year ago

Yeah, putting a recover in would be a good stopgap. Hmm, let me double-check everything. That's a good point about the nil pointer; it might be a good lead for us, but I'm not sure we use the unsafe library for anything. I'll have to double-check that.

nicoroy2561 commented 1 year ago

If it's of any help, I am still not seeing this issue after downgrading to v3.19.2. It might be that the code added since then is perfectly safe and unrelated and this is just luck, but if I had to start somewhere, the newly changed code is where I'd start looking.

iamemilio commented 1 year ago

We have not made any changes to that code path for quite a while. I'm not sure whether your application is doing any unsafe pointer modification, but I have not been able to find a culprit in the Go agent yet. We will release a panic recover that protects the harvest process from crashing fatally in this week's release, and we recommend you upgrade to it. I'll keep this issue in the back of my mind, but I have to time-box this investigation for the time being... If you get a log message reporting a panic in the new release, please @ me here and we will re-assess things. Harvest panics should no longer crash your application, but will result in that harvest's data getting dropped.

Leaving this open: mitigated, not resolved.

nicoroy2561 commented 1 year ago

I just checked and couldn't find any unsafe usage in my application (or its dependencies). I'm sure there's got to be something somewhere; either that, or I have no idea how such a weird data structure could be produced.

I'll let you know if I get any other stack traces. Will update the lib as soon as I see the commit :)

Thank you for the help!

iamemilio commented 1 year ago

No problem, this is one of the most bizarre bugs I have seen in a long time haha. I hope it gets resolved and thanks for all the help! There will be a release this week with the patch, guaranteed.