Closed netrebel closed 2 years ago
Thank you for bringing this to our attention. We're looking into this and what the best way to mitigate this would be until the memory leak in http.Transport is fixed.
@netrebel Thanks for reporting.
@nr-swilloughby we also encountered this issue recently, once we moved our project from Go 1.13 to 1.15.
We've managed to mitigate the issue by overriding the transaction pointer with an empty struct:
```go
defer func() {
	txn.End()
	*txn = newrelic.Transaction{}
}()
```
Results from a high-volume service:
@netrebel Based on the PR description, https://github.com/golang/go/pull/50799 is resolving a leak that occurs when interacting with an arbitrary number of hostnames. But (I assume?) this library would only be interacting with one hostname. So that might be a red herring. One way to confirm would be to take the change from https://github.com/golang/go/pull/50799 and manually apply it to your standard library, and see if it fixes your leak.
Is there an update on this issue? We've managed to mitigate the memory footprint as described earlier. However, we're noticing CPU usage increase linearly over time in a service that processes periodic tasks. We verified that removing the following line of code resolves the issue:
ctx = newrelic.NewContext(ctx, txn)
Graph showing the before and after effect of the change:
Interesting data re: CPU increase. We'll look into that as well as the mitigation on the memory issue ASAP.
Hi Team,
Do we have any update on this issue? We are also facing the same problem.
TBH, we expected Golang to have patched this a long time ago. We will re-assess and re-prioritize this though, since it seems to not be going away any time soon.
We are using go 1.18 with github.com/newrelic/go-agent/v3 v3.17.0 and seeing the OOM error.
We created a utility function to create the newrelic segments within transaction.
```go
func logNewRelicSegment(ctx context.Context, query string, sqlStatement string, dbconfig dbConfig) *newrelic.DatastoreSegment {
	// Logging should be non-disruptive; any error during this is absorbed.
	defer func() {
		if r := recover(); r != nil {
			log.Debug(ctx, fmt.Sprintf("Panic absorbed: %v", r))
		}
	}()
	txn := newrelic.FromContext(ctx)
	operation := extractOperation(ctx, query)
	collection := extractTableName(ctx, operation, query)
	seg := &newrelic.DatastoreSegment{
		StartTime:          txn.StartSegmentNow(),
		Product:            newrelic.DatastorePostgres,
		Collection:         collection,
		Operation:          operation,
		ParameterizedQuery: sqlStatement,
		Host:               dbconfig.Host,
		PortPathOrID:       strconv.Itoa(dbconfig.Port),
		DatabaseName:       dbconfig.Name,
	}
	return seg
}
```
The segment is closed by the caller function as below:

```go
// New Relic segment here.
seg := logNewRelicSegment(ctx, query, sqlStatement, txn.Conn.DBConfig)
defer seg.End()
```
@iamemilio I wanted to bring to your attention that we are using the latest Go version, 1.18, and newrelic go agent v3.17.0.
Thanks for the additional information.
We've managed to mitigate the issue by overriding the transaction pointer with an empty struct:
```go
defer func() {
	txn.End()
	*txn = newrelic.Transaction{}
}()
```
Results from a high-volume service:
My thought on this at first glance is that the agent should not destroy transaction objects, even after they have been ended. There may be metadata or info that users want to gather from a completed transaction. So, if you create a bunch of transactions, they will retain their data if they are not deallocated. I would not consider this specific case a memory leak. I think your solution is appropriate for your specific usage pattern, though. Would users want an explicit function that can be called to deallocate transaction objects, something like txn.Close()?
cc @jeffyanta
We are still looking into resolving the memory leak regarding the http.Transport object, and digging into the CPU issues you were seeing when calling ctx = newrelic.NewContext(ctx, txn).
My thought on this at first glance, is that the agent should not destroy transaction objects, even after they have been ended. There may be metadata or info that users want to gather from a completed transaction. So, if you create a bunch of transactions, they will retain their data if they are not deallocated.
@iamemilio Why is the garbage collector alone not reclaiming memory once the transaction falls out of scope?
Is that what is happening? I would need to see the way the instrumentation code is written in your application to verify that. On the surface, it doesn't seem like we're doing anything to make that data persist, but I would need more info to verify that. When a transaction is ended, the data we collect for it is finalized and dumped into a cache that is drained periodically when data payloads are sent to our servers. That is all txn.End() does. If the transaction object pointer stays in scope, it may not be safely deallocatable by the garbage collector, and we do not explicitly deallocate it for the reasons outlined above.
The same logic applies to contexts as well. newrelic.NewContext() is not really doing anything crazy; it just adds a key-value pair with a reference to a transaction pointer to a context. The context should be owned and managed by the application, though, and it is 100% up to the caller of NewContext() to make sure the context gets closed.
```go
// NewContext returns a new context.Context that carries the provided
// transaction.
func NewContext(ctx context.Context, txn *Transaction) context.Context {
	return context.WithValue(ctx, internal.TransactionContextKey, txn)
}
```
Hi @iamemilio thanks for giving this a bit more attention but I’m not sure this is related to anything particular on the end user side.
There are multiple codebases reporting this problem, so it appears to be something fundamental rather than something about how an individual codebase uses the newrelic client.
Agreed, I am definitely not saying that there is not a memory leak. We know that the issue with http.Transport is impacting us. We are going to address that first, and then we can loop back and see if the other issues persist.
I do want to rule out the usage pitfalls I outlined above though.
We were able to capture a slow linear growth in memory, on the order of 0.2 MB per hour, in our own testing when running the v3/examples/server under simulated load overnight and hitting the root endpoint. This creates a transaction and then injects it into a context with a wrapped http handler function. We are going to look into it further.
Folks who were seeing very drastic memory growth, I have not been able to reproduce that yet. Would you be willing to provide a reproducer?
@iamemilio We had a support case as well created with newrelic. Please let me know if we can have working session. We can try reproducing this issue again.
Yes, we can set that up. Please ask the support engineer who is handling your case to set a call up with the go-team, and we can take it from there! Thanks.
This issue is now being prioritized by our team, to investigate the problem further and implement a solution.
Hi all, thanks so much for working with us on this issue, and for being patient. We worked alongside one of the original reporters of the issue and were able to completely remediate it in their environment. We wanted to post a formal writeup here to explain what the problem was, how we fixed it, and what the next troubleshooting steps for all of you should be.
Here is a very simplified pseudo-code example of what caused the issue.
```go
func mainLoop(app *newrelic.Application, db dbConfig, job chan Job) {
	ctx := context.TODO() // (1)
	for {
		select {
		case <-ctx.Done():
			return
		case <-job:
			doDataTransaction(ctx, app, job) // (2)
		}
	}
}

func doDataTransaction(ctx context.Context, app *newrelic.Application, db dbConfig, job Job) {
	txn := app.StartTransaction("my data transaction")
	ctx = newrelic.NewContext(ctx, txn) // (3)
	for _, dataOperation := range job.Batch() {
		doBusinessLogicSegment(ctx, db, dataOperation)
	}
	txn.End() // (4)
}
```
(1) A context is created to keep track of a flow of execution. This context is created at a very high code scope and exists outside the scope where transactions get created.
(2) This context gets passed into each function that executes a transaction with some business logic on a datastore.
(3) The doDataTransaction function creates a transaction, then calls NewContext to write the transaction pointer to a context. This function is intentionally named NewContext because it calls context.WithValue(), which creates a copy of the original context with an added key-value pair containing a pointer to the transaction object. By replacing the original context with this newly created context containing the transaction pointer, we are effectively passing the transaction up the stack to the mainLoop().
(4) The business logic is executed and the transaction is ended. This does not explicitly delete the transaction data on heap by design, in case the metadata needs to be accessed after the transaction ends.
The assumption being made is that the transaction is out of scope once it is ended in doDataTransaction, and that it will be garbage collected. This is not the case, though. Remember that in step 3, we added a pointer to the transaction data into our context, which is part of the mainLoop() of our application. As long as the context object in the main loop is in scope, the Go garbage collector sees that a variable still in scope references the data for a given transaction, and it is unable to free that memory on the heap.
Canonically, a context should be used as a way to pass around "common scoped" data within an application. It's important to be aware of the fact that data can never be removed from contexts, and that data stored inside a context is in scope as long as that context object is also in scope. For these reasons, New Relic recommends against storing transaction data on context objects that exist outside the scope of a transaction.
If using contexts is important to the flow of your application, we recommend that you derive a new context object for each transaction. This propagates the original context through the chain of function calls safely, without leaking transaction data. Code that is outside the scope of the current transaction does not need access to this transaction's data, so writing it to a context object that outlives the transaction does not make sense.
```go
func mainLoop(app *newrelic.Application, db dbConfig, job chan Job) {
	ctx := context.TODO()
	for {
		select {
		case <-ctx.Done():
			return
		case <-job:
			doDataTransaction(ctx, app, job)
		}
	}
}

func doDataTransaction(ctx context.Context, app *newrelic.Application, db dbConfig, job Job) {
	txn := app.StartTransaction("my data transaction")
	// Derive a new context to pass common data to code handling segments of this transaction.
	txnContext := newrelic.NewContext(ctx, txn)
	for _, dataOperation := range job.Batch() {
		doBusinessLogicSegment(txnContext, db, dataOperation)
	}
	txn.End()
}
```
If your app does not access the transaction data from a context, then it really doesn't belong there in the first place. In this case you can just pass the pointer down through function arguments.
```go
func mainLoop(app *newrelic.Application, db dbConfig, job chan Job) {
	ctx := context.TODO()
	for {
		select {
		case <-ctx.Done():
			return
		case <-job:
			doDataTransaction(ctx, app, job)
		}
	}
}

func doDataTransaction(ctx context.Context, app *newrelic.Application, db dbConfig, job Job) {
	txn := app.StartTransaction("my data transaction")
	for _, dataOperation := range job.Batch() {
		// Pass the transaction pointer directly to the functions handling segments of this transaction.
		doBusinessLogicSegment(ctx, db, dataOperation, txn)
	}
	txn.End()
}
```
New Relic strongly recommends following the best practices outlined above. We understand that some users may assume that txn.End() removes all traces of transaction data, when in fact it only signals the end of the transaction. The behavior of txn.End() is designed to retain that data, and that behavior needs to be preserved; however, a possible future enhancement could be to add another method, such as txn.Close(), which could be called to remove the data explicitly. If this is something the community enthusiastically wants, the go team can consider it, but there are trade-offs with this approach too. It seems like the Go authors thought of this exact scenario when designing the context library, and it's probably best to follow the guidelines and best practices they laid out:
Incoming requests to a server should create a Context, and outgoing calls to servers should accept a Context. The chain of function calls between them must propagate the Context, optionally replacing it with a derived Context created using WithCancel, WithDeadline, WithTimeout, or WithValue. When a Context is canceled, all Contexts derived from it are also canceled.
The WithCancel, WithDeadline, and WithTimeout functions take a Context (the parent) and return a derived Context (the child) and a CancelFunc. Calling the CancelFunc cancels the child and its children, removes the parent's reference to the child, and stops any associated timers. Failing to call the CancelFunc leaks the child and its children until the parent is canceled or the timer fires. The go vet tool checks that CancelFuncs are used on all control-flow paths.
We are going to consider this resolved for now. If you have gone through the troubleshooting steps outlined above, and are still having this problem in your environment, please re-open this ticket with information about your specific usage pattern and what you observed. Thanks!
Description
We started seeing a memory leak issue when we upgraded from Go 1.11.3 (no modules) to 1.17 (with go-modules). In the following graph memory utilization can be seen growing right after the upgrade on Dec 10. We had been running the agent (github.com/newrelic/go-agent/v3 v3.15.2) in Go 1.11.3 without any issues.
This is a high throughput service, the aggregated throughput of all our instances is:
When we ran pprof on one of the instances, we saw something similar to the comment in go/issues/43966. The workaround seems to be to create a new http.Transport each time. Looking at the code, the problem could be that collectorDefaultTransport is a global variable and is therefore reused: https://github.com/newrelic/go-agent/blob/master/v3/newrelic/transport.go#L18. After removing the New Relic agent from our application, we confirmed that the memory leak is gone.
Steps to Reproduce
We are not able to reproduce this in our development environment due to the low traffic there. But in theory, this could be reproduced in an application with a similar load running on Go 1.17 and NR agent version github.com/newrelic/go-agent/v3 v3.15.2.
Expected Behavior
Memory utilization should remain constant
Your Environment
Running on Ubuntu:
Additional context
There's a PR that potentially addresses this issue in Go: https://github.com/golang/go/pull/50799
But other libraries we use are not using a global variable for http.Transport and therefore do not trigger the memory leak. I don't know if that is the standard way to work with http.Transport.