
New Relic Go Agent
Apache License 2.0

Memory leak in Go 1.17 (http.Transport) #447

Closed: netrebel closed this issue 2 years ago

netrebel commented 2 years ago

Description

We started seeing a memory leak issue when we upgraded from Go 1.11.3 (no modules) to 1.17 (with go-modules). In the following graph memory utilization can be seen growing right after the upgrade on Dec 10. We had been running the agent (github.com/newrelic/go-agent/v3 v3.15.2) in Go 1.11.3 without any issues.

[graph: memory utilization growing after the Dec 10 upgrade]

This is a high-throughput service. [graph: aggregated throughput across all our instances]

When we ran pprof on one of the instances, we saw something similar to the comment in go/issues/43966:

      flat  flat%   sum%        cum   cum%
    1.50MB  5.07%  5.07%    12.04MB 40.70%  net/http.(*Transport).dialConn
         0     0%  5.07%    12.04MB 40.70%  net/http.(*Transport).dialConnFor
    8.03MB 27.14% 32.21%     8.03MB 27.14%  bufio.NewWriterSize (inline)
    6.02MB 20.36% 52.57%     6.02MB 20.36%  bufio.NewReaderSize (inline)

The workaround seems to be to create a new http.Transport each time. Looking at the code, the problem could be that collectorDefaultTransport is a global variable and is therefore reused: https://github.com/newrelic/go-agent/blob/master/v3/newrelic/transport.go#L18

After removing the New Relic agent from our application, we confirmed that the memory leak is gone.

Steps to Reproduce

We are not able to reproduce this in our development environment due to the low traffic there. In theory, though, it could be reproduced in an application with a similar load running Go 1.17 and agent version github.com/newrelic/go-agent/v3 v3.15.2.

Expected Behavior

Memory utilization should remain constant.

Your Environment

Running on Ubuntu:

Distributor ID: Ubuntu
Description:    Ubuntu 14.04.5 LTS
Release:    14.04
Codename:   trusty

Additional context

There's a PR that potentially addresses this issue in Go: https://github.com/golang/go/pull/50799

But other libraries we use do not use a global variable for http.Transport and therefore do not trigger the memory leak. I don't know if that is the standard way to work with http.Transport.

nr-swilloughby commented 2 years ago

Thank you for bringing this to our attention. We're looking into this and what the best way to mitigate this would be until the memory leak in http.Transport is fixed.

rajuGT commented 2 years ago

@netrebel Thanks for reporting.

@nr-swilloughby We also encountered this issue recently, after moving our project from Go 1.13 to 1.15.

[memory utilization graph]

jeffyanta commented 2 years ago

We've managed to mitigate the issue by overriding the transaction pointer with an empty struct:

defer func() {
    txn.End()
    *txn = newrelic.Transaction{}
}()

Results from a high-volume service:

[graph: memory usage before and after the mitigation]

rittneje commented 2 years ago

@netrebel Based on the PR description, https://github.com/golang/go/pull/50799 is resolving a leak that occurs when interacting with an arbitrary number of hostnames. But (I assume?) this library would only be interacting with one hostname. So that might be a red herring. One way to confirm would be to take the change from https://github.com/golang/go/pull/50799 and manually apply it to your standard library, and see if it fixes your leak.

jeffyanta commented 2 years ago

Is there an update on this issue? We've managed to mitigate the memory footprint as described earlier. However, we're noticing CPU usage increase linearly over time on a service that processes periodic tasks. We verified that removing the following line of code resolves the issue:

ctx = newrelic.NewContext(ctx, txn)

Graph showing the before and after effect of the change:

[graph: CPU usage before and after removing the NewContext call]

nr-swilloughby commented 2 years ago

Interesting data re: CPU increase. We'll look into that as well as the mitigation on the memory issue ASAP.

saurabhjain940 commented 2 years ago

Hi Team,

Do we have any update on this issue? We are also facing the same problem.

iamemilio commented 2 years ago

TBH, we expected Golang to have patched this a long time ago. We will reassess and re-prioritize this though, since it seems it is not going away any time soon.

sandip-bhoi-od commented 2 years ago

We are using go 1.18 with github.com/newrelic/go-agent/v3 v3.17.0 and seeing the OOM error.

[graph: memory usage leading to OOM]

We created a utility function to create the newrelic segments within transaction.

func logNewRelicSegment(ctx context.Context, query string, sqlStatement string, dbconfig dbConfig) *newrelic.DatastoreSegment {
    // logging should be non-disruptive; any panic during this is absorbed
    defer func() {
        if r := recover(); r != nil {
            log.Debug(ctx, fmt.Sprintf("Panic absorbed: %v", r))
        }
    }()
    txn := newrelic.FromContext(ctx)
    operation := extractOperation(ctx, query)
    collection := extractTableName(ctx, operation, query)
    seg := &newrelic.DatastoreSegment{
        StartTime:          txn.StartSegmentNow(),
        Product:            newrelic.DatastorePostgres,
        Collection:         collection,
        Operation:          operation,
        ParameterizedQuery: sqlStatement,
        Host:               dbconfig.Host,
        PortPathOrID:       strconv.Itoa(dbconfig.Port),
        DatabaseName:       dbconfig.Name,
    }
    return seg
}

The segment is closed by the caller function as below:

// new relic segment here
seg := logNewRelicSegment(ctx, query, sqlStatement, txn.Conn.DBConfig)
defer seg.End()

sandip-bhoi-od commented 2 years ago

@iamemilio I wanted to bring to your attention that we are using the latest Go version, 1.18, with New Relic Go agent v3.17.0.

nr-swilloughby commented 2 years ago

Thanks for the additional information.

iamemilio commented 2 years ago

We've managed to mitigate the issue by overriding the transaction pointer with an empty struct:

defer func() {
  txn.End()
  *txn = newrelic.Transaction{}
}()

Results from a high-volume service: [graph: memory usage after the mitigation]

My thought on this at first glance is that the agent should not destroy transaction objects, even after they have been ended. There may be metadata or info that users want to gather from a completed transaction. So, if you create a bunch of transactions, they will retain their data if they are not deallocated. I would not consider this specific case a memory leak. I think your solution is appropriate for your specific usage pattern, though. Would users want an explicit function that can be called to deallocate transaction objects, something like txn.Close()?

cc @jeffyanta

iamemilio commented 2 years ago

We are still looking into resolving the memory leak regarding the http.Transport object, and digging into the cpu issues you were seeing when calling ctx = newrelic.NewContext(ctx, txn).

rittneje commented 2 years ago

My thought on this at first glance, is that the agent should not destroy transaction objects, even after they have been ended. There may be metadata or info that users want to gather from a completed transaction. So, if you create a bunch of transactions, they will retain their data if they are not deallocated.

@iamemilio Why is the garbage collector alone not reclaiming memory once the transaction falls out of scope?

iamemilio commented 2 years ago

Is that what is happening? I would need to see how the instrumentation code is written in your application to verify that. On the surface, it doesn't seem like we're doing anything to make that data persist, but I would need more info to confirm. When a transaction is ended, the data we collect for it is finalized and dumped into a cache that is drained periodically when data payloads are sent to our servers. That is the only thing calling txn.End() does. If a pointer to the transaction object stays in scope, the garbage collector cannot safely deallocate it, and we do not explicitly deallocate it, for the reasons outlined above.

The same logic applies to contexts as well. newrelic.NewContext() is not doing anything unusual; it just adds a key-value pair with a reference to a transaction pointer to a context. The context should be owned and managed by the application, though, and it is 100% up to the caller of NewContext() to make sure the context is eventually released.

// NewContext returns a new context.Context that carries the provided
// transaction.
func NewContext(ctx context.Context, txn *Transaction) context.Context {
    return context.WithValue(ctx, internal.TransactionContextKey, txn)
}
zfedoran commented 2 years ago

Hi @iamemilio thanks for giving this a bit more attention but I’m not sure this is related to anything particular on the end user side.

There are multiple codebases reporting this problem, so it appears to be something fundamental rather than something about how an individual codebase is using the New Relic client.

iamemilio commented 2 years ago

Agreed, I am definitely not saying that there is not a memory leak. We know that the issue with http.Transport is impacting us. We are going to address that first, and then we can loop back and see if the other issues persist.

I do want to rule out the usage pitfalls I outlined above though.

iamemilio commented 2 years ago

We were able to capture a slow linear growth in memory, on the order of 0.2 MB per hour, in our own testing when running v3/examples/server under simulated load overnight and hitting the root endpoint. This creates a transaction and then injects it into a context with a wrapped HTTP handler function. We are going to look into it further.

[graph: slow linear memory growth in the overnight load test]

iamemilio commented 2 years ago

Folks who were seeing very drastic memory growth, I have not been able to reproduce that yet. Would you be willing to provide a reproducer?

sandip-bhoi-od commented 2 years ago

@iamemilio We also have a support case created with New Relic. Please let me know if we can have a working session. We can try reproducing this issue again.

iamemilio commented 2 years ago

Yes, we can set that up. Please ask the support engineer who is handling your case to set a call up with the go-team, and we can take it from there! Thanks.

Ak-x commented 2 years ago

This issue is now being prioritized by our team; to investigate the problem further and implement a solution.

iamemilio commented 2 years ago

Memory Leak Root Cause Analysis for Datastore Segments

Hi all, thanks so much for working with us on this issue and being patient. We worked alongside one of the original reporters of the issue and were able to completely remediate it in their environment. We wanted to post a formal writeup here to explain what the problem was, how we fixed it, and what the next troubleshooting steps for all of you should be.

Root Cause

Here is a very simplified pseudo-code example of what caused the issue.

func mainLoop(app *newrelic.Application, db dbConfig, jobs chan Job) {
    ctx := context.TODO()                          (1)

    for {
        select {
        case <-ctx.Done():
            return
        case job := <-jobs:
            doDataTransaction(ctx, app, db, job)   (2)
        }
    }
}

func doDataTransaction(ctx context.Context, app *newrelic.Application, db dbConfig, job Job) {
    txn := app.StartTransaction("my data transaction")
    ctx = newrelic.NewContext(ctx, txn)            (3)

    for _, dataOperation := range job.Batch() {
        doBusinessLogicSegment(ctx, db, dataOperation)
    }

    txn.End()                                      (4)
}

(1) A context is created to keep track of a flow of execution. This context is created at a very high code scope and exists outside the scope where transactions get created.

(2) This context gets passed into each function that executes a transaction with some business logic on a datastore.

(3) The doDataTransaction function creates a transaction, then calls NewContext to write the transaction pointer to a context. This function is intentionally named NewContext because it calls context.WithValue() which creates a copy of the original context with an added key value pair containing a pointer to the transaction object. By replacing the original context with this newly created context containing the transaction pointer, we are effectively passing the transaction up the stack to the mainLoop().

(4) The business logic is executed and the transaction is ended. By design, this does not explicitly delete the transaction data on the heap, in case the metadata needs to be accessed after the transaction ends.

Summary

The assumption being made is that the transaction is out of scope once it is ended in doDataTransaction and will be garbage collected. That is not the case, though. Remember that in step 3 we added a pointer to the transaction data to a context that is part of the mainLoop() of our application. As long as the context object in the main loop is in scope, the Go garbage collector sees a variable still referencing the transaction's data and cannot free its memory on the heap.

Solution Space

Canonically, a context should be used as a way to pass around "common scoped" data within an application. It's important to be aware that data can never be removed from a context, and that data stored inside a context stays reachable for as long as the context object itself does. For these reasons, New Relic recommends against storing transaction data on context objects that exist outside the scope of a transaction.

If using contexts is important to the flow of your application, we recommend deriving a new context object for each transaction. This propagates the original context through the chain of function calls safely, without leaking transaction data. Code outside the scope of the current transaction does not need access to this transaction's data, so writing it to a context object that outlives the transaction does not make sense.

func mainLoop(app *newrelic.Application, db dbConfig, jobs chan Job) {
    ctx := context.TODO()

    for {
        select {
        case <-ctx.Done():
            return
        case job := <-jobs:
            doDataTransaction(ctx, app, db, job)
        }
    }
}

func doDataTransaction(ctx context.Context, app *newrelic.Application, db dbConfig, job Job) {
    txn := app.StartTransaction("my data transaction")

    // derive a new context to pass common data to code handling segments of this transaction
    txnContext := newrelic.NewContext(ctx, txn)

    for _, dataOperation := range job.Batch() {
        doBusinessLogicSegment(txnContext, db, dataOperation)
    }

    txn.End()
}

If your app does not access the transaction data from a context, then it really doesn't belong there in the first place. In this case you can just pass the pointer down through function arguments.

func mainLoop(app *newrelic.Application, db dbConfig, jobs chan Job) {
    ctx := context.TODO()

    for {
        select {
        case <-ctx.Done():
            return
        case job := <-jobs:
            doDataTransaction(ctx, app, db, job)
        }
    }
}

func doDataTransaction(ctx context.Context, app *newrelic.Application, db dbConfig, job Job) {
    txn := app.StartTransaction("my data transaction")

    for _, dataOperation := range job.Batch() {
        // pass the transaction pointer directly to the functions handling segments of this transaction
        doBusinessLogicSegment(ctx, db, dataOperation, txn)
    }

    txn.End()
}

New Relic strongly recommends following the best practices outlined above. We understand that some users may assume txn.End() removes all traces of transaction data, when in fact it only signals the end of the transaction. The behavior of txn.End() is designed to retain that data, and that behavior needs to be preserved; however, a possible future enhancement could be to add another method, such as txn.Close(), that can be called to remove the data explicitly. If this is something the community enthusiastically wants, the Go agent team can consider it, but there are trade-offs with that approach too. It seems like the Go authors thought of this exact scenario when designing the context library, and it's probably best to follow the guidelines and best practices they laid out:

Incoming requests to a server should create a Context, and outgoing calls to servers should accept a Context. The chain of function calls between them must propagate the Context, optionally replacing it with a derived Context created using WithCancel, WithDeadline, WithTimeout, or WithValue. When a Context is canceled, all Contexts derived from it are also canceled.

The WithCancel, WithDeadline, and WithTimeout functions take a Context (the parent) and return a derived Context (the child) and a CancelFunc. Calling the CancelFunc cancels the child and its children, removes the parent's reference to the child, and stops any associated timers. Failing to call the CancelFunc leaks the child and its children until the parent is canceled or the timer fires. The go vet tool checks that CancelFuncs are used on all control-flow paths.

iamemilio commented 2 years ago

We are going to consider this resolved for now. If you have gone through the troubleshooting steps outlined above, and are still having this problem in your environment, please re-open this ticket with information about your specific usage pattern and what you observed. Thanks!