newrelic / go-agent

New Relic Go Agent

emergency memory cleanup of harvester #974

Open nr-swilloughby opened 1 month ago

nr-swilloughby commented 1 month ago

This is a feature addition I had been holding back until I got confirmation that it was in fact the needed solution. I now think it's best to go ahead and add it as a default-off feature anyway: it provides a way to mitigate a memory issue that emerges with no other apparent cause or available fix, and it can serve as a quick stop-gap for the customer until a better solution is found.

This came out of the work on Issue #906, a reported memory leak apparently caused by the agent holding onto log data longer than normal. So far it has only been observed in a single application, for a single customer, under one specific set of circumstances (a Kubernetes environment where memory constraints are a real issue).

One possible reason log event data would be retained beyond its harvest cycle is a problem delivering it to the New Relic back-end collector, since the agent waits for the data to be delivered before discarding it. In that scenario, a network issue or other external problem could indirectly cause the instrumented application's memory to grow too large to be viable.

Above all, we never want instrumentation to unduly affect the operation of the application itself. So if an application reaches the point where there is no other alternative, it stands to reason that we should discard the accumulated event data in the harvester so the app can continue running.

This PR introduces an API call that lets the application set a maximum heap size. If the heap exceeds that value, all of the harvester's data is dropped and an emergency garbage collection and memory release are requested. See the documentation for the function in the PR's deltas for more details.
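For readers who want a feel for the mechanism, here is a rough sketch of how such a heap check could be done in plain Go. This is not the PR's actual code: the `maxHeapBytes` constant, the `watchHeap` function, and the `dropHarvestData` callback are hypothetical names used only for illustration. The standard-library calls `runtime.ReadMemStats`, `runtime.GC`, and `debug.FreeOSMemory` are the pieces an "emergency garbage collection and memory release" would typically rely on.

```go
package memwatch // hypothetical package name for this sketch

import (
	"runtime"
	"runtime/debug"
	"time"
)

// maxHeapBytes is an illustrative threshold; in the real feature the limit
// would be whatever the application passes to the new agent API call.
const maxHeapBytes = 512 * 1024 * 1024 // 512 MiB

// watchHeap periodically samples the heap and, when the limit is exceeded,
// discards buffered harvester data and asks the runtime for an emergency
// garbage collection and memory release.
func watchHeap(dropHarvestData func()) {
	ticker := time.NewTicker(15 * time.Second)
	defer ticker.Stop()
	for range ticker.C {
		var m runtime.MemStats
		runtime.ReadMemStats(&m)
		if m.HeapAlloc > maxHeapBytes {
			dropHarvestData()    // discard accumulated event data
			runtime.GC()         // force a garbage collection pass
			debug.FreeOSMemory() // return freed memory to the OS where possible
		}
	}
}
```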

nr-swilloughby commented 4 weeks ago

I think we should look at whether we want to allow more control over what memory is released here. The only case we've found so far seems to be caused by memory issues outside the agent itself, so we're really just providing a tool to help an application let go of resources to avoid a worse problem.
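As a sketch of what "more control" could look like, an options-style configuration might let the application choose which categories of harvester data are eligible to be dropped. None of these names exist in the agent today; this is purely a hypothetical shape for the discussion.

```go
package memwatch // same hypothetical package as the sketch above

// EmergencyCleanupOptions is a hypothetical configuration struct showing one
// way finer-grained control over the emergency cleanup could be expressed.
type EmergencyCleanupOptions struct {
	MaxHeapBytes   uint64 // heap size that triggers the cleanup
	DropLogEvents  bool   // discard buffered log events
	DropSpanEvents bool   // discard buffered span events
	DropCustomData bool   // discard buffered custom events and metrics
	FreeOSMemory   bool   // also ask the runtime to return memory to the OS
}
```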