Object Allocation Capture at the transaction level

dinsley commented 11 months ago

Is your feature request related to a problem? Please describe.

Better visibility into memory usage and object allocation at the transaction level, more specifically in background jobs, to help track down potential memory leaks.

Feature Description

It would be amazing if there were a way to get rough object allocation counts in background jobs. Scout has some functionality that hooks into the garbage collector's allocation hooks to do rudimentary counting, and then uses that counting to provide visibility into potential areas of memory bloat based on a scoring system.

This is obviously something that would need careful consideration as regardless of how tracking is done it can and will have an impact on performance/memory.

Even just having the rough data available to query would be useful without the UI/Site expanded to have functionality around it.

Additional context

Scout's C-extension for providing the functionality:

https://github.com/scoutapp/scout_apm_ruby/blob/master/ext/allocations/allocations.c

I've done a little bit of prototyping with a similar approach in the later versions of Ruby, but am not too experienced with C or any potential issues with supporting functionality like this long term, so I'm probably missing a lot of context.

Priority

Please help us better understand this feature request by choosing a priority from the following options: [Nice to Have, Really Want, Must Have, Blocker]

workato-integration[bot] commented 11 months ago

https://new-relic.atlassian.net/browse/NR-177584

fallwith commented 11 months ago

Hi @dinsley,

Thanks very much for your feature request submission.

With regard to complexity, I recognize this request as touching on at least these 3 distinct areas:

1. Reporting Ruby virtual machine (VM) statistics
2. Scoping reported statistics to the transaction level
3. Enhancing the CRuby VM itself to gather metrics on a per GC event basis

We don't currently touch on the last 2 items on that list. Scout's allocation extension seems really neat.

As for the first one, we do have an offering that is enabled by default (and can be disabled by setting :disable_vm_sampler to true) to report on VM stats.

The New Relic Ruby agent's VM sampler reports on a good number of RubyVM/ segments whose names are referenced as Ruby constants at the top of the VMSampler class. When CRuby (formerly MRI) is in play, the New Relic Ruby agent's MriVM class is used to fetch the values for use with those segments.

Are you currently getting any value out of these existing VM stats that are being reported and would mostly like to have the same stats available down at the individual transaction level, or are you thinking that even if we were to bring those same stats to the transaction level they might be insufficient without dropping into the C layer?

dinsley commented 11 months ago

Thanks for the reply!

We're not getting much value out of those global values, since it's very difficult to correlate them or get context on any of the data. They give a nice overview of health, but without additional information, or without the metrics being scoped down and snapshotted per executed transaction, they're hard to integrate into our workflows.

I think it would help a lot if it were possible in the future to get the allocation metrics at the transaction (or custom-named) level, especially if the count were based on the context of code executed within a block. I think the performance implications of doing it based on context are a lot higher though, which is why it's being done in C in the Scout example. I'm still doing some digging on that.
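To make the "count within a block" idea concrete, here is a minimal pure-Ruby sketch. It is not how Scout or AppSignal do it (both use C extensions), and `AllocationProbe` is a hypothetical name. It relies only on `GC.stat(:total_allocated_objects)`, which is a process-wide counter in CRuby, so in a multi-threaded worker the number would also include objects allocated by other threads while the block runs.

```ruby
# Hypothetical helper for illustration; not part of any agent or gem.
module AllocationProbe
  # Runs the block and returns [block_result, allocated_object_count].
  # The count is the change in CRuby's process-wide allocation counter,
  # so it is only a rough figure when other threads are active.
  def self.measure
    before = GC.stat(:total_allocated_objects)
    result = yield
    after = GC.stat(:total_allocated_objects)
    [result, after - before]
  end
end

rows, allocated = AllocationProbe.measure do
  Array.new(1_000) { |i| "row-#{i}" }
end
puts "built #{rows.size} rows, allocated roughly #{allocated} objects"
```

The appeal of this form is that it needs no native code; the cost is the lack of per-thread scoping, which may be one reason to drop into C instead.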

AppSignal also tracks allocations the same way within a context at the 'transaction' level, and utilizes a similar approach, but that looks to be for supporting older Ruby versions:

https://github.com/appsignal/appsignal-ruby/blob/main/ext/appsignal_extension.c#L818

They expose it in a similar way to Scout:

fallwith commented 11 months ago

Thanks, @dinsley! We'll post more here if we have any questions or thoughts to share as the feature request gets evaluated.

dinsley commented 11 months ago

Sounds good! We're doing some prototyping with this approach and are going to be testing it in a few of our environments this week to see how it goes performance + accuracy-wise:

I've put the prototype up here: https://github.com/eventtemple/object_allocation_tracker

If interested in offering this, I'd be happy to abstract out the implementation and submit a PR after some more testing. Currently we're sending the results as metrics for jobs processed and have wrapped the execution for tracking.
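For anyone wanting to try something similar, wrapping job execution and sending the count as a metric could look roughly like the sketch below. This is not the prototype's code: `AllocationReporter`, the `Custom/Allocations/...` metric name, and the injectable `recorder` are all assumptions made for illustration. The only agent call used is `NewRelic::Agent.record_metric(name, value)`, and only when the agent is loaded.

```ruby
# Hypothetical wrapper; names and metric scheme are illustrative assumptions.
class AllocationReporter
  # recorder: any callable taking (metric_name, value). Defaults to the
  # New Relic agent when it is loaded, otherwise prints to stderr.
  def initialize(recorder: nil)
    @recorder = recorder || default_recorder
  end

  # Runs the block, then reports how many objects were allocated
  # process-wide while it ran. Returns the block's own result.
  def track(job_name)
    before = GC.stat(:total_allocated_objects)
    yield
  ensure
    allocated = GC.stat(:total_allocated_objects) - before
    @recorder.call("Custom/Allocations/#{job_name}", allocated)
  end

  private

  def default_recorder
    if defined?(::NewRelic::Agent)
      ->(name, value) { ::NewRelic::Agent.record_metric(name, value) }
    else
      ->(name, value) { warn "#{name}=#{value}" }
    end
  end
end

# Usage with a stand-in recorder so the example runs without the agent.
recorded = []
reporter = AllocationReporter.new(recorder: ->(name, value) { recorded << [name, value] })
reporter.track('ExampleJob') { Array.new(100) { Object.new } }
puts recorded.inspect
```

Reporting from `ensure` means a job that raises still gets its allocations recorded, which seems useful when the problem jobs are also the ones that fail.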

I'll add some instructions to the gem after as well if people are interested in sending the data to New Relic in the meantime, after we've verified performance, accuracy, and all that good stuff.

fallwith commented 11 months ago

> If interested in offering this

Everything you've mentioned seems quite universally valuable for all Rubyists and we absolutely would be interested in evaluating any related PRs or proposals!

Code in the main agent is held to these 3 requirements:

independent of any 3rd party gem dependencies
compatible with as many Ruby runtimes as possible (CRuby, JRuby, TruffleRuby, etc.)
compatible with a minimum Ruby version (currently v2.4)

We have occasionally offered extra functionality that breaks from those aims, and we just provide it as an opt-in experience. For example, our newrelic-infinite_tracing gem requires CRuby, the third party grpc gem, and a higher minimum version of Ruby than what the agent requires (because grpc does). So we just leave it as an optional addition to a user's Gemfile.

Given that precedent, I think you have many options for having code in the New Relic agent be available by default, be opt-in, or be delivered by a third party gem that we reference in documentation.

dinsley commented 11 months ago

Thanks for all that information!

We've moved a branch of the object allocation tracker (https://github.com/eventtemple/object_allocation_tracker/tree/threadsafe) into production to test performance implications (so far negligible) and accuracy, and so far, so good after a few days. I'll give it a week and circle back to look at opening a proposal PR, as there are a few pieces I'd need to look into. I know this implementation wouldn't work for the JRuby runtime, but I'm unsure about the others, and there may be a few changes needed for the minimum Ruby version.

We're currently just manually recording the metrics for background jobs into New Relic, and it has been a huge help in isolating a few problem jobs related to bloat that were difficult to replicate locally. There seem to be some discrepancies with the VM metrics being recorded, but I think it's because those VM metrics are recorded as snapshots of the VM's state, so as allocations happen and are cleaned up, the larger state in between snapshots could be missed (which makes sense if I'm correct, but I'm still doing some digging on that side).
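The snapshot hypothesis can be illustrated in isolation with the sketch below. It says nothing about which statistics the agent's sampler actually reads; `gc_snapshot` is a hypothetical helper. It only shows that a point-in-time figure such as `GC.stat(:heap_live_slots)` can end up nearly unchanged after short-lived objects are created and collected, while the cumulative `:total_allocated_objects` counter still records every allocation.

```ruby
# Hypothetical helper: one state-style figure and one cumulative counter.
def gc_snapshot
  {
    live: GC.stat(:heap_live_slots),             # point-in-time state
    allocated: GC.stat(:total_allocated_objects) # ever-increasing counter
  }
end

GC.start
before = gc_snapshot

# Create short-lived garbage between the two snapshots.
50_000.times { Object.new }

GC.start
after = gc_snapshot

puts "change in live slots:        #{after[:live] - before[:live]}"
puts "change in allocated objects: #{after[:allocated] - before[:allocated]}"
```

The live-slot change depends on GC timing and will vary from run to run, whereas the allocated-object change is at least the number of objects created in the loop.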

fallwith commented 11 months ago

That's a terrific result, @dinsley! Congrats on the production rollout and on it yielding results already!

newrelic / newrelic-ruby-agent