Open dinsley opened 11 months ago
Hi @dinsley,
Thanks very much for your feature request submission.
With regard to complexity, I recognize this request as touching on at least these 3 distinct areas:
We don't currently touch on the last 2 items on that list. Scout's allocation extension seems really neat.
As for the first one, we do have an offering that is enabled by default (and can be disabled by setting :disable_vm_sampler to true) to report on VM stats.
The New Relic Ruby agent's VM sampler reports on a good number of RubyVM/ segments whose names are referenced as Ruby constants at the top of the VMSampler class. When CRuby (formerly MRI) is in play, the New Relic Ruby agent's MriVM class is used to fetch the values for use with those segments.
Are you currently getting any value out of these existing VM stats that are being reported and would mostly like to have the same stats available down at the individual transaction level, or are you thinking that even if we were to bring those same stats to the transaction level they might be insufficient without dropping into the C layer?
Thanks for the reply!
We're not getting much value out of those global values, since it's very difficult to correlate or get context on any of the data. It's a nice overview of health, but without additional information, or without the metrics being scoped down and snapshotted per executed transaction, it's definitely hard to integrate into our workflows.
I think if it's possible to get the allocation metrics at the transaction (or custom-named) level in the future, that would help a lot, especially if the count were based on the code executed within a block. I think the performance implications of counting per context are a lot higher, though, which is why it's done in C in the Scout example. I'm still doing some digging on that.
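For anyone following along, block-scoped counting can be roughly approximated in pure Ruby on CRuby with `GC.stat(:total_allocated_objects)`. This is only a sketch: `count_allocations` is a hypothetical helper name, not something from the agent, Scout, or AppSignal, and because the counter is process-wide, allocations made by other threads during the block are included in the number.

```ruby
# Hypothetical helper (not part of the New Relic agent or any existing gem).
# GC.stat(:total_allocated_objects) is a process-wide cumulative counter on
# CRuby, so work done by other threads while the block runs is counted too.
def count_allocations
  before = GC.stat(:total_allocated_objects)
  result = yield
  allocated = GC.stat(:total_allocated_objects) - before
  [result, allocated]
end

# Example: build 1,000 strings and see roughly how many objects that cost.
result, allocated = count_allocations { Array.new(1_000) { |i| "job-#{i}" } }
puts "built #{result.size} strings, ~#{allocated} objects allocated"
```

In a multi-threaded process such as a threaded job runner, this over-counts, which is presumably one reason the C-level hooks track allocations per thread instead.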
AppSignal also tracks allocations the same way within a context at the 'transaction' level, and utilizes a similar approach, but that looks to be for supporting older Ruby versions:
https://github.com/appsignal/appsignal-ruby/blob/main/ext/appsignal_extension.c#L818
They expose it in a similar way to Scout:
Thanks, @dinsley! We'll post more here if we have any questions or thoughts to share as the feature request gets evaluated.
Sounds good! We're doing some prototyping with this approach and are going to be testing it in a few of our environments this week to see how it goes performance + accuracy-wise:
I've put the prototype up here: https://github.com/eventtemple/object_allocation_tracker
If interested in offering this, I'd be happy to abstract out the implementation and submit a PR after some more testing. Currently we're sending the results as metrics for jobs processed and have wrapped the execution for tracking.
I'll add some instructions to the gem after as well if people are interested in sending the data to New Relic in the meantime, after we've verified performance, accuracy, and all that good stuff.
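Until those instructions exist, here is a rough sketch of what wrapping a job and reporting its allocation count could look like. `AllocationReporter`, the `Custom/Allocations/...` metric name, and the use of `GC.stat` instead of the prototype's tracker are all assumptions made for illustration. The sink is injected so the snippet runs without the agent installed.

```ruby
# Hypothetical wrapper; names are illustrative and are not taken from the
# prototype gem or the New Relic agent.
class AllocationReporter
  # The sink is any block that accepts (metric_name, value).
  def initialize(&sink)
    @sink = sink
  end

  # Runs the job block and reports how many objects were allocated
  # process-wide while it ran, even if the job raises.
  def track(job_name)
    before = GC.stat(:total_allocated_objects)
    yield
  ensure
    allocated = GC.stat(:total_allocated_objects) - before
    @sink.call("Custom/Allocations/#{job_name}", allocated)
  end
end

# Stand-in sink that stores values in a Hash. With the agent loaded, the
# block could call NewRelic::Agent.record_metric(name, value) instead.
recorded = {}
reporter = AllocationReporter.new { |name, value| recorded[name] = value }
reporter.track('ReportJob') { Array.new(500) { Object.new } }
puts recorded.inspect
```

Reporting from `ensure` means failed jobs still produce a data point, which seems useful when the failure itself is memory-related.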
> If interested in offering this
Everything you've mentioned seems quite universally valuable for all Rubyists and we absolutely would be interested in evaluating any related PRs or proposals!
Code in the main agent is held to these 3 requirements:
We have occasionally offered extra functionality that breaks from those aims, and we just provide it as an opt-in experience. For example, our newrelic-infinite_tracing gem requires CRuby, the third-party grpc gem, and a higher minimum version of Ruby than what the agent requires (because grpc does). So we just leave it as an optional addition to a user's Gemfile.
Given that precedent, I think you have many options: the code could be available in the New Relic agent by default, be opt-in, or be delivered by a third-party gem that we reference in documentation.
Thanks for all that information!
We've moved a branch of the object allocation tracker (https://github.com/eventtemple/object_allocation_tracker/tree/threadsafe) into production to test performance implications (so far negligible) and accuracy, and so far, so good after a few days. I'll give it a week and circle back to look at opening a proposal PR, as there are a few pieces I'd need to look into. I know this implementation wouldn't work for the JRuby runtime, but I'm unsure about the others, and there may be a few changes needed for the minimum Ruby version.
We're currently just manually recording the metrics for background jobs into New Relic, and it has been a huge help in isolating a few problem jobs related to bloat that were difficult to replicate locally. There seem to be some discrepancies with the VM metrics being recorded, but I think that's because those VM metrics are recorded as snapshots of the VM's state, so as allocations happen and are cleaned up, the larger state in between snapshots could be missed. (That makes sense if I'm correct, but I'm still doing some digging on that side.)
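To illustrate the hypothesis (this says nothing about how the agent's sampler actually works), here is a small sketch comparing a point-in-time gauge with a cumulative counter on CRuby. Both `GC.stat` keys are real, but the scenario is contrived: garbage that is allocated and collected between two readings usually leaves the live-slot gauge close to where it started, while the cumulative counter still records every allocation.

```ruby
# Illustration only: a point-in-time gauge versus a cumulative counter.
GC.start
live_before  = GC.stat(:heap_live_slots)          # gauge: slots in use now
total_before = GC.stat(:total_allocated_objects)  # counter: only ever grows

100_000.times { Object.new } # short-lived garbage, nothing is retained

GC.start
live_delta  = GC.stat(:heap_live_slots) - live_before
total_delta = GC.stat(:total_allocated_objects) - total_before

puts "live slots changed by #{live_delta}, objects allocated: #{total_delta}"
```

The exact live-slot change depends on the Ruby version and GC timing, so it should be read as indicative only.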
That's a terrific result, @dinsley! Congrats on the production rollout and on it yielding results already!
Is your feature request related to a problem? Please describe.
Better visibility into memory usage and object allocation at the transaction level, more specifically in background jobs, to help track down potential memory leaks.
Feature Description
It would be amazing if there were a way to get rough object allocation counts in background jobs. Scout has some functionality that hooks into the garbage collector's allocation hooks to do rudimentary counting, and then uses that to provide visibility into potential areas of memory bloat based on a scoring system.
This is obviously something that would need careful consideration because, regardless of how tracking is done, it can and will have an impact on performance and memory.
Even just having the rough data available to query would be useful without the UI/Site expanded to have functionality around it.
Additional context
Scout's C-extension for providing the functionality:
https://github.com/scoutapp/scout_apm_ruby/blob/master/ext/allocations/allocations.c
I've done a little bit of prototyping with a similar approach in the later versions of Ruby, but am not too experienced with C or any potential issues with supporting functionality like this long term, so I'm probably missing a lot of context.
Priority
Please help us better understand this feature request by choosing a priority from the following options: [Nice to Have, Really Want, Must Have, Blocker]