Migrate Rails instrumentation to use only ActiveSupport::Notifications

ahayworth commented 1 year ago

We recently uncovered an issue wherein our action_pack instrumentation caused a production problem. There are many ways to consider such an issue, but one way is to acknowledge that monkey-patching is fraught with peril, and that we should consider different methods of instrumentation.

For Rails, we should be able to use ActiveSupport::Notifications. We already use this to great effect in our action_view instrumentation, and we think we can expand this further to the rest of our Rails instrumentation.

To do this, we need to:

[x] Verify that Rails 7 implements safer subscription start/end behavior (a crashing subscriber should not prevent us from finishing our spans correctly).
[ ] Audit the instrumentation we have for Rails, and figure out what ActiveSupport notifications could provide similar levels of instrumentation:
- [x] action_pack
- process_action.action_controller is probably the right one.
- payload.response and payload.status will be set when the notification finishes
- This is defined in actionpack/lib/action_controller/metal/instrumentation.rb
- [x] action_view
- This one is already done 🎉
- [x] active_job
- We can't drop our patch for serializing metadata regardless of what we choose to do here.
- We implement our enqueue instrumentation as an around_enqueue filter; which is precisely what Rails does.
- perform.active_job will likely cover the rest.
- [ ] active_record
- This was one of the things preventing us from doing notifications the first time around: we only get sql.active_record and instantiation.active_record.
- sql.active_record is great, but it's low-level and won't capture whether we are doing a #find or a #destroy, etc. Nor will it capture callbacks.
- instantiation.active_record will capture callbacks, but it's pretty basic and still doesn't tell us what method we're really in.
- [x] active_support
- This is already done, because it doesn't do anything on its own! 🎉
- It's a base for using notifications for spans, which is precisely what we want.
- Basically, previous contributors thought ahead to a future where it would be useful to have a generic "subscribe, make a span" kind of thing. Yay!
[ ] Determine how many (if any) notifications are generated entirely after the instrumented operation is completed, and how many are generated around the instrumented operation.
- This has implications around accessing the current_span, which may be a deal-breaker.
[ ] Augment our custom span subscriber in any way necessary to support a more broad use-case than what it does now
- We need to figure out if there is a better way to do this in Rails 7 and if we should adjust this span subscriber.
- We also should look and see if we could backport the Rails 7 fixes here ... or get that backported upstream.
[ ] Work through instrumentation and convert them.
- [ ] Depending on the outcome of the safety/backport investigations, we can decide whether or not we wish to keep the monkey-patched versions around for older Rails installations or not.

ahayworth commented 1 year ago

This is the PR that improved safety in Rails 7: https://github.com/rails/rails/pull/43282

plantfansam commented 1 year ago

I think this is a great idea. Even if we discover issues with ASN, that sounds like a great reason to contribute fixes upstream (to Rails) 😄

richardmcmillen commented 1 year ago

This is cool to see!

I noted in my MR here https://github.com/open-telemetry/opentelemetry-ruby-contrib/pull/204 that the ActionPack instrumentation could probably be refactored to use the same subscriber as I created in the ME. Just in case anyone wants to have a look.

robbkidd commented 1 year ago

Talked about this during the SIG today. Attendees are fans of the idea to move away from monkeypatching and towards AS::N subscriptions.

tjefferson08 commented 1 year ago

Can AS::N support current_span ?

One thing I do a lot in my apps:

# in a controller
def index
  OpenTelemetry::Trace.current_span.add_attributes(interesting_stuff)
  # ... rest of action
end

I'm not super familiar with AS::N, but I'd think deferring span creation to after the execution of index would break this use case.

arielvalentin commented 1 year ago

@ahayworth I started to look into this a little bit today and noticed a few constraints and dependencies that will make it difficult to use the existing AS Notifications instrumentation outside of a Rails application.

The SpanSubscriber has a director dependency on Rails and assumes it is installed (introduced here https://github.com/open-telemetry/opentelemetry-ruby/pull/993):

https://github.com/open-telemetry/opentelemetry-ruby-contrib/blob/main/instrumentation/active_support/lib/opentelemetry/instrumentation/active_support/span_subscriber.rb#L33

Some applications, e.g. the GitHub Monolith, do not load the entire Rails environment at startup and in some cases only loads a subset of ActiveSupport, ActiveJob or ActiveRecord. We currently do not use ActiveSupport instrumentation in the monolith but we do use the current ActiveRecord instrumentation.

We would need to make changes to the existing AS Notifications gem in order to remove this dependency on Rails.

arielvalentin commented 1 year ago

I've opened this PR, which should address issues with using AS Notifications outside of Rails applications

https://github.com/open-telemetry/opentelemetry-ruby-contrib/pull/242

arielvalentin commented 1 year ago

cc: https://github.com/open-telemetry/opentelemetry-ruby-contrib/issues/269 https://github.com/open-telemetry/opentelemetry-ruby-contrib/issues/268

ahayworth commented 1 year ago

Apologies for disappearing for awhile on this - I took a break from all work for a long time. On the upside, I'm fully refreshed and ready to work on things. 😄

Some applications, e.g. the GitHub Monolith, do not load the entire Rails environment at startup

@arielvalentin thank you for making changes to support this use case! It's exceedingly uncommon (as you and I both know), but an important one nonetheless. ❤️

Can AS::N support current_span ?

@tjefferson08 It depends ™️ primarily on when the AS::Notification is created. Many things in rails do something like this (really inaccurate pseudocode, apologies):

inside_notification(name: blah) do
  do_some_work
end

In that case, accessing current_span should work well. However, if the notification is generated completely after the thing it instruments, then accessing current_span would not work as expected. We'll need to figure out how widespread each variant is within Rails, and I don't know the answer to that off the top of my head. I'll add it to the todo list.

ahayworth commented 1 year ago

@arielvalentin I took a look through our instrumentation, mapping things to AS::N calls. Basically, active_record is really the sticky one - I'm not sure how we'd be able to get things like User#find with precisely the same semantics as we have before.

I'm curious what you think, take a look at the updated checklist with some notes. Maybe this is something we should talk to upstream about?

arielvalentin commented 1 year ago

What did dd-trace-rb beelines, NewRelic, or signal-fx do?

arielvalentin commented 1 year ago

cc: https://github.com/open-telemetry/opentelemetry-ruby-contrib/issues/293

ahayworth commented 1 year ago

It looks like they just subscribe to the two events - instantiate and sql - and create pretty basic spans around them. They don't create different span names depending on the different methods called (#create vs #destroy, etc) like we do: https://github.com/DataDog/dd-trace-rb/tree/a7c8aa5d81b67d136665dfb86d2345b783fb2290/lib/datadog/tracing/contrib/active_record

arielvalentin commented 1 year ago

I think that's ok. Low cardinality spans names and high cardinality tags I think is the better way to go.

How do others feel? @open-telemetry/ruby-contrib-approvers

arielvalentin commented 1 year ago

Looking at the notification events I see some interesting things at least for ActiveRecord:

https://guides.rubyonrails.org/active_support_instrumentation.html#active-record

Key	Value	OTel	Description
:sql	SQL statement	`db.statement`	Optional since we probably favor DB Driver
:name	Name of the operation	`span.name`	per the example it likes like to would be `Post Load`
:connection	Connection object	`net.*`	we can extract attributes about the DB itself
:binds	Bind parameters	N/A
:type_casted_binds	Typecasted bind parameters	N/A
:statement_name	SQL Statement name	?	Unclear if this would be helpful to record. Is this the Prepared Statement Name?
:cached	true is added when cached queries used	? `sql.cached` `rails.activerecord.query.cached` ?	Unclear if this should be custom

ericmustin commented 1 year ago

@ahayworth @arielvalentin wrt span name not having the same semantics in ddtracerb I think what’s worth understanding is that the ddtracerb software works tightly coupled to the datadog agent, where a combination of obfuscation+Normalization occurs on the sql statement (attached in datadogs tracing system as span.resource ). This creates the same behavior of “low cardinality span resource high cardinality span tags” as opentelemetry’s “low cardinality span name high cardinality span attributes” approach, but since opentelemetry lacks any standard sql obfuscation+normalization, and we don’t take that approach currently anyway for span name, @ahayworth is correct that migrating to this style of instrumentation would effectively be a breaking change to our instrumentation output (although not according to semver2.0 since pre 1.0 software offers no such guarantees). Whether or not it’s worth that break, idk. Perhaps we could introduce this gracefully, so users could stay on the old style of instrumentation, or we could use the 1.0 as an opportunity to change span naming convention.

arielvalentin commented 1 year ago

Perhaps the best way to move forward is to catalog ActiveSupport notifications and create a mapping guide, which may help better inform our decisions:

https://docs.google.com/spreadsheets/d/1T7uYBewovl6YGo0uAsvKLLSWO0z_m6-rRJJtzoPb264/edit

Here's my 2 cents:

I'm personally not very concerned about the breaking chances for span names, especially in cases like where the names deviate from the specification, which I think will be a welcomed change (e.g. ActionPack).

I'm in favor of ripping the bandaid off and skipping adding backward comparability for these. If folks want the old span names and instrumentation I'd advocate that we tell them to pin to an earlier version.

ahayworth commented 1 year ago

@arielvalentin I wasn't able to edit that spreadsheet, so I made a copy and updated it here (you should have edit access through your work account): https://docs.google.com/spreadsheets/d/1mObwGGdJtmeG2YCWCQxWEwwpBEOA-3EXb0WT83bWc8I/edit#gid=379636133

I haven't mapped any additional types yet, but I did finish pulling over all the events and info from the rails guides. Click the little expandy-thing by each group to see them all. I'll try to do some mapping tomorrow! 😄

github-actions[bot] commented 1 year ago

👋 This issue has been marked as stale because it has been open with no activity. You can: comment on the issue or remove the stale label to hold stale off for a while, add the keep label to hold stale off permanently, or do nothing. If you do nothing this issue will be closed eventually by the stale bot.

arielvalentin commented 1 year ago

👋🏼 @jhawthorn The OTel Ruby instrumentation currently relies on monkey patching Rails libraries and I am going to start to work on migrating them to use Active Support Notifications.

While investigating next steps for this implementation I saw that this PR https://github.com/rails/rails/pull/43390 removes start/finish methods from the Subscriber class, which is what we currently use in some instrumentations to know when to start and finish a span.

Here is an example of a custom subscriber that relies on those template methods: https://github.com/open-telemetry/opentelemetry-ruby-contrib/blob/main/instrumentation/racecar/lib/opentelemetry/instrumentation/racecar/process_message_subscriber.rb

This is important because it allows us propagate the parent span context so that when we create additional child spans they are properly linked.

If I understand the changes correctly there will be backward compatibility so we will be able to continue to use the start and stop methods after Rails 7.1, is that correct?

Is there a plan to drop support for the those lifecycle methods in favor of something else?

arielvalentin commented 1 year ago

Hello friends: @robbkidd @robbkidd @chrisholmes @ahayworth

I started a bit of an internal refactoring (https://github.com/open-telemetry/opentelemetry-ruby-contrib/pull/641) to see what it would take to separate the logic of payload handling and the subscriber registration logic so that users would be able to customize the logic around creating spans, setting span attributes or adding events, etc... however this refactoring has left me with a more questions than answers.

Why did we choose to set the span name to this instead of using the name of the event given to the subscriber? name.split('.')[0..1].reverse.join(' ').freeze

Did we ever intend on supporting regular expressions for the subscriber like AS::Notifications supports?

Did we ever intend on adding support for Span Links or did we implicitly only ever want to use Parent-Child?

The payload does not include the context object and instead has a token. Is the only way for children to access the context is to use the Thread Local Current Context?

Do we envision a scenario ever where AS::Notifications switches to out of thread processing making the current thread context stack invalid?

arielvalentin commented 11 months ago

We have started moving forward with the work here and have completed the migration for active_job and action_pack.

Here is what we have come up with so far:

Prefer Semantic Conventions whenever possible
Use the Rails defined Event Name when semantic conventions do not exist, i.e no name.split('.')[0..1].reverse.join(' ').freeze

arielvalentin commented 11 months ago

@jhawthorn @bensheldon @composerinteralia I started looking into switching ActiveRecord instrumentation to use sql.active_record notifications but wanted to highlight what @ahayworth points out in the description of this issue:

sql.active_record is great, but it's low-level and won't capture whether we are doing a #find or a #destroy, etc. Nor will it capture callbacks.

The current instrumentation patches AR persistence and query methods that add non-semconv internal spans but as a result they capture time spend in callbacks.

Do you think Rails upstream would be amenable to adding notifications around method AS methods so we are able to capture some callback timings?

If not, then I think we have to stick with patching AR and use sql.active_record to enrich spans with information about whether or not the query was cached or if it was performed async.

The lower level DB drivers are the ones performing sanitization for non-AR users but we could potentially have some cost savings here if the bind parameters are able to give us a hint that the query is already sanitized and used that in lieu of obfuscating the db query.

Some bug reports we received about our current instrumentation come from the fact that some gems patch AR incorrectly and change the method signature of the AR public API. I know it's not something that we should worry about, but I also do not want the instrumentation to fall behind changes in Rails that would result in the same problem.

I am considering updating the instrumentation to use argument forwarding instead of redefining method signatures:

         # This is the current definition in the gem 
          def decrement!(attribute, by = 1, touch: nil)
            tracer.in_span("#{self.class}#decrement!") do
              super
            end
          end

         # using forwarding to mitigate method signature changes
          def decrement!(...)
            tracer.in_span("#{self.class}#decrement!") do
              super
            end
          end

Taking that a step further I am considering proposing lower cardinality span names with semconv attributes:

         # using forwarding to mitigate method signature changes
          def decrement!(...)
            tracer.in_span("active_record#decrement!", attributes: { "code.namespace" => self.class.name }) do
              super
            end
          end

What are y'alls thoughts on this?

jhawthorn commented 10 months ago

sql.active_record is great, but it's low-level and won't capture whether we are doing a #find or a #destroy, etc. Nor will it capture callbacks.

payload[:name] captures these two. It like won't report 1:1 exactly what otel previously reported, but that may be a good thing (to avoid user confusion in having different things logged in development vs reported in production). There may be some missing (ex. decrement/decrement are reported as "update" which is correct, but maybe not as specific as desired) I'd be happy to help filling in those gaps.

>> ActiveSupport::Notifications.subscribe("sql.active_record") {|_,_,_,_,payload| p payload[:name] }
>> User.first
"User Load"
>> user = User.create!
"TRANSACTION"
"User Create"
"TRANSACTION"
>> user.destroy!
"TRANSACTION"
"User Destroy"
"TRANSACTION"

arielvalentin commented 8 months ago

payload[:name] captures these two. It like won't report 1:1 exactly what otel previously reported, but that may be a good thing (to avoid user confusion in having different things logged in development vs reported in production). There may be some missing (ex. decrement/decrement are reported as "update" which is correct, but maybe not as specific as desired) I'd be happy to help filling in those gaps.

Then in these cases using the name key may be enough for naming spans.

What about measuring time spent in AR callbacks? AFAICT sql.active_record timing does not include that.

olepbr commented 1 week ago

I believe https://github.com/open-telemetry/opentelemetry-ruby-contrib/issues/1258 is another example of "monkey-patching is fraught with peril" =)

open-telemetry / opentelemetry-ruby-contrib

Migrate Rails instrumentation to use only ActiveSupport::Notifications #218