Open ahayworth opened 1 year ago
This is the PR that improved safety in Rails 7: https://github.com/rails/rails/pull/43282
I think this is a great idea. Even if we discover issues with ASN, that sounds like a great reason to contribute fixes upstream (to Rails) 😄
This is cool to see!
I noted in my MR here https://github.com/open-telemetry/opentelemetry-ruby-contrib/pull/204 that the ActionPack instrumentation could probably be refactored to use the same subscriber as I created in the ME. Just in case anyone wants to have a look.
Talked about this during the SIG today. Attendees are fans of the idea to move away from monkeypatching and towards AS::N subscriptions.
Can AS::N support current_span
?
One thing I do a lot in my apps:
# in a controller
def index
OpenTelemetry::Trace.current_span.add_attributes(interesting_stuff)
# ... rest of action
end
I'm not super familiar with AS::N, but I'd think deferring span creation to after the execution of index
would break this use case.
@ahayworth I started to look into this a little bit today and noticed a few constraints and dependencies that will make it difficult to use the existing AS Notifications instrumentation outside of a Rails application.
The SpanSubscriber
has a director dependency on Rails
and assumes it is installed (introduced here https://github.com/open-telemetry/opentelemetry-ruby/pull/993):
Some applications, e.g. the GitHub Monolith, do not load the entire Rails environment at startup and in some cases only loads a subset of ActiveSupport
, ActiveJob
or ActiveRecord
. We currently do not use ActiveSupport
instrumentation in the monolith but we do use the current ActiveRecord
instrumentation.
We would need to make changes to the existing AS Notifications gem in order to remove this dependency on Rails.
I've opened this PR, which should address issues with using AS Notifications outside of Rails applications
https://github.com/open-telemetry/opentelemetry-ruby-contrib/pull/242
Apologies for disappearing for awhile on this - I took a break from all work for a long time. On the upside, I'm fully refreshed and ready to work on things. 😄
Some applications, e.g. the GitHub Monolith, do not load the entire Rails environment at startup
@arielvalentin thank you for making changes to support this use case! It's exceedingly uncommon (as you and I both know), but an important one nonetheless. ❤️
Can AS::N support current_span ?
@tjefferson08 It depends ™️ primarily on when the AS::Notification
is created. Many things in rails do something like this (really inaccurate pseudocode, apologies):
inside_notification(name: blah) do
do_some_work
end
In that case, accessing current_span
should work well. However, if the notification is generated completely after the thing it instruments, then accessing current_span
would not work as expected. We'll need to figure out how widespread each variant is within Rails, and I don't know the answer to that off the top of my head. I'll add it to the todo list.
@arielvalentin I took a look through our instrumentation, mapping things to AS::N
calls. Basically, active_record
is really the sticky one - I'm not sure how we'd be able to get things like User#find
with precisely the same semantics as we have before.
I'm curious what you think, take a look at the updated checklist with some notes. Maybe this is something we should talk to upstream about?
What did dd-trace-rb beelines, NewRelic, or signal-fx do?
It looks like they just subscribe to the two events - instantiate
and sql
- and create pretty basic spans around them. They don't create different span names depending on the different methods called (#create
vs #destroy
, etc) like we do: https://github.com/DataDog/dd-trace-rb/tree/a7c8aa5d81b67d136665dfb86d2345b783fb2290/lib/datadog/tracing/contrib/active_record
I think that's ok. Low cardinality spans names and high cardinality tags I think is the better way to go.
How do others feel? @open-telemetry/ruby-contrib-approvers
Looking at the notification events I see some interesting things at least for ActiveRecord:
https://guides.rubyonrails.org/active_support_instrumentation.html#active-record
Key | Value | OTel | Description |
---|---|---|---|
:sql | SQL statement | db.statement |
Optional since we probably favor DB Driver |
:name | Name of the operation | span.name |
per the example it likes like to would be Post Load |
:connection | Connection object | net.* |
we can extract attributes about the DB itself |
:binds | Bind parameters | N/A | |
:type_casted_binds | Typecasted bind parameters | N/A | |
:statement_name | SQL Statement name | ? | Unclear if this would be helpful to record. Is this the Prepared Statement Name? |
:cached | true is added when cached queries used | ? sql.cached rails.activerecord.query.cached ? |
Unclear if this should be custom |
@ahayworth @arielvalentin wrt span name not having the same semantics in ddtracerb I think what’s worth understanding is that the ddtracerb software works tightly coupled to the datadog agent, where a combination of obfuscation+Normalization occurs on the sql statement (attached in datadogs tracing system as span.resource ). This creates the same behavior of “low cardinality span resource high cardinality span tags” as opentelemetry’s “low cardinality span name high cardinality span attributes” approach, but since opentelemetry lacks any standard sql obfuscation+normalization, and we don’t take that approach currently anyway for span name, @ahayworth is correct that migrating to this style of instrumentation would effectively be a breaking change to our instrumentation output (although not according to semver2.0 since pre 1.0 software offers no such guarantees). Whether or not it’s worth that break, idk. Perhaps we could introduce this gracefully, so users could stay on the old style of instrumentation, or we could use the 1.0 as an opportunity to change span naming convention.
Perhaps the best way to move forward is to catalog ActiveSupport notifications and create a mapping guide, which may help better inform our decisions:
https://docs.google.com/spreadsheets/d/1T7uYBewovl6YGo0uAsvKLLSWO0z_m6-rRJJtzoPb264/edit
Here's my 2 cents:
I'm personally not very concerned about the breaking chances for span names, especially in cases like where the names deviate from the specification, which I think will be a welcomed change (e.g. ActionPack).
I'm in favor of ripping the bandaid off and skipping adding backward comparability for these. If folks want the old span names and instrumentation I'd advocate that we tell them to pin to an earlier version.
@arielvalentin I wasn't able to edit that spreadsheet, so I made a copy and updated it here (you should have edit access through your work account): https://docs.google.com/spreadsheets/d/1mObwGGdJtmeG2YCWCQxWEwwpBEOA-3EXb0WT83bWc8I/edit#gid=379636133
I haven't mapped any additional types yet, but I did finish pulling over all the events and info from the rails guides. Click the little expandy-thing by each group to see them all. I'll try to do some mapping tomorrow! 😄
👋 This issue has been marked as stale because it has been open with no activity. You can: comment on the issue or remove the stale label to hold stale off for a while, add the keep
label to hold stale off permanently, or do nothing. If you do nothing this issue will be closed eventually by the stale bot.
👋🏼 @jhawthorn The OTel Ruby instrumentation currently relies on monkey patching Rails libraries and I am going to start to work on migrating them to use Active Support Notifications.
While investigating next steps for this implementation I saw that this PR https://github.com/rails/rails/pull/43390 removes start/finish
methods from the Subscriber
class, which is what we currently use in some instrumentations to know when to start
and finish
a span.
Here is an example of a custom subscriber that relies on those template methods: https://github.com/open-telemetry/opentelemetry-ruby-contrib/blob/main/instrumentation/racecar/lib/opentelemetry/instrumentation/racecar/process_message_subscriber.rb
This is important because it allows us propagate the parent span context so that when we create additional child spans they are properly linked.
If I understand the changes correctly there will be backward compatibility so we will be able to continue to use the start
and stop
methods after Rails 7.1, is that correct?
Is there a plan to drop support for the those lifecycle methods in favor of something else?
Hello friends: @robbkidd @robbkidd @chrisholmes @ahayworth
I started a bit of an internal refactoring (https://github.com/open-telemetry/opentelemetry-ruby-contrib/pull/641) to see what it would take to separate the logic of payload handling and the subscriber registration logic so that users would be able to customize the logic around creating spans, setting span attributes or adding events, etc... however this refactoring has left me with a more questions than answers.
Why did we choose to set the span name to this instead of using the name of the event given to the subscriber? name.split('.')[0..1].reverse.join(' ').freeze
Did we ever intend on supporting regular expressions for the subscriber like AS::Notifications supports?
Did we ever intend on adding support for Span Links or did we implicitly only ever want to use Parent-Child?
The payload does not include the context object and instead has a token. Is the only way for children to access the context is to use the Thread Local Current Context?
Do we envision a scenario ever where AS::Notifications switches to out of thread processing making the current thread context stack invalid?
We have started moving forward with the work here and have completed the migration for active_job
and action_pack
.
Here is what we have come up with so far:
name.split('.')[0..1].reverse.join(' ').freeze
@jhawthorn @bensheldon @composerinteralia I started looking into switching ActiveRecord
instrumentation to use sql.active_record
notifications but wanted to highlight what @ahayworth points out in the description of this issue:
sql.active_record
is great, but it's low-level and won't capture whether we are doing a #find or a #destroy, etc. Nor will it capture callbacks.
The current instrumentation patches AR persistence and query methods that add non-semconv internal spans but as a result they capture time spend in callbacks.
Do you think Rails upstream would be amenable to adding notifications around method AS methods so we are able to capture some callback timings?
If not, then I think we have to stick with patching AR and use sql.active_record
to enrich spans with information about whether or not the query was cached
or if it was performed async
.
The lower level DB drivers are the ones performing sanitization for non-AR users but we could potentially have some cost savings here if the bind parameters are able to give us a hint that the query is already sanitized and used that in lieu of obfuscating the db query.
Some bug reports we received about our current instrumentation come from the fact that some gems patch AR incorrectly and change the method signature of the AR public API. I know it's not something that we should worry about, but I also do not want the instrumentation to fall behind changes in Rails that would result in the same problem.
I am considering updating the instrumentation to use argument forwarding instead of redefining method signatures:
# This is the current definition in the gem
def decrement!(attribute, by = 1, touch: nil)
tracer.in_span("#{self.class}#decrement!") do
super
end
end
# using forwarding to mitigate method signature changes
def decrement!(...)
tracer.in_span("#{self.class}#decrement!") do
super
end
end
Taking that a step further I am considering proposing lower cardinality span names with semconv attributes:
# using forwarding to mitigate method signature changes
def decrement!(...)
tracer.in_span("active_record#decrement!", attributes: { "code.namespace" => self.class.name }) do
super
end
end
What are y'alls thoughts on this?
sql.active_record
is great, but it's low-level and won't capture whether we are doing a #find or a #destroy, etc. Nor will it capture callbacks.
payload[:name]
captures these two. It like won't report 1:1 exactly what otel previously reported, but that may be a good thing (to avoid user confusion in having different things logged in development vs reported in production). There may be some missing (ex. decrement/decrement are reported as "update" which is correct, but maybe not as specific as desired) I'd be happy to help filling in those gaps.
>> ActiveSupport::Notifications.subscribe("sql.active_record") {|_,_,_,_,payload| p payload[:name] }
>> User.first
"User Load"
>> user = User.create!
"TRANSACTION"
"User Create"
"TRANSACTION"
>> user.destroy!
"TRANSACTION"
"User Destroy"
"TRANSACTION"
payload[:name] captures these two. It like won't report 1:1 exactly what otel previously reported, but that may be a good thing (to avoid user confusion in having different things logged in development vs reported in production). There may be some missing (ex. decrement/decrement are reported as "update" which is correct, but maybe not as specific as desired) I'd be happy to help filling in those gaps.
Then in these cases using the name
key may be enough for naming spans.
What about measuring time spent in AR callbacks? AFAICT sql.active_record
timing does not include that.
I believe https://github.com/open-telemetry/opentelemetry-ruby-contrib/issues/1258 is another example of "monkey-patching is fraught with peril" =)
We recently uncovered an issue wherein our action_pack instrumentation caused a production problem. There are many ways to consider such an issue, but one way is to acknowledge that monkey-patching is fraught with peril, and that we should consider different methods of instrumentation.
For Rails, we should be able to use ActiveSupport::Notifications. We already use this to great effect in our action_view instrumentation, and we think we can expand this further to the rest of our Rails instrumentation.
To do this, we need to:
process_action.action_controller
is probably the right one.payload.response
andpayload.status
will be set when the notification finishesaround_enqueue
filter; which is precisely what Rails does.perform.active_job
will likely cover the rest.sql.active_record
andinstantiation.active_record
.sql.active_record
is great, but it's low-level and won't capture whether we are doing a#find
or a#destroy
, etc. Nor will it capture callbacks.instantiation.active_record
will capture callbacks, but it's pretty basic and still doesn't tell us what method we're really in.current_span
, which may be a deal-breaker.