open-telemetry / oteps

OpenTelemetry Enhancement Proposals
https://opentelemetry.io
Apache License 2.0

Ephemeral Resource Attributes #208

Closed tedsuo closed 1 year ago

tedsuo commented 2 years ago

This OTEP is part of the RUM/Client initiative.

Currently, we are missing a place to put important client information which applies to all telemetry emitted by an SDK. This information includes attributes such as session ID, language preference, locality/timezone, and other types of user data.

Normally, these attributes would be recorded as resources. However, on client processes, there are times when this information changes without the SDK re-initializing. For example:

In all of these cases, the application/SDK is not restarted. Currently, the resource associated with the SDK cannot be changed after it is started. This makes it very difficult to record these needed attributes.

This OTEP proposes a mechanism for updating the SDK with a new resource, which will be applied to all future telemetry created by the SDK. The proposal attempts to do this while preserving important characteristics already defined for resources:

If there are other backwards compatibility requirements for resources that I have missed, please let me know.

Cheers, -Ted

carlosalberto commented 2 years ago

@tedsuo Thanks - I feel like some examples would be great, as it seems the Validator is the piece that separates Resources/Attributes into permanent and ephemeral?

tedsuo commented 2 years ago

Sure, no problem @carlosalberto. Would you want an example implementation? Or an example use case?

tedsuo commented 2 years ago

Added an example implementation and example use case.

tedsuo commented 2 years ago

Yes? No? What should we do here? Based on these requirements, it would be good to understand how the TC would like to move forward.

tigrannajaryan commented 2 years ago

@tedsuo the spec defines Resource like this:

A Resource is an immutable representation of the entity producing telemetry as Attributes.

This text is in a Stable spec document. How do we reconcile this OTEP with the spec's stance on immutability of the Resource? Are you suggesting that we break a Stable spec document? Or do you not think this is a breaking change?

t2t2 commented 2 years ago

How do we reconcile this OTEP with the spec's stance on immutability of the Resource?

This doesn't change anything about current resource immutability - an update on the resource provider would end up in a new resource instance. To speak in code:

const assert = require('node:assert');

// ResourceProvider is the API proposed in this OTEP, not an existing SDK class
const resourceProvider = new ResourceProvider({
    // Initial set of attributes, internally does a new Resource(attrs) and stores it as current value
    'session.id': '1',
});
const tracerProvider = new TracerProvider({ resourceProvider });
const tracer = tracerProvider.getTracer(/* irrelevant */);

const span1 = tracer.startSpan(/* ... */);
// internally span.resource = tracer.tracerProvider.resourceProvider.getResource()

// Some time later, user logs in and their identity is known
resourceProvider.setAttribute('enduser.id', 'superadmin');
// internally currentResource = currentResource.merge(new Resource(newAttrs)), which as per the current spec
// returns a new Resource with merged attrs
// That new Resource is set as the current value in ResourceProvider

// Or session expires and a new one is set
resourceProvider.setAttribute('session.id', '2');

const span2 = tracer.startSpan(/* ... */);

// The two spans hold different Resource instances
assert.notStrictEqual(span1.resource, span2.resource);

assert.deepStrictEqual(span1.resource.attributes, { 'session.id': '1' });
assert.deepStrictEqual(span2.resource.attributes, { 'session.id': '2', 'enduser.id': 'superadmin' });

tigrannajaryan commented 2 years ago

This doesn't change anything about current resource immutability - an update on the resource provider would end up in a new resource instance.

I disagree. This is not just about a Resource instance in memory. It is about the Resource that is emitted by the instrumented application. The recipients of telemetry expect that the resource is immutable, i.e. its attributes do not change over time.

The OTEP talks about this in the "Trade-offs and mitigations" section. I think this is a breaking change. It breaks the contract between OTel sources and telemetry destinations. The OTEP text even recommends this:

In this case, it is recommended that these systems modify their behavior

I don't think this is acceptable. We are saying that "yes, we broke the contract, deal with it". IMO, we cannot do that.

tigrannajaryan commented 2 years ago

I thought a bit more about this, and I want to find a solution.

I don't think we can delete the requirement which says the Resource is immutable. I think this needs to stay otherwise we are breaking the contract. Additionally, unfortunately the spec says we are not allowed to change the association of the Resource and TracerProvider once that association is established:

a resource can be associated with the TracerProvider when the TracerProvider is created.

However, let's step back for a moment. I don't think recipients of telemetry care about the association inside the SDK. The recipients care about the data model, and the data model certainly allows the SDK to emit telemetry associated with different Resources. A new TracerProvider can be created with a new Resource and can be used to emit telemetry that was previously emitted using a different TracerProvider, and this is completely legal.

Given the above, I do not see any clause in the spec that directly prohibits us from introducing a new way for a TracerProvider to be associated with some proxy object, which itself is associated with a Resource, and allowing that association to change over time. Yes, this is in a sense cheating, but it allows us to introduce this new way such that it is not a breaking change for the SDK. That's what the proxy ResourceProvider here does.
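A rough sketch of how such a proxy could sit next to the existing constructor argument, so that code passing a fixed Resource keeps working. Everything here is a stand-in written for illustration: the `Resource`, `ResourceProvider` and `TracerProvider` classes below are hypothetical minimal versions, not the OpenTelemetry JS SDK classes, and the `resourceProvider` option is an assumption.

```javascript
// Hypothetical stand-ins, not the real OpenTelemetry JS SDK classes.
class Resource {
  constructor(attributes = {}) {
    // Freeze a copy so a Resource instance never changes after creation.
    this.attributes = Object.freeze({ ...attributes });
  }
  merge(other) {
    // Returns a new Resource; attributes of `other` win on conflict.
    return new Resource({ ...this.attributes, ...other.attributes });
  }
}

class ResourceProvider {
  constructor(attributes = {}) {
    this.current = new Resource(attributes);
  }
  getResource() {
    return this.current;
  }
  setAttribute(key, value) {
    // Swap in a new immutable Resource instead of mutating the old one.
    this.current = this.current.merge(new Resource({ [key]: value }));
  }
}

class TracerProvider {
  constructor({ resource, resourceProvider } = {}) {
    // Existing callers pass `resource`; wrap it in a provider that nobody updates.
    this.resourceProvider =
      resourceProvider ?? new ResourceProvider(resource ? resource.attributes : {});
  }
  startSpan(name) {
    // Each span captures whichever Resource is current when it starts.
    return { name, resource: this.resourceProvider.getResource() };
  }
}

// Existing style: a fixed resource, behaviour unchanged.
const legacy = new TracerProvider({ resource: new Resource({ 'service.name': 'shop' }) });

// New style: the association goes through the proxy and can change over time.
const provider = new ResourceProvider({ 'session.id': '1' });
const updatable = new TracerProvider({ resourceProvider: provider });
const before = updatable.startSpan('load');
provider.setAttribute('session.id', '2');
const after = updatable.startSpan('click');
```

In this sketch `before.resource` and `after.resource` are two distinct frozen objects, which is the sense in which the SDK-side immutability is preserved. It says nothing about whether recipients of the telemetry can tolerate the change.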

To me the following questions remain:

  1. Is it right that session id is part of the Resource? It doesn't feel right but I can't put my finger on it, so I will refrain from objecting to this for now.
  2. Why do we need to introduce anything called "ephemeral attributes"? I think this is not needed. They are regular attributes just like any other. Nothing ephemeral here. We only introduce a new way to specify the Resource that must be associated with the produced telemetry. That's all it is. It is a regular Resource, an immutable one. Attributes are all regular.
  3. Is it really possible to introduce ResourceProvider with the ability to attach it to TracerProvider in a way that does not break any existing code? We need to see prototypes that demonstrate this.

martinkuba commented 2 years ago

Is it right that session id is part of the Resource?

It is an attribute that applies to all telemetry coming out of the application. It does not change from signal to signal, nor is it scoped to a specific instrumentation. I don't think there is any other place it could go than the resource level (given the current data model).

Why do we need to introduce anything called "ephemeral attributes"?

I think this is an attempt to alleviate the contract between OTel sources and destinations. If there is a real reason that backends need to have an immutable set of resource attributes per application instance, then this would make it possible by defining in the semantic conventions which attributes are permanent and which can change.

We assumed that the only reason backends would be relying on this contract is if they were doing something like hashing all the resource attributes (e.g. to identify the instance). Yes, this would force these backends to be updated, but it would provide them with a way to continue using the hashing. Also, since the TracerProvider can be recreated within the same application instance, defining which attributes are permanent or ephemeral is just making it explicit.
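To make the hashing point concrete, here is a minimal sketch of a backend deriving an instance key from the permanent attributes only. The list of ephemeral keys is an assumption for illustration (the idea above is that the semantic conventions would define it), and `instanceKey` and `fnv1a` are hypothetical helper names. The hash is a simple FNV-1a style function over UTF-16 code units, chosen only to keep the example dependency-free.

```javascript
// Assumption: in practice this set would come from the semantic conventions.
const EPHEMERAL_KEYS = new Set(['session.id', 'enduser.id']);

// Small non-cryptographic string hash (FNV-1a style, 32-bit), demo only.
function fnv1a(text) {
  let hash = 0x811c9dc5;
  for (let i = 0; i < text.length; i++) {
    hash ^= text.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193) >>> 0;
  }
  return hash.toString(16).padStart(8, '0');
}

// Hash only the permanent attributes, in a stable key order.
function instanceKey(attributes) {
  const permanent = Object.keys(attributes)
    .filter((key) => !EPHEMERAL_KEYS.has(key))
    .sort()
    .map((key) => `${key}=${attributes[key]}`);
  return fnv1a(permanent.join('\n'));
}

const firstSession = instanceKey({
  'service.name': 'shop', 'device.id': 'phone-1', 'session.id': '1',
});
const secondSession = instanceKey({
  'service.name': 'shop', 'device.id': 'phone-1', 'session.id': '2', 'enduser.id': 'superadmin',
});
const otherDevice = instanceKey({
  'service.name': 'shop', 'device.id': 'phone-2', 'session.id': '1',
});
```

Here `firstSession` and `secondSession` produce the same key because they differ only in ephemeral attributes, while `otherDevice` gets a different one.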

Aneurysm9 commented 2 years ago

Is it right that session id is part of the Resource?

It is an attribute that applies to all telemetry coming out of the application. It does not change from signal to signal, nor is it scoped to a specific instrumentation. I don't think there is any other place it could go than the resource level (given the current data model).

I'm not sure I see it the same way. Does it truly apply to all telemetry coming out of the application? Is it not possible for the same application instance to have two sessions active? Doesn't the fact that it can change while the application is running necessarily mean that it does not apply to all telemetry? Yes, the "session ID" attribute as a concept does, but not any given value. That is different from all other resource attributes.

As for not being scoped to a specific instrumentation, it is akin to the trace ID in that it can be used for correlation of signals. How would it be useful with distribution metrics? Do I really care to have a timeseries for every user session to track load times, or do I want to have a more general metric that has exemplars pointing at potentially interesting sessions?

As for where else it could go, it could certainly be added as a scope attribute. This would require a bit more bookkeeping on the part of the instrumentor to keep a map of sessions to tracers, etc., or to store them in session-scoped storage, but is feasible. More appropriate, perhaps, would be in the context, where it would be available to all signals. Propagation across process boundaries to allow for correlation (I assume a session can be serviced by application elements that are outside of the immediately user-facing process) is still an issue. I think, though, that this all reinforces my belief that session ID and trace ID are synonymous and that sessions are simply long traces. Do we really need a new concept, and to contort ourselves to find ways to claim that we're not breaking compatibility with a stable specification, to handle something that the existing concepts can already handle?

t2t2 commented 2 years ago

Note: I originally started this as part of response to https://github.com/open-telemetry/opentelemetry-specification/issues/2500#issuecomment-1249387117 but the first section ended up being more related to this otep being stuck, so here it is!


Let's eliminate the confusion of what a session means for a bit. There are some other attributes that:

  1. are good candidates for the resource level
  2. have a value that can change over time
  3. as a concept, are more familiar from the backend service / APM kind of usage that current otel contributors know a lot better

Let's bring in enduser.id

Currently it's defined as a span-level identifying attribute, which, hey, makes total sense in a server-side environment: one server can serve all of the users of the application. If you want the enduser that caused a request to be set on all of the child spans, then yes, context makes a lot of sense, since the entire server isn't dedicated to just one enduser. Anyway, that covers my 3rd condition.

Let's jump to the client side. I open up a local food delivery app, and it's instrumented to generate telemetry. Alright, what are the resource attributes? Well, you've got

I try to order something, but suddenly the app runs into a bug and crashes. Smash cut: a support person is messaging the devops team, "hey, got this guy going crazy over not being able to order, can you figure out what's going on there, why his app keeps crashing". Devops looks up data based on my name and sees, in the tracing spans, an attempt to order 100 kebabs that caused a KebabStackOverflow in the logs. Really, this paragraph is only here to be referenced later while still keeping a linear timeline for the domain knowledge story.

Somehow you're next to me and mention that you uninstalled the app a while ago due to constant crashing, and now have a "please come back" discount on your account. I hand my phone to you; you log my account out and log in with your account, and manage to order successfully after choosing a more reasonable order size.

Now the logged in account has changed, so if the logged in account is in resource, the above telemetry should be over 3 resources: Data from me, data from anonymous user, data from you. (so now we've fulfilled condition 2)


Other than local food delivery app, some other examples:

But other potential attributes:

So I think something people who haven't built a RUM need to consider is that a major difference between backend services and client side apps is that apps have (a lot more) state. A lot of this state is global (not scoped to parts of the app like within one request that is forgotten once the request is fulfilled), it changes over time (due to time, user interactions, or completely external actions) and in a lot of the cases it's useful or needed to assist in debugging using gathered telemetry (who, what device, what screen/url, what isp, geolocation)


A lot of these attributes are also what you'd want to query data by. Already mentioned looking up data based on app user info, but let's consider some of the RUM use cases:

These add considerations for efficient data ingest and storage. Now every vendor will probably have different opinions on this based on how they use and store data. In July I got some knowledge from @mdubbyap on splunk/signalfx ingest side about our use cases (and probably should have used this knowledge earlier so I don't accidentally misremember it but oh well Ted's been on vacation anyway so it wouldn't have helped move this forward):

For ingest, the best case is to minimise the number of bytes that need to be read in order to determine where to pipe the data (be it partitioning, buckets or whatever optimises your infra). The worst case is having to read deep enough to get into each span/log/metric and check its attributes for the value. If it's a value on the resource, ingest only needs to read the resource's bytes before determining where to send the data, without parsing the rest of the payload. (Since we focus on showing session experience, then obviously for us the session.id attribute is 👀👀)
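A small sketch of the difference, under stated assumptions: the payload shape is a simplified version loosely modelled on OTLP/JSON (attribute values are plain strings here), in-memory objects stand in for wire decoding, and `routeByResource` / `routeBySpans` are hypothetical helper names rather than anything from a real ingest pipeline.

```javascript
// Look up one attribute in a list of { key, value } pairs.
function findAttribute(attributes, key) {
  const hit = attributes.find((attr) => attr.key === key);
  return hit ? hit.value : undefined;
}

// Resource-level routing: one lookup per batch, spans are never inspected.
function routeByResource(resourceSpans, key) {
  return findAttribute(resourceSpans.resource.attributes, key);
}

// Span-level routing: every span is visited, and one batch may fan out
// to several destinations.
function routeBySpans(resourceSpans, key) {
  const targets = new Set();
  for (const scope of resourceSpans.scopeSpans) {
    for (const span of scope.spans) {
      targets.add(findAttribute(span.attributes, key));
    }
  }
  return [...targets];
}

// Batch where the session is recorded once, on the resource.
const resourceLevel = {
  resource: { attributes: [{ key: 'session.id', value: 's-1' }] },
  scopeSpans: [{ spans: [{ name: 'load', attributes: [] }, { name: 'click', attributes: [] }] }],
};

// Batch where the session is recorded on every span instead.
const spanLevel = {
  resource: { attributes: [{ key: 'service.name', value: 'shop' }] },
  scopeSpans: [{
    spans: [
      { name: 'load', attributes: [{ key: 'session.id', value: 's-1' }] },
      { name: 'click', attributes: [{ key: 'session.id', value: 's-2' }] },
    ],
  }],
};

const oneTarget = routeByResource(resourceLevel, 'session.id');
const manyTargets = routeBySpans(spanLevel, 'session.id');
```

With a real binary encoding the saving would come from not decoding the span section at all, which this in-memory version can only hint at.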


Fulfilling legal requirements can also be easier if these attributes are easy to read, e.g. indexing data by enduser info makes it easier to delete data on user request (such as under GDPR).

Also linking https://github.com/open-telemetry/opentelemetry-specification/issues/2775, as it has gone into the topic of descriptive versus identifying attributes, which has been one of the reasons against this otep so far.

scheler commented 1 year ago

Hi, wanted to give an update on this topic, since some of us from the client-side-telemetry SIG have asked a few TC members to help us on the topic further. Copying the message that @jack-berg posted on slack -

Oberon00 commented 1 year ago

@scheler

We think you should pursue a strategy where these attributes are set in context, and lifted out of context onto the individual records in a custom SpanProcessor / LogRecordProcessor

This is what I proposed in OTEP #207 to be a blessed concept with its own API, by the way. But as you said, it is in principle implementable today.
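For readers unfamiliar with the pattern, a minimal sketch of "set in context, lifted onto records by a processor" might look like the following. The `Context` class and the processor are stand-ins that only mimic the general shape of an immutable context and a span processor with an `onStart(span, parentContext)` hook. They are not the real OpenTelemetry JS types, and `CLIENT_ATTRIBUTES` and `startSpan` are hypothetical names.

```javascript
// Stand-in for an immutable context: setValue returns a new Context.
class Context {
  constructor(values = new Map()) {
    this.values = values;
  }
  getValue(key) {
    return this.values.get(key);
  }
  setValue(key, value) {
    const next = new Map(this.values);
    next.set(key, value);
    return new Context(next);
  }
}

// Hypothetical context key under which client attributes are stored.
const CLIENT_ATTRIBUTES = Symbol('client-attributes');

// Copies whatever client attributes are in the context onto each new span.
class ContextAttributesSpanProcessor {
  onStart(span, parentContext) {
    const attributes = parentContext.getValue(CLIENT_ATTRIBUTES) ?? {};
    for (const [key, value] of Object.entries(attributes)) {
      span.attributes[key] = value;
    }
  }
  onEnd(_span) {}
}

// Tiny stand-in for the SDK creating a span and notifying the processor.
function startSpan(name, context, processor) {
  const span = { name, attributes: {} };
  processor.onStart(span, context);
  return span;
}

const processor = new ContextAttributesSpanProcessor();
let context = new Context().setValue(CLIENT_ATTRIBUTES, { 'session.id': '1' });
const first = startSpan('load', context, processor);

// Later: the user logs in and a new session starts.
context = context.setValue(CLIENT_ATTRIBUTES, { 'session.id': '2', 'enduser.id': 'superadmin' });
const second = startSpan('checkout', context, processor);
```

The trade-off relative to the resource approach is that the values end up on each individual record rather than once per batch, which is the ingest cost discussed earlier in the thread.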

tedsuo commented 1 year ago

Closing this in favor of a new proposal coming from the RUM/Client group.