open-telemetry / opentelemetry-specification

Specifications for OpenTelemetry
https://opentelemetry.io
Apache License 2.0

Adding resource attributes post-creation (e.g. via auto-discovery) #1298

Open vovencij opened 3 years ago

vovencij commented 3 years ago

What are you trying to achieve?

I would like the possibility to add resource attributes via some auto-discovery mechanism, where attribute values are not known at Resource creation time but only become known during application initialization. With auto-instrumentation, the OTel SDK is created very early in the process runtime, so some useful information about the process may not be accessible yet.

An example would be discovering an app server's name and version (https://github.com/open-telemetry/opentelemetry-specification/issues/1143) via instrumentation, rather than setting these values through the environment. Even without instrumentation, discovering and setting an application's version (or that of whatever component makes sense at the Resource level) at runtime is much easier than obtaining it before the application starts, as is needed when setting attributes via the environment.
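For reference, this is how resource creation looks today (a minimal sketch assuming the opentelemetry-java SDK; class and method names are from that SDK, the attribute values are made up):

```java
import io.opentelemetry.api.common.Attributes;
import io.opentelemetry.sdk.resources.Resource;
import io.opentelemetry.sdk.trace.SdkTracerProvider;

// Today the Resource must be fully assembled before the TracerProvider is built;
// anything discovered later (e.g. the app server name/version) has nowhere to go.
final class StaticResourceSetup {
  static SdkTracerProvider build() {
    Resource resource = Resource.getDefault().merge(
        Resource.create(Attributes.builder()
            .put("service.name", "my-service")
            .put("service.version", "1.2.3") // must already be known at this point
            .build()));
    return SdkTracerProvider.builder().setResource(resource).build();
  }
}
```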

Oberon00 commented 3 years ago

I think there are several possible routes we could take here:

Oberon00 commented 3 years ago

Note, this was also briefly discussed in https://github.com/open-telemetry/opentelemetry-specification/pull/515#discussion_r393752660

anuraaga commented 3 years ago

I've always wished Resource were an interface, to allow implementing a dynamic one. Some things are resolved lazily, or change very rarely but not never; being able to use Resource for them would be nice.
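Something like this hypothetical shape (a sketch only; Resource is a concrete, immutable class today, and the webengine.* keys are just illustrative):

```java
import io.opentelemetry.api.common.Attributes;

// Hypothetical sketch only -- if Resource were (or exposed) an interface,
// an implementation could resolve some attributes lazily.
interface DynamicResource {
  // Snapshot of the attributes as currently known; may gain entries over time.
  Attributes getAttributes();
}

final class LazyAppServerResource implements DynamicResource {
  private volatile Attributes attributes = Attributes.empty();

  // Called by instrumentation once the app server has been identified.
  void onAppServerDetected(String name, String version) {
    attributes = Attributes.builder()
        .put("webengine.name", name)
        .put("webengine.version", version)
        .build();
  }

  @Override
  public Attributes getAttributes() {
    return attributes;
  }
}
```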

Oberon00 commented 3 years ago

Let's try not to get too much into implementation details of how to allow changing resources (interface / struct / sync / async / auto-detected / explicit) at first, and instead discuss the change in semantics and the impact on backends/collectors/maybe exporters.

I think this has the potential to break assumptions and optimizations. The resource is supposed to capture identifying information about the entity (runtime/process/container/...) producing telemetry. It would be very annoying if the backend identified the resource as belonging to different entities over the lifetime of a telemetry producer just because some identifying information is missing.

For example, a backend might have an entity describing a particular WildFly cluster, which may consist of multiple instances. If the cluster name is only detected later, any spans reported earlier would be mapped to the wrong entity, or otherwise the backend would need to retroactively move previously reported spans to a different entity. Same thing applies to the faas.id (AWS ARN), host.name, ...

anuraaga commented 3 years ago

Hmm - I've always considered Resource as a means of compression. Instead of sending the same attribute over and over, just send it once. Similar to gzip but doesn't take extra resources to achieve since it's in the protocol.

I never considered Resource semantically and didn't really expect to. There's no unique ID on it (well, maybe this was the reason for the service ID I didn't understand well? 😅). So I figured it's just a bag of attributes, and that keeps things simple IMO. Yes, if for some reason some spans got sent before the resource was resolved, the bag would have fewer attributes, but I don't know if that could really have any worse effect on the backend than the loss of data, which was inevitable since the data didn't exist at the time.
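To make the "compression" view concrete, here is a rough sketch (assuming the opentelemetry-java SpanData API) of how an exporter can group finished spans by their shared Resource so the attributes are serialized only once per group instead of once per span:

```java
import io.opentelemetry.sdk.resources.Resource;
import io.opentelemetry.sdk.trace.data.SpanData;

import java.util.Collection;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

final class ResourceGrouping {
  // Each distinct Resource appears once as a map key; its attributes only need
  // to be written once per batch, no matter how many spans share it.
  static Map<Resource, List<SpanData>> groupByResource(Collection<SpanData> spans) {
    return spans.stream().collect(Collectors.groupingBy(SpanData::getResource));
  }
}
```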

iNikem commented 3 years ago

As an example, we already have the host.name resource attribute. And this can easily change during process runtime.

Oberon00 commented 3 years ago

I assume you mean the actual value of the hostname may change. The resource attribute will always be what was initially set, won't it?

Oberon00 commented 3 years ago

Yes, if for some reason some spans got sent before the resource was resolved, the bag would have fewer attributes, but I don't know if that could really have any worse effect on the backend than the loss of data, which was inevitable since the data didn't exist at the time.

The loss of data can in theory be prevented at the client side by delaying the sending of spans/metrics (buffering them) until all (expected) resources are available. Whether this is more desirable than data loss is questionable though (What if a problem prevents the resource from ever being detected? Aren't e.g. memory metrics especially interesting also during startup?).

For Dynatrace the effect of the missing attributes is definitely worse. We have a quite entity-centric UI so users will primarily interact with entities like "MyLambdaFunction". If the ARN is missing from a few spans, we will detect a different entity like "AwsLambdaMainClass Java process" for these. This is arguably "wrong data" which is worse than missing data. Experience with customer complaints about similar things tells us that this is something undesirable.

I guess this is more of a problem of the particular attribute's detectability than of the Resource being mutable though. Even if the resource stays mutable, it may be sensible to require some attributes to be set already at SDK startup.

anuraaga commented 3 years ago

Even if the resource stays mutable, it may be sensible to require some attributes to be set already at SDK startup.

I definitely agree with this - it's about semantics; e.g., service.name is so important for most backends that we really need it to be static. But at the same time there are attributes that are not 100% static but mostly static, or that are 100% static only after the first request rather than from startup. I think they should be able to take advantage of our protocol's compression strategy.

Oberon00 commented 3 years ago

I agree. But I think we should distinguish at least the "immutable and known from the beginning" part from the others. I think we should also distinguish "immutable" from "might change".

Another advantage of treating the mutable/delayed attributes as a separate concept is that we stay 100% compatible with backends that e.g. receive OTLP but do not expect the data in the existing resource protobuf field to change.

vovencij commented 3 years ago

I propose to restrict the scope of the initial request to delayed but immutable attributes: attributes which may appear during runtime, but will stay immutable once set.

To make this feature really useful, I see that https://github.com/open-telemetry/opentelemetry-specification/issues/1330 is also required – if a resource is exposed as an API with a setAttribute method, instrumentation can do auto-discovery of resource attributes.
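A hypothetical sketch of what such an API could look like (nothing like this exists in the SDK today; the interface and attribute names are illustrative only):

```java
// Hypothetical API sketch for the proposal above (see issue #1330).
// Attributes could be appended after creation but never changed or removed
// once set ("delayed, but immutable").
interface AppendOnlyResource {
  // Adds a resource attribute discovered after SDK initialization.
  // Implementations would reject overwriting an existing key.
  void setAttribute(String key, String value);
}

// Usage from an auto-discovery hook, e.g. app-server instrumentation:
//   resource.setAttribute("webengine.name", "WildFly");
//   resource.setAttribute("webengine.version", "21.0.0");
```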

iNikem commented 3 years ago

After carefully reading the spec I think that the immutability of Resource is an illusion. And it is certainly not true that all spans from the same process come with the same Resource instance.

First, a Resource is associated with a TracerProvider, and the spec allows for several different TracerProviders in the same process.

Second, the spec is extremely vague on how the Resource gets from the TracerProvider to the SpanExporter. Nothing in the spec right now forbids a SpanProcessor from passing a totally different Resource instance to the SpanExporter.

Third, the OTLP exporter already assumes that there can be several different Resources coming from the same process.

Fourth, the OTLP->Jaeger conversion also demands that "Multiple resources can exist for a single process and exporters need to handle this case accordingly."

Fifth, the OTLP->Zipkin conversion ignores Resource altogether.

All in all, as per the current specification, all OTel-compatible backends already have to deal with several resources coming from the same process without any guarantee about their immutability – including the situation where spans from the same function in the app come with different resource attributes at different times.

Thus I don't see how the current proposal makes the situation any worse.

jmacd commented 3 years ago

@vovencij @iNikem @Oberon00 @anuraaga

Learning from Metrics systems

A similar conversation has taken place in the metrics group, about whether in-process metrics processing pipelines should be able to presume a fixed set of resource attributes. Whether or not the OTLP Exporter is required to support multiple resource attribute sets has perhaps not been stated; I believe @bogdandrutu has in the past said that one SDK should emit one resource, as the resource concept is meant to identify an entity. Comments by @Oberon00 in this thread indicate that Dynatrace is looking to translate whole resources into entities as well.

@iNikem's points 4 and 5,

Multiple resources can exist for a single process and exporters need to handle this case accordingly.

Zipkin conversion ignores Resource altogether.

are echoed by @anuraaga:

I've always considered Resource as a means of compression.

I was trying to answer related questions a year ago in this OTEP issue proposing a "Resource Scope" concept, describing the idea of sub-resource: https://github.com/open-telemetry/oteps/issues/78. My impression is that users would like the flexibility to configure multiple "sub-entities" per process (across OTel signals), first as a matter of programming convenience and second as a matter of performance.

There is an open proposal that OTel SDK should require a service.instance.id attribute, https://github.com/open-telemetry/opentelemetry-specification/issues/1034, that would be a unique identifier for the process entity. If service.instance.id were adopted, we could formalize support for late-binding and/or multiple resources, without losing information about process entities.
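A minimal sketch of that idea, assuming the opentelemetry-java Resource API (the attribute key comes from the open proposal, not a finalized convention):

```java
import io.opentelemetry.api.common.Attributes;
import io.opentelemetry.sdk.resources.Resource;

import java.util.UUID;

// Generate a per-process service.instance.id at SDK startup, so that late-bound
// or multiple resources can still be tied back to the same process entity.
final class InstanceId {
  static Resource withInstanceId(Resource base) {
    return base.merge(
        Resource.create(
            Attributes.builder()
                .put("service.instance.id", UUID.randomUUID().toString())
                .build()));
  }
}
```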

In metrics, there is a requirement that each timeseries identified by unique attributes has a single writer, https://github.com/open-telemetry/opentelemetry-specification/issues/1297. It is unsafe to remove uniquely-identifying attributes without re-aggregating those timeseries.

Should we distinguish Obligatory vs. Optional resource attributes?

In the Prometheus ecosystem, the attributes corresponding to service.name and service.instance.id are job and instance. It begins to look like some resource attributes are Obligatory while others are Optional, since it would be equally unsafe to remove those labels without correctly re-aggregating those timeseries.

In Prometheus, a number of what OTel calls "resources" are available from service discovery, where these attributes are given provisional names (starting with __meta_) that are made available, optionally, in relabeling rules. Whereas the original issue here is about late-binding resource attributes within an SDK, it's common in metrics systems to late-bind resource attributes outside the process.

Metrics vendors also use cloud provider APIs directly to late-bind resource attributes, for example the AWS Resource Groups Tagging API. In order to late-bind resource attributes obtained from cloud providers, there are likely to be other Obligatory resource attributes needed. For example, to use the AWS API mentioned here you must have an "ARN", an Obligatory resource name, in order to late-bind its Optional attributes.

Brief proposal

If every resource attribute included the Obligatory vs. Optional distinction, we might then formalize a semantic convention here, something like resource.from.<scheme> for defined schemes (e.g., "aws-xxx-arn"), with values like partition:service:region:account-id:resource-id. Ideally, cloud providers would provide OpenTelemetry collector plugins to attach these Optional resources for us.

Oberon00 commented 3 years ago

@iNikem

Second, spec is extremely vague on how Resource gets from TracerProvider to SpanExporter. Nothing in the spec right now forbids SpanProcessor to pass totally different Resource instance to SpanExporter.

The SpanProcessor has no say here; the resource is passed from the TracerProvider to the Tracer to the Span, which then gets passed to the exporter, as spec'd here: https://github.com/open-telemetry/opentelemetry-specification/blob/main/specification/trace/sdk.md#additional-span-interfaces

Readable span: A function receiving this as argument MUST be able to access all information that was added to the span, as listed in the API spec. In particular, it MUST also be able to access the InstrumentationLibrary and Resource information (implicitly) associated with the span.

and the Resource spec puts it maybe even more clearly (https://github.com/open-telemetry/opentelemetry-specification/blob/main/specification/resource/sdk.md#resource-sdk):

When used with distributed tracing, a resource can be associated with the TracerProvider when the TracerProvider is created. That association cannot be changed later. When associated with a TracerProvider, all Spans produced by any Tracer from the provider MUST be associated with this Resource.

Note that both of these are already stable.

Oberon00 commented 3 years ago

@jmacd @iNikem

Multiple resources can exist for a single process and exporters need to handle this case accordingly.

I agree that multiple resources can exist via multiple tracer providers, but I don't think that associating the same exporter object instance with multiple tracer providers is necessarily the best way to solve this. In fact, I think we should specify that in general each SpanProcessor must be added to at most one TracerProvider and each Exporter must be added to at most one SpanProcessor. There may be cases where you'd want to deviate from that, but unless you have a particular use case you want to cover I would say keep it simple.
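A sketch of that 1:1:1 wiring, assuming current opentelemetry-java class names (not normative):

```java
import io.opentelemetry.exporter.otlp.trace.OtlpGrpcSpanExporter;
import io.opentelemetry.sdk.trace.SdkTracerProvider;
import io.opentelemetry.sdk.trace.export.BatchSpanProcessor;

// One exporter feeds one BatchSpanProcessor, which is added to exactly one
// SdkTracerProvider, so a single Resource flows through the whole pipeline.
final class Pipeline {
  static SdkTracerProvider build() {
    return SdkTracerProvider.builder()
        .addSpanProcessor(
            BatchSpanProcessor.builder(OtlpGrpcSpanExporter.builder().build()).build())
        .build();
  }
}
```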

@jmacd

My impression is that users would like the flexibility to configure multiple "sub-entities" per process (across OTel signals), first as a matter of programming convenience and second as a matter of performance.

Fully agree, there are long-standing issues that call for something like that, e.g. #335 (I just recently linked that issue to a new issue that was talking about something similar for logs).

There is an open proposal that OTel SDK should require a service.instance.id attribute, #1034, that would be a unique identifier for the process entity.

I would be careful with calling it a "process" entity. In practice it will more likely be an OpenTelemetry entity (telemetry.sdk.instance_id?), at least if it is auto-generated. If e.g. a process hosts multiple runtimes (or isolated Java classloaders) it might be difficult to keep the service.instance.id equal across them. But I think that would still suffice for the use case you outline.

Obligatory vs. Optional resource attributes

I would be happier with the terminology "Identifying" vs. "Descriptive" resource attributes. For example, while service.name might really be considered obligatory and identifying by some, the ARN is definitely optional: most processes won't have an ARN. But if they do have an ARN, it is important for identifying the process.

The AWS Lambda ARN in particular is a difficult case, since it is not available from the beginning but only once the first request to the Lambda comes in (which is potentially never, in the case of provisioned concurrency or initialization failures, etc.), or unless you request it via API. Issue #1261 is about that.

jsuereth commented 3 years ago

@jmacd I want to call on one point in your description that I'd like to discuss:

@iNikem's points 4 and 5,

Multiple resources can exist for a single process and exporters need to handle this case accordingly. Zipkin conversion ignores Resource altogether.

are echoed by @anuraaga:

I've always considered Resource as a means of compression.

From what I understand, Resource is the central concept that unifies telemetry. Resource represents the way we navigate from one item of telemetry to another in our observability system. It's how we can link from Logs => Metrics or Trace => Metrics. As such, I've always viewed it as a "query against an APM's understanding of service/application structure".

When it comes to questions like "Can we allow attributes to be added to a resource later", I think it's a balancing act. Yes, strictly speaking, it will "hurt" the ability to go from a Trace to associated metrics purely by resource information. However, if you think of Resource as the "query" or "lookup" against that other piece of telemetry, then I think it actually is "Ok" for a lot of use cases.

I want to make sure we're all in sync about this meaning for resource in the spec, or adjust my understanding if it's incorrect.

I think the proposal for Optional / Obligatory resources fits within my notion of "can we cross reference between telemetry from the same source".

So, I think I'd add an addendum to your proposal:

I think metrics, by the nature of the solutions built on them, suffer more from incomplete labels than other sources.

Oberon00 commented 3 years ago

Hence my suggestion that we use "identifying" and "descriptive"/"non-identifying". Required (obligatory) is IMHO, if not orthogonal, at least not the same.

jmacd commented 3 years ago

I like @Oberon00's suggested terms, "Identifying" and "Descriptive".

anuraaga commented 3 years ago

From what I understand, Resource is the central concept that unifies telemetry.

Is it only resource though? There are many attributes which I think also are meant for linking telemetry. For example, rpc method seems like one that will be used to jump between signals - I think this is what's considered identifying in this proposal.

On the flip side, Resource has attributes which seem to be descriptive, such as the command line.

So I guess I just can't imagine a backend handling Resource in a special way compared to attributes - finding all identifiers means looking at both, as does finding all the descriptors.

That being said, it probably doesn't affect the proposal that much, which seems to be going in the direction of making delayed resource attributes possible, if I understand correctly. I'm probably missing something big, but does it matter if an attribute is identifying or descriptive?

Oberon00 commented 3 years ago

So I guess I just can't imagine a backend handling Resource in a special way compared to attributes

I think it's easy to imagine. E.g. from a performance perspective, for each batch of spans, you only have to look at the resources once. From a semantic perspective I think it's cleaner to have separate places describing entity and operation/metric/log. From a coding perspective, it would probably be harder to keep e.g. metric labels and span attributes describing the entity emitting telemetry in sync, than just ensuring that the Meter and the TracerProvider have the same resource associated.

Current spec:

A Resource is an immutable representation of the entity producing telemetry [...].
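A minimal sketch of the coding-perspective point above, assuming the current opentelemetry-java builders: sharing a single Resource instance between the tracer and meter providers keeps the entity description in sync without duplicating attributes on every span and metric label set.

```java
import io.opentelemetry.api.common.Attributes;
import io.opentelemetry.sdk.metrics.SdkMeterProvider;
import io.opentelemetry.sdk.resources.Resource;
import io.opentelemetry.sdk.trace.SdkTracerProvider;

final class SharedResourceSetup {
  static void configure() {
    // One Resource object describes the entity for both signals.
    Resource shared = Resource.getDefault().merge(
        Resource.create(Attributes.builder().put("service.name", "my-service").build()));

    SdkTracerProvider tracerProvider =
        SdkTracerProvider.builder().setResource(shared).build();
    SdkMeterProvider meterProvider =
        SdkMeterProvider.builder().setResource(shared).build();
  }
}
```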

jsuereth commented 3 years ago

Responding to @anuraaga

Is it only resource though? There are many attributes which I think also are meant for linking telemetry. For example, rpc method seems like one that will be used to jump between signals - I think this is what's considered identifying in this proposal.

I think the difference here is that Resource is used to identify the "SignalProvider" (TraceProvider/MetricProvider) that created the metric and to link between it and traces produced by the TraceProvider. In that view, RpcMethod likely isn't identifying for this configured Provider.

The point is taken around joining signals being an exercise in labels. I'd say two things are important here:

  1. Resource represents a minimum of joinable identifying labels. You may have better ones.
  2. Resource is guaranteed to exist for a {Signal}Provider in OpenTelemetry and should provide that minimum of joinable identifiers required to reconstitute which telemetry is associated.

I'm not sure #2 is an actuality (I'd like it to be true, but I don't remember the specifics around Resource well enough to know if you can wiggle out of it).

iNikem commented 3 years ago

Thanks everybody for this lovely discussion! As long as the original ask to have delayed attributes is satisfied, I am happy to have only a subset of Resource attributes (the "Descriptive" ones) be delay-able.

Optional attributes should be collected before any "observer" metrics are collected.

But I am suspicious about this proposal by @jsuereth. First, I don't like this "you can delay some attributes but only by this short period of time". Second, IMO it creates an unnecessary dependency on the metrics side of the SDK. But I may be wrong here :)

Oberon00 commented 3 years ago

Optional attributes should be collected before any "observer" metrics are collected.

But I am suspicious about this proposal by @jsuereth. First, I don't like this "you can delay some attributes but only by this short period of time". Second, IMO it creates an unnecessary dependency on the metrics side of the SDK.

I don't think we should place any requirements on descriptive attributes being collected before anything else, but only identifying attributes.

For observers in particular, I think we should leave them out of the first version of the metrics spec altogether (#1433). And if we have them, maybe the first metric export could happen before any observers are queried, either because there are none or because observer collection and metric export are decoupled (#1432).

I can definitely see the use case for "freezing" the identifying resource attributes at some point in time though. There could be some way to signal the Resource SDK "I will deliver some resources later" (delayedAttributePledge = BeginAddDelayedAttributes(maybeWithTimeout)) and correspondingly "now I have them / they won't come after all" (delayedAttributePledge.FulfillWith(myMaybeEmptyDelayedAttributes)). That could be exposed as a property (IsResourceComplete) and/or with a callback (OnResourceCompleted). Exporters might then want to hold off on the first export until OnResourceCompleted has fired.
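A hypothetical sketch of that pledge API, loosely following the names suggested above (no such API exists in any SDK; types and signatures are illustrative only):

```java
import io.opentelemetry.api.common.Attributes;

import java.time.Duration;

// Hypothetical: the SDK is told that more resource attributes will arrive,
// and exporters may hold back the first export until the pledge is fulfilled.
interface DelayedResourceSupport {
  // "I will deliver some attributes later" (optionally bounded by a timeout).
  DelayedAttributePledge beginAddDelayedAttributes(Duration maybeTimeout);

  // True once all pledges have been fulfilled (or timed out).
  boolean isResourceComplete();

  // Callback fired when the resource is complete; exporters could wait for this.
  void onResourceCompleted(Runnable callback);
}

interface DelayedAttributePledge {
  // "Now I have them / they won't come after all" (may be empty).
  void fulfillWith(Attributes delayedAttributes);
}
```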

Oberon00 commented 3 years ago

I just had a realization that current resources can be thought of as a special kind of "never-changing" metrics, and "mutable resource attributes" may be better represented as metrics:

As far as I understand, this issue goes in the direction of relaxing resources from "write once at startup" to being "append-only". But there are some attributes that can change an arbitrary number of times. For example, #1647 brings up mobile networks (signal strength, 2G/3G/4G, current cell), but we also have IP addresses, maybe current location, pod labels, CPU temperature. These have in common that they can change at any time and may affect, i.e. be interesting for interpreting, all spans/metrics emitted during the time they apply. Especially when we look at the last example (CPU temperature), it appears to me that some attributes you might initially think of as resource attributes would indeed be better represented as metrics. I don't know if we have string-valued metrics, but I'm sure one could work around that by using some dummy value of zero/one and a label with the actual value.
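A sketch of that workaround, assuming the opentelemetry-java metrics API (the instrument and attribute names are illustrative, not agreed conventions):

```java
import io.opentelemetry.api.common.Attributes;
import io.opentelemetry.api.metrics.Meter;

import java.util.function.Supplier;

// There is no string-valued metric, so record a dummy value of 1 and carry the
// changing string (here: the current network type) in an attribute.
final class NetworkTypeMetric {
  static void register(Meter meter, Supplier<String> currentNetworkType) {
    meter.gaugeBuilder("host.network.connection")
        .ofLongs()
        .buildWithCallback(
            measurement ->
                measurement.record(
                    1,
                    Attributes.builder()
                        .put("network.connection.type", currentNetworkType.get())
                        .build()));
  }
}
```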

CC @jmacd

CeeBeeCee commented 3 years ago

Something to think about: if we're manually instrumenting, this is not much of an issue. We are able to read env vars, AWS tags, container IDs, etc. before creating the Resource object and then associating it with a TracerProvider. This is more of a problem with auto-instrumentation, where we may not have all the attributes at startup time. Perhaps we need a little leeway here?

One thought that comes to mind is creating APIs as per the semantic conventions for each resource domain (cloud, containers, k8s, os, etc.) and then having implementations for them (cloud:gcp, cloud:aws, os:windows, os:linux, etc.). I see that AWS is doing something along these lines by creating an extension in Java to get and populate resource attributes for almost all their runtimes. Perhaps the auto-instrumentation libraries could take in params at startup indicating the runtime and invoke the necessary implementation?

Example in Java of how this could work: -Dotel.resource.attributes=service.name=myjavaservice -Dotel.cloud.provider=aws -Dotel.cloud.platform=ECS

https://github.com/open-telemetry/opentelemetry-java/tree/main/sdk-extensions/aws/src/main/java/io/opentelemetry/sdk/extension/aws/resource
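A sketch of how such a per-domain implementation could plug in, using the opentelemetry-java autoconfigure ResourceProvider SPI (exact interfaces may differ between SDK versions; the otel.cloud.* property names mirror the illustrative example above and are not an agreed convention):

```java
import io.opentelemetry.api.common.Attributes;
import io.opentelemetry.sdk.autoconfigure.spi.ConfigProperties;
import io.opentelemetry.sdk.autoconfigure.spi.ResourceProvider;
import io.opentelemetry.sdk.resources.Resource;

// Discovered via ServiceLoader at SDK startup; keys off a startup property to
// decide whether this runtime applies before doing any expensive detection.
public final class EcsResourceProvider implements ResourceProvider {
  @Override
  public Resource createResource(ConfigProperties config) {
    String platform = config.getString("otel.cloud.platform");
    if (!"ecs".equalsIgnoreCase(platform)) {
      return Resource.empty();
    }
    // ... query the ECS task metadata endpoint here to fill in container ids etc. ...
    return Resource.create(
        Attributes.builder()
            .put("cloud.provider", "aws")
            .put("cloud.platform", "aws_ecs")
            .build());
  }
}
```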