open-telemetry / oteps

OpenTelemetry Enhancement Proposals
https://opentelemetry.io
Apache License 2.0

Telemetry Schema and Resource #161

Open jsuereth opened 3 years ago

jsuereth commented 3 years ago

Specify a mechanism for users of Resource to interact with Resource, leveraging the TelemetrySchema version for a stable API.

Goals of this OTEP:

MadVikingGod commented 3 years ago

Some general questions I have about this proposal:

  1. Does this need to be part of the SDK?
  2. Do we expect this to be used by most users of the SDK? I can see the use in the collector and in backends, but other than resources, what does an end-user application or an instrumentation library get from this feature?
  3. Could this be accomplished outside of the SDK, or maybe as a feature of the collector?
  4. Does that mean that URLs defined by the spec are the only valid ones? What if I want to make a convention for Digital Ocean that isn't part of the current specs?
    • What does newer mean in such a world? Is otel v1.4.0 newer than Oracle v1.1.2?
    • What does compatible mean in such a world? Is otel v1.4.0 compatible with Oracle v1.1.2?
    • How does one convert from Otel v1.4.0 to Oracle v1.1.2?
  5. Does this mean that wherever this code lives (it's proposed for the SDK, but see 1) we have to have some way to process these conversions? This has lots of implications, like:
    • How does it do forward conversions for specs after the current version was released?
    • How do we limit the amount of code we introduce by updating the conventions?
    • How do we know what supported conversions are available for the current library?

tigrannajaryan commented 3 years ago

Do we expect this to be used by most users of the SDK? I can see the use in the collector and in backends, but other than resources, what does an end-user application or an instrumentation library get from this feature?

It will be used by the SDK itself to merge resources correctly. It can also be used by external code to do schema translation (e.g. the Collector or a backend).

Could this be accomplished outside of the SDK, or maybe as a feature of the collector?

Schema conversion, once the data leaves the SDK, can indeed be done by the Collector (and it is a planned feature). However, the merging of resources must happen in the SDK; there does not seem to be a way to avoid that.
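To illustrate why the schema version matters at merge time, here is a minimal sketch. The `Resource` class, the `SchemaConflict` exception, and the choice to treat two different non-empty schema URLs as an error are all assumptions made for illustration; real SDKs may handle that case differently, and the OTEP under discussion is about doing something better than failing.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass(frozen=True)
class Resource:
    """Hypothetical, simplified stand-in for an SDK Resource."""
    attributes: Dict[str, str] = field(default_factory=dict)
    schema_url: Optional[str] = None


class SchemaConflict(Exception):
    """Raised when two resources declare different, non-empty schema URLs."""


def merge(old: Resource, updating: Resource) -> Resource:
    # Attributes from the updating resource win on key collisions.
    attributes = {**old.attributes, **updating.attributes}

    if not old.schema_url:
        schema_url = updating.schema_url
    elif not updating.schema_url or updating.schema_url == old.schema_url:
        schema_url = old.schema_url
    else:
        # Assumption: without schema translation there is no safe way
        # to combine attributes written against different schemas.
        raise SchemaConflict(f"{old.schema_url} != {updating.schema_url}")

    return Resource(attributes, schema_url)
```

The last branch is where a schema-aware merge could instead translate one side to the other's version before combining.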

Does that mean that URLs defined by the spec are the only valid ones? What if I want to make a convention for Digital Ocean that isn't part of the current specs?

Yes, you can have your own schema that is not the Otel schema. You will need to choose a schema family URL and publish the schema files for each version you release.

What does newer mean in such a world? Is otel v1.4.0 newer than Oracle v1.1.2?

They are different schema families, not comparable. "Newer" only makes sense for schemas that belong to the same family.

What does compatible mean in such a world? Is otel v1.4.0 compatible with Oracle v1.1.2?

No, not compatible, since they are different families.

How does one convert from Otel v1.4.0 to Oracle v1.1.2?

You can't.

Note: I have planned a concept of parent schemas for the future. This would allow deriving the Oracle schema from the Otel schema and, in that case, possibly allow conversion. But it is not part of the current Schemas concept yet.
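The "same family" rule can be sketched as follows. This assumes, as I understand the Schemas OTEP, that a schema URL ends in a version number and that everything before it identifies the family. The helper names and the Oracle URL are made up for illustration.

```python
from typing import Optional, Tuple


def split_schema_url(url: str) -> Tuple[str, Tuple[int, ...]]:
    """Split a schema URL into (family, version).

    Assumption: the last path segment is a dotted numeric version and
    the rest of the URL identifies the schema family.
    """
    family, _, version = url.rstrip("/").rpartition("/")
    return family, tuple(int(part) for part in version.split("."))


def is_newer(a: str, b: str) -> Optional[bool]:
    """Return True/False if a is newer than b, or None if not comparable."""
    family_a, version_a = split_schema_url(a)
    family_b, version_b = split_schema_url(b)
    if family_a != family_b:
        # Different families: "newer" has no meaning here.
        return None
    return version_a > version_b


otel_14 = "https://opentelemetry.io/schemas/1.4.0"
otel_13 = "https://opentelemetry.io/schemas/1.3.0"
oracle = "https://example.com/oracle/schemas/1.1.2"  # hypothetical URL

print(is_newer(otel_14, otel_13))
print(is_newer(otel_14, oracle))
```

Versions are compared as integer tuples rather than strings so that 1.10.0 sorts after 1.9.0.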

How do we limit the amount of code we introduce by updating the conventions?

No code needs to be introduced when you update the conventions. The schema conversion code is capable of generically handling changes to conventions. The rules defined in the schema file drive this generic schema conversion code.

MadVikingGod commented 3 years ago

No code needs to be introduced when you update the conventions. The schema conversion code is capable of generically handling changes to conventions. The rules defined in the schema file drive this generic schema conversion code.

I'm trying to understand how an actual implementation would do this. Does this mean that we are going to either embed the schema file or fetch it when needed? And would the code written for this then be an interpreter of the schema that dynamically converts one version to the next?

tigrannajaryan commented 3 years ago

No code needs to be introduced when you update the conventions. The schema conversion code is capable of generically handling changes to conventions. The rules defined in the schema file drive this generic schema conversion code.

I'm trying to understand how an actual implementation would do this. So does this mean that we are going to either embed the schema file or go fetch the schema file when needed?

Both are valid approaches.

Then the code that would be written for this would be an interpreter of the schema to dynamically convert one version to the next?

Yes, exactly.

If you haven't read the Schemas OTEP, it may be useful to do so, since it answers some of your questions: https://github.com/open-telemetry/oteps/blob/main/text/0152-telemetry-schemas.md
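A rough sketch of what "an interpreter of the schema" could look like, restricted to attribute renames. The in-memory `RENAMES_BY_VERSION` table stands in for rules that would be parsed from an embedded or fetched schema file; its shape, the attribute names, and the `upgrade` function are hypothetical and much simpler than the real file format.

```python
from typing import Dict, Tuple

Version = Tuple[int, ...]

# Hypothetical, already-parsed rules: for each version, the attribute
# renames that version introduced relative to the previous one.
RENAMES_BY_VERSION: Dict[Version, Dict[str, str]] = {
    (1, 1, 0): {"old.name": "new.name"},
    (1, 2, 0): {"new.name": "newer.name", "other.old": "other.new"},
}


def upgrade(
    attributes: Dict[str, str], from_version: Version, to_version: Version
) -> Dict[str, str]:
    """Apply every rename introduced after from_version, up to to_version."""
    result = dict(attributes)
    for version in sorted(RENAMES_BY_VERSION):
        if from_version < version <= to_version:
            renames = RENAMES_BY_VERSION[version]
            # Keys without a rename rule pass through unchanged.
            result = {renames.get(key, key): value for key, value in result.items()}
    return result


print(upgrade({"old.name": "x", "keep": "y"}, (1, 0, 0), (1, 2, 0)))
```

Adding a new schema version here means adding data to the table, not code, which is the point made above. Handling versions released after the library was built would still require obtaining the newer rules at runtime.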

jsuereth commented 3 years ago

@MadVikingGod

Some general questions I have about this proposal: Does this need to be part of the SDK?

Yes, we need this for GCP exporters.

Do we expect this to be used by most users of the SDK? I can see the use in the collector and in backends, but other than resources, what does an end-user application or an instrumentation library get from this feature?

SDK != End users. Instrumentation library = API. This needs to be in the SDK for exporters, and it's a code-only thing. End users should be interacting MOSTLY via configuration or 'code configuration' on the SDK.

Could this be accomplished outside of the SDK, or maybe as a feature of the collector?

We also need this in the collector. This is an interface that should be provided to consumers of Resource in the SDK.

Does that mean that URLs defined by the spec are the only valid ones? What if I want to make a convention for Digital Ocean that isn't part of the current specs? What does newer mean in such a world? Is otel v1.4.0 newer than Oracle v1.1.2? What does compatible mean in such a world? Is otel v1.4.0 compatible with Oracle v1.1.2? How does one convert from Otel v1.4.0 to Oracle v1.1.2?

I think Tigran responded to most of these.

Does this mean that wherever this code lives (it's proposed for the SDK, but see 1) we have to have some way to process these conversions? This has lots of implications, like: How does it do forward conversions for specs after the current version was released? How do we limit the amount of code we introduce by updating the conventions? How do we know what supported conversions are available for the current library?

Totally agree these are good questions. I see this migration being a temporary solution to prevent ecosystem breakage as everyone adopts the latest telemetry schema. Let's say a user is using the following components:

If we take the approach we have today for Resources, it means I need to upgrade them ALL in lockstep for correct behavior. To mitigate this, we've kept Resource detection components + GCP exporters in our own repositories and release them together, so it's easier to know if you have an issue.

However, if folks use any Resource Detector component other than the one we provide, even if it abides by semantic conventions, we can't tell whether we're compatible. Renaming attributes is silent breakage to some extent, and it is worse if the exporter is unable to declare which version of the schema it was using.
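A small sketch of the silent breakage described here, and of how a declared schema version could turn it into a visible error. The attribute keys, `EXPORTER_SCHEMA_URL`, and both functions are hypothetical and not taken from any real exporter.

```python
from typing import Dict, Optional

# Assumption: the exporter was written against this schema version.
EXPORTER_SCHEMA_URL = "https://opentelemetry.io/schemas/1.4.0"


def zone_label(attributes: Dict[str, str]) -> Optional[str]:
    # The exporter only knows the key used by its own schema version.
    # "example.zone" is a made-up attribute name.
    return attributes.get("example.zone")


def check_schema(resource_schema_url: Optional[str]) -> None:
    # With a declared schema, a mismatch can be detected instead of ignored.
    if resource_schema_url != EXPORTER_SCHEMA_URL:
        raise ValueError(
            f"resource uses {resource_schema_url!r}, "
            f"exporter expects {EXPORTER_SCHEMA_URL!r}"
        )


# A detector built against a later version that renamed the attribute:
detected = {"example.availability_zone": "zone-a"}

# Silent breakage: no error is raised, the label is simply missing.
print(zone_label(detected))
```

Calling `check_schema` with the detector's schema URL before export would raise rather than drop the label; a schema-aware exporter could translate the resource at that point instead of failing.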

MadVikingGod commented 3 years ago

So, let me start off by saying that I like the ideas presented here, but I think a few assumptions have been made that aren't stated, and it's a bit light on how external conventions interact with the conventions discussed.

And as I think more about this, I believe we are solving two separate issues: how to merge Resource conventions and how to merge Telemetry conventions. They are both the same problem, but where we can solve them is different, and how we should solve them might be different as well. We might want to split the issue of resource migration and telemetry migration into different proposals.

The major concern right now is with usability. If I want to guarantee any single convention outside of the current semantic conventions, anything that uses it cannot be mixed in with other data. This means that even if my convention is a superset of a semconv, there is no way to upgrade other libraries to my convention. Nor can I downgrade to a different family of conventions.

The other large concern is that this seems to be leading us to dynamic processing of arbitrary schemas. While this can be made safe, doing so is very challenging, and I would strongly recommend against it.

SDK != End users. Instrumentation library = API.

Actually, the end user is the only one who gets to choose the SDK and is the only entity guaranteed to have it at all. As I've been asking, especially for the case of telemetry semconvs, I don't think we need it in the SDK, and adding it for an exporter that may live in contrib seems like the wrong way to reason about this.

For the resource portion, the user/library writers will know which conventions will be used when they write the application. I think we shouldn't have any complicated logic around transforming those, but should instead push for a way for an application author to know:

  1. At compile time that different conventions that may be used aren't stepping on each other.
  2. At upgrade time that taking a newer version doesn't mask something they have done.

For the telemetry portion, I think a stand-alone library is much more appropriate. SDKs don't provide any ingest logic and this is an ingest problem.

jsuereth commented 3 years ago

@MadVikingGod You state:

For the telemetry portion, I think a stand-alone library is much more appropriate. SDKs don't provide any ingest logic and this is an ingest problem.

I'd disagree with the latter part. Exporters are ingestion adapters, and as such this is their problem, and they live as SDK components. I'd be ok if this were a (stable) library an exporter can rely on, but it needs to be baked in.

tigrannajaryan commented 3 years ago

@jsuereth do you plan to continue working on this OTEP?

jsuereth commented 3 years ago

@tigrannajaryan Yes, tabled it temporarily to work more directly with metrics. I still think this is an unresolved issue w/ semantic conventions + telemetry.

tigrannajaryan commented 3 years ago

@tigrannajaryan Yes, tabled it temporarily to work more directly with metrics. I still think this is an unresolved issue w/ semantic conventions + telemetry.

@jsuereth I believe we need this change. It evolves the schemas in the right direction.

tigrannajaryan commented 2 years ago

@jsuereth do you plan to continue working on this OTEP?

tedsuo commented 1 year ago

@jsuereth I assume we still need this OTEP?

jsuereth commented 1 year ago

@tedsuo I think we still need it, but I've put the effort on hold to address higher-priority items. I think we still have time to revisit this once semconv is more stable, and this needs two more approvers.

tedsuo commented 1 year ago

@jsuereth we are cleaning up stale OTEP PRs. If there is no further action at this time, we will close this PR in one week. Feel free to open it again when it is time to pick it back up.