open-telemetry / opentelemetry-proto

OpenTelemetry protocol (OTLP) specification and Protobuf definitions
https://opentelemetry.io/docs/specs/otlp/
Apache License 2.0

KeyValueList and log attributes with duplicate keys as undefined behavior #533

Closed · pellared closed this 3 months ago

pellared commented 3 months ago

Towards https://github.com/open-telemetry/opentelemetry-specification/issues/3931

Related to https://github.com/open-telemetry/opentelemetry-specification/pull/3938

I think it is not a breaking change as this does not change anything in the encoding. The receivers already have to handle/validate key-values with duplicate keys.

Implementing de-duplication decreases performance. In more detail:

For applications instrumenting their code, performance may matter more than occasionally losing log records that contain duplicates (which are very unlikely).

The OTLP exporters deduplicate key-value pairs before sending the data, to satisfy receivers that require unique keys. Alternatively, the SDKs can provide a processor that deduplicates key-value pairs.

austinlparker commented 3 months ago

While this might not be an explicit breaking change, it seems like it will certainly be an implicit one. Is there no other way to address this?

pellared commented 3 months ago

Then it was changed to the key/value list for performance reasons.

This proposal is also motivated by performance improvements.

the OTel Collector is built on top of the key uniqueness requirement

Is this something that can be changed in the future if needed? In my opinion, it is not necessary for now.

dmitryax commented 3 months ago

This proposal is also motivated by performance improvements

I think the data consistency and deterministic processing across the OpenTelemetry project are more important than the performance degradation of one instrumentation library.

pellared commented 3 months ago

I think the data consistency and deterministic processing across the OpenTelemetry project are more important than the performance degradation of one instrumentation library.

It is not only a performance degradation of one instrumentation library but of a whole signal's (e.g. logs) processing pipeline.

EDIT: The issue is valid not only when duplication actually exists. The SDK simply has to ensure that there are no duplicates. For instance, in the Go Logs API we pass attributes as an array (and optionally a slice) to decrease the number of heap allocations. The SDK would need to use a map to remove duplicated attributes.

MrAlias commented 3 months ago

This proposal is also motivated by performance improvements

I think the data consistency and deterministic processing across the OpenTelemetry project are more important than the performance degradation of one instrumentation library.

There are ~two~ three known languages that need this currently (Go, Rust, and C++). It is unfair to say this is only for a single instrumentation library.

MrAlias commented 3 months ago

It seems like if the expectation is for the sender to de-duplicate the attributes prior to sending the data it should be achievable on the receiver side as well, right?

MrAlias commented 3 months ago

cc @jmacd

pellared commented 3 months ago

@mx-psi @TylerHelmuth @dmitryax

I want to point out that this PR does NOT require making any changes in the Collector if the last (or any) item with the same key is preserved. The main point is simply to allow passing duplicate keys and to make sure this does not break the receiver, as it would allow performance improvements in SDKs.

EDIT: It would even be acceptable if the Collector dropped attributes with duplicated keys, though I cannot imagine that happening 😉

cijothomas commented 3 months ago

This proposal is also motivated by performance improvements

I think the data consistency and deterministic processing across the OpenTelemetry project are more important than the performance degradation of one instrumentation library.

There are ~two~ three known languages that need this currently (Go, Rust, and C++). It is unfair to say this is only for a single instrumentation library.

OTel .NET too. (OTel .NET does not do de-duplication for logs today.)

pellared commented 3 months ago

Added to PR description:

The OTLP exporters can have a configuration to deduplicate key-val pairs before sending the data to satisfy receivers that require them to have unique keys.

mx-psi commented 3 months ago

Thanks for making this more in line with JSON. The new wording allows for a backwards-compatible implementation in the Collector. However, I still don't think we have a satisfactory solution (I don't think there even is one).

Collector components will at most be able to forward duplicate keys; no matter which implementation we pick, components won't be able to interact with duplicate keys in any meaningful way. An application that wants to support duplicate keys meaningfully won't be able to use the Collector as a library for implementing its OTLP server.

There already are a number of applications from vendors and non-commercial FOSS that use the Collector as a library (Jaeger, Grafana Tempo, Datadog Agent, AWS CloudWatch Agent, Elastic Agent, all vendor-specific Collector distros...). None of these applications will be able to interact with these duplicate keys. If a sizeable part of the OpenTelemetry ecosystem is not able to handle this feature correctly at all, what's the point of making this change?

pellared commented 3 months ago

If a sizeable part of the OpenTelemetry ecosystem is not able to use this feature at all, what's the point of making this change?

Because it gives significant performance improvements for the SDKs. It would not require them to handle de-duplication.

mx-psi commented 3 months ago

Personally, I don't think a performance improvement justifies what we may as well call a correctness issue on the Collector side

austinlparker commented 3 months ago

For the sake of discussion, is there any data about the actual performance impact of deduplication in the log pipeline at an SDK level?

Furthermore, the OpenTelemetry model is fundamentally built on blending multiple signals together. While I can certainly believe that existing logging patterns may rely on duplicate keys in log messages, is this a pattern we expect to hold going forward?

jsuereth commented 3 months ago

I'm very much against this change. This breaks a known invariant and forces all backends to deal with the issue. I don't think the benefit outweighs the overall cost.

austinlparker commented 3 months ago

I'm very much against this change. This breaks a known invariant and forces all backends to deal with the issue. I don't think the benefit outweighs the overall cost.

To echo this, my interest in this topic is that it would also have implications for every single vendor or tool that ingests OTLP. The scope of this change is massive, which means the burden of proof needs to be commensurate.

pellared commented 3 months ago

For the sake of discussion, is there any data about the actual performance impact of deduplication in the log pipeline at an SDK level?

Please read the description. There is a benchmark for Rust which indicates a 30% performance improvement.

Side note:

Furthermore, the OpenTelemetry model is fundamentally built on blending multiple signals together. While I can certainly believe that existing logging patterns may rely on duplicate keys in log messages, is this a pattern we expect to hold going forward?

That is a good question. I updated the proto so that it currently affects only logs. But I see that this change may set a precedent for further requests, e.g. to do the same for metrics metadata.

I am changing the PR to draft to communicate that at this point of time I do not plan to push this further.

Thanks everyone for your feedback.

Still, the change in the Logs Data Model would be beneficial even if OTLP required unique keys, as other exporters could benefit from it. I will follow up in https://github.com/open-telemetry/opentelemetry-specification/issues/3931.

pellared commented 3 months ago

I'm very much against this change. This breaks a known invariant and forces all backends to deal with the issue. I don't think the benefit outweighs the overall cost.

I do not understand this statement, as right now it is still possible to send duplicated key-values using OTLP. The backends need to deal with this anyway. The PR just proposes to mark this as undefined behavior, because it is technically possible to send it. Not checking for duplicates and improving performance may matter more for instrumentations than even losing the log records that contain duplicates, especially given that such duplications are rare in the real world and checking for duplicates always incurs overhead.

I decided to reopen the PR as I think it is very coupled to https://github.com/open-telemetry/opentelemetry-specification/pull/3938.

tigrannajaryan commented 3 months ago

The PR just proposes to mark this as undefined behavior, because it is technically possible to send it.

@pellared This is unnecessary. There are many other payloads which are technically possible to send but are wrong because they violate a specified invariant. Unlike in other specifications (e.g. the C++ standard), undefined behavior in the OTLP spec is not explicitly called out. In this spec, a behavior is "undefined" simply by being unspecified.

As an example: we have the AGGREGATION_TEMPORALITY_UNSPECIFIED value of AggregationTemporality. The spec says it must not be used. Technically nothing prevents senders from using it. However, the spec is silent about what happens when receivers see this value: it is undefined behavior. It is implied that the behavior of receivers when they see this value is not known.

pellared commented 3 months ago

Closing as unnecessary.

tigrannajaryan commented 3 months ago

Although this is now closed I would like to comment on it to clarify an important principle that we can use to evaluate change proposals like this.

This change says that a payload that was previously non-compliant is now permitted. It then gives receivers leeway to behave as they wish with such payloads, including rejecting the payload or accepting and processing it in some reasonable way.

I think such an approach is fundamentally incompatible with OpenTelemetry's value of vendor neutrality. Allowing such leeway can result in critical differences in behavior between vendor backends, making it impossible for an OpenTelemetry user to switch vendors as they desire. If one vendor happily accepts your data and another outright rejects it, then I think we are failing the interoperability goal. This is bad and goes against OpenTelemetry principles.

That alone I think is sufficient for me to reject the change.