[Schema] Converting values that cause conflicts

MovieStoreGuy commented 1 year ago

What are you trying to achieve? When building the the schema processor for the collector, it was noted that there is no defined behaviour on what to expect if there is an existing value that would be overridden.

What did you expect to see?

I've currently made the assumption that the schema conversions take priority over existing values and return an error that can be consumed by the collector signify a metric (or log) on how many conflicts were found and report that.

Additional context.

Context: https://github.com/open-telemetry/opentelemetry-collector-contrib/pull/17020#discussion_r1190154309

jsuereth commented 1 year ago

We discussed this in the Semantic Convention WG.

Suggestions from there:

We think this may happen when folks try to emit "multiple" versions of Schema to alleviate migration friction. We don't have hooks in place to deal with this.
We suggest leaving the existing value and NOT overwriting with the change when this scenario occurs.

Specifically, if you have the following attributes:

namespace.attribute1=foo
namespace.attribute2=bar

If you have a rename_attributes of namespace.attribute1: namespace.attribute2, then you would wind up with:

namspace.attribute2=bar

That is, attribute1 does NOT overwrite attribute2, regardless of how the attributes are ordered.

This is specifically in reference to the OTLP generated.

Additionally, you could leave attribute1 in place as well.

tigrannajaryan commented 1 year ago

We suggest leaving the existing value and NOT overwriting with the change when this scenario occurs.

What exactly does this mean? Do you suggest that the input attributes are preserved and the SchemaURL is not updated? What happens with transforms that were already applied? Do you have to roll them back? This is very expensive to implement.

I don't think this is the best approach.

I submitted a PR where I recommend overwriting. IMO this is the best default approach.

Rejecting is complex and expensive. It requires performing transformation on a copy of data since it may be aborted any time or be able to rollback already applied changes. This is very expensive unless you have some very efficient copy-on-write data structure or something like that. I don't think we should do it.

Overwriting is much simpler. You just apply on the original data knowing that there is no need to ever abort the changes. IMO, this should be the philosophy of all schema transforms (existing or future proposals).

I left the door open for implementations that want to carry the burden of much more complex approaches by using SHOULD clause instead of MUST.

tigrannajaryan commented 1 year ago

If you have a rename_attributes of namespace.attribute1: namespace.attribute2, then you would wind up with:
namspace.attribute2=bar
That is, attribute1 does NOT overwrite attribute2, regardless of how the attributes are ordered.

I think this is a bad outcome. The resulting data complies neither with old schema nor with new schema. What do we set the SchemaURL to in this case?

lmolkova commented 1 year ago

What exactly does this mean? Do you suggest that the input attributes are preserved and the SchemaURL is not updated? What happens with transforms that were already applied? Do you have to roll them back? This is very expensive to implement.

We can remove schema URL from invalid telemetry or add a way to signal that we could not apply transformation on a given data. I think having a way to signal about it is essential for this and other similar problems and should be a prerequisite to removing moratorium on transformations.

tigrannajaryan commented 1 year ago

To summarize some options (there is probably more possibilities):

Overwrite

Pros

Likely simplest and most efficient to implement.

Cons

Can lose (overwrite) data if it is not Otel compliant (custom attributes incorrectly using Otel namespaces)

Ignore, Skip

Pros

Similarly simple to implement as Overwrite (maybe a bit more complex to be careful to check existing attributes)

Cons

Can result in output data that is not compliant with any Otel schema, part of attributes complying with old schema, part with new.

Preserve and Overwrite

Before overwriting preserve the value of the conflicting attribute in some other attribute (can be a bag of attributes like that or can have some renaming rules).

Pros

Also reasonably simple to implement (a bit more complex than Overwrite).
No data loss.

Cons

May not be immediately visible where the non-compliant is moved (custom attributes incorrectly using Otel namespaces)

[UPDATE 2023-05-17] A possible way to preserve the conflicting attribute is to rename it and append a suffix equal to the source schema version number. For example when converting the data from 1.0.0 to 1.1.0 we would append a suffix 1.0.0 to the attribute name. Assuming 1.0.0->1.1.0 renames host to host.name and we have the following input data:

{
  "attributes": { "host":  "spool", "host.name": "spool.example.com" }
}

The output data will become:

{
  "attributes": { "host.name":  "spool", "host.name_1.0.0": "spool.example.com" }
}

This approach allows applying multiple version transformations and preserve all conflicting attributes such that it is also captured in a human-readable form which "previous" version of the schema the original data was when the conflict was detected and thus can be a useful hint for understanding the supposed semantics of the original data.

Another possible way to preserve the conflicting attributes would be to put them all as sub-keys under some well-known attribute, however it is currently not allowed to have nested attributes (see https://github.com/open-telemetry/opentelemetry-specification/issues/376)

Detect and Reject/Rollback

Pros

Guarantee that output data is either not touched or converted to new schema without any data loss

Cons

Hard and/or inefficient to implement. Need a data structure that can be rollbacked or need to work on a copy, etc.

lmolkova commented 1 year ago

Ignore, Skip option with a way to notify about broken data could be a first step that does not seem to block supporting Detect and Reject/Rollback in future

tigrannajaryan commented 1 year ago

We can remove schema URL from invalid telemetry or add a way to signal that we could not apply transformation on a given data.

I think that's worse than overwriting a non-compliant attribute. To me the offender who incorrectly uses a custom attribute name should be punished, instead we chose for a collateral damage and loose the information about schema :-)

lmolkova commented 1 year ago

To me the offender who incorrectly uses a custom attribute name should be punished

An application developer (e.g. an early adopter) who has added an attribute that is now used in their alerts, dashboards and business reports should have it all broken because now some automation in their vendor ingestion pipeline decided schema is more important?

tigrannajaryan commented 1 year ago

Ignore, Skip option with a way to notify about broken data could be a first step that does not seem to block supporting Detect and Reject/Rollback in future

A combination of Overwrite+Detect is also possible. Detect that a conflicting attribute exists, rename it to something that preserves it and then overwrite. This way there is no loss of data, but we still forcefully make the data compliant with the target schema version.

lmolkova commented 1 year ago

renaming an attribute set by application for the sake of compliance with spec and breaking their tooling does not look reasonable.

lmolkova commented 1 year ago

If some important UX depends on compliance with schema, such users will suffer from not getting this experience and will go and fix it themselves.

tigrannajaryan commented 1 year ago

To me the offender who incorrectly uses a custom attribute name should be punished

An application developer (e.g. an early adopter) who has added an attribute that is now used in their alerts, dashboards and business reports should have it all broken because now some automation in their vendor ingestion pipeline decided schema is more important?

Yes, I think so. The spec clearly says how to use custom attributes and if you follow the rules nothing like that can happen (I don't see what does it have to do with being an early adopter). If they don't follow the rules there are consequences.

Obviously, this is just one opinion, if we don't find one good approach that we all like then perhaps this needs to be a configurable option.

tigrannajaryan commented 1 year ago

Also, a reminder to make sure we are aware of the context of the discussion: the data loss is only applicable to use-cases where the data is actually transformed before storing, e.g. in a Collector transformer processor or equivalent "before-storage" transformer.

Schemas also can be used for query-time transformations where there is no data loss. The "overwrite" rule merely defines how the queries are rewritten.

lmolkova commented 1 year ago

The spec clearly says how to use custom attributes and if you follow the rules nothing like that can happen

Can you point me to this section? I don't remember any limitations on custom attributes, reserving namespaces or anything else.

tigrannajaryan commented 1 year ago

The spec clearly says how to use custom attributes and if you follow the rules nothing like that can happen

Can you point me to this section? I don't remember any limitations on custom attributes, reserving namespaces or anything else.

See Recommendations for Application Developers.

lmolkova commented 1 year ago

The spec clearly says how to use custom attributes and if you follow the rules nothing like that can happen

Can you point me to this section? I don't remember any limitations on custom attributes, reserving namespaces or anything else.

See Recommendations for Application Developers.

It does not tell me I cannot add http.foo.bar attribute in my application and does not contain any normative language.

lmolkova commented 1 year ago

and as an early adopter I'm free to follow the advice here "The name may be generally applicable to applications in the industry. In that case consider submitting a proposal to this specification to add a new name to the semantic conventions, and if necessary also to add a new namespace." and create http.foo.bar, send the PR to the spec and eventually it might make it to the spec. I should be able to use it before it officially happens.

tigrannajaryan commented 1 year ago

The spec clearly says how to use custom attributes and if you follow the rules nothing like that can happen

Can you point me to this section? I don't remember any limitations on custom attributes, reserving namespaces or anything else.

See Recommendations for Application Developers.

It does not tell me I cannot add http.foo.bar attribute in my application and does not contain any normative language.

I think you are right, that page is not well written. I know that the intent was precisely to avoid the conflicts of custom attributes with Otel names. We need to fix it to make it clearer and use normative language.

pyohannes commented 1 year ago

Yes, I think so. The spec clearly says how to use custom attributes and if you follow the rules nothing like that can happen (I don't see what does it have to do with being an early adopter). If they don't follow the rules there are consequences.

For the Overwrite approach, the consequence for not following the rules is data-loss. For the Skip/Ignore approach, the consequence for not following the rules is data that is only partly schema compliant.

In my opinion, the latter is more user-friendly and less invasive. Many of the OTel users I work with would be very critical of their telemetry data being dropped because they don't follow certain "recommendations".

tigrannajaryan commented 1 year ago

In my opinion, the latter is more user-friendly and less invasive. Many of the OTel users I work with would be very critical of their telemetry data being dropped because they don't follow certain "recommendations".

We drop data elsewhere too if it does not fit the requirements, e.g. drop attributes if the count exceeds the configured limit. We record the fact of dropping, but don't keep a copy of dropped data. In that case too the consequence of not following the rules is data-loss. If it is acceptable for setting the attributes in the API then I think it is similarly acceptable for transformations in the Collector.

That said, I don't think that's the best we can do. It seems to me "Preserve and Overwrite" may be a better option. We will preserve the data (although move to a different place, but readily accessible) and also end up with a consistent shape of the output data, complying with the target schema. What does everyone think about this approach?

tigrannajaryan commented 1 year ago

@open-telemetry/specs-approvers any thoughts on the possible options outlined above?

mx-psi commented 1 year ago

"Preserve and Overwrite" is the option that makes the most sense to me, either with an explicit bag of attributes or just with a count of the overwritten attributes. Producing non-compliant data as "Ignore" would do seems like a no-go to me, and it is unclear to me how we would implement "Detect and Reject/Rollback".

tigrannajaryan commented 1 year ago

I updated "Preserve and Overwrite" option above to show how we can possibly preserve the conflicting data. I am inclining towards this solution, it seem the best available from the 4 listed options so far.

lmolkova commented 1 year ago

"Preserve and Override" changes the name of an attribute, which can break user dashboards, alerts, and queries. Given that schema transformation may be part of the vendor ingestion pipeline, users won't even control it. I.e. this change would be breaking for existing user infra build on top of this data.

When conversion happens inside user pipeline, investigating why this data was changed and why everything is broken will be difficult.

So it seems, we're ready to break user tooling for the sake of compliance with later version of otel schema. To me, the benefit does not outweigh the risk and I'm still on "Ignore, Skip" + warn.

lmolkova commented 1 year ago

Thinking more about it, it made me realize that schema transformation by default should preserve original attributes to avoid breaking existing custom tooling built on top of such attributes - https://github.com/open-telemetry/opentelemetry-specification/issues/3510

mx-psi commented 1 year ago

Thinking more about it, it made me realize that schema transformation by default should preserve original attributes to avoid breaking existing custom tooling built on top of such attributes

I don't have a strong opinion on this (one could also argue that vendors are the ones that need to handle that) but I feel like this is independent from this issue; we should discuss this issue assuming the current state of the spec which AIUI is that original attributes are removed.

lmolkova commented 1 year ago

We drop data elsewhere too if it does not fit the requirements, e.g. drop attributes if the count exceeds the configured limit. We record the fact of dropping, but don't keep a copy of dropped data. In that case too the consequence of not following the rules is data-loss. If it is acceptable for setting the attributes in the API then I think it is similarly acceptable for transformations in the Collector.

the difference is that current behavior happens in-process, it's stable and easy to discover with tests. We also give users the ability to control this behavior by configuring limits.

In case of schema transformation, this behavior becomes unstable - someone changes to latest version (either it's processor inside user's collector pipeline, or in the vendor pipeline) and accidentally your data is dropped or moved. This is never discovered by tests because happens much later and is out-of-process.

I suggest the following in #3510

by default schema transformation preserves all original attributes as they are. It does not rename but adds new ones.
- if there is a collision it preserves the old value.
it has a mode when it's allowed to break stuff, there it can drop, rename or do anything else. This is opt-in. Vendors are free to opt-in their users in distros.
In any case, we need a warning/notification mechanism for schema transformation so it warns about collisions and unmet expectations (with throttling).

tigrannajaryan commented 1 year ago

by default schema transformation preserves all original attributes as they are. It does not rename but adds new ones.

if there is a collision it preserves the old value.

@lmolkova From what I understand this is the same as "Ignore, Skip" approach described above. Can you clarify what you suggest to happen with SchemaURL in there is a collision? Do we update the SchemaURL or keep the old value? Also, how do we address the fact that the data as a whole after such partially applied transformation does not comply with any Otel semconv version, neither the old one nor the new one?

lmolkova commented 1 year ago

@lmolkova From what I understand this is the same as "Ignore, Skip" approach described above.

Correct.

~~I meant preserving the old value on the same attribute. I.e.~~

~~namespace.attribute1=foo~~ ~~namespace.attribute2=bar~~ ~~If you have a rename_attributes of namespace.attribute1: namespace.attribute2, then you would wind up with:~~

~~namespace.attribute1=foo~~ ~~namespace.attribute2=bar~~

Can you clarify what you suggest to happen with SchemaURL in there is a collision?

A few options:

Remove the schema URL or register an invalid value. E.g. otel.io/schemas/inconistent, otel.io/schemas/unknown, otel.io/schemas/undetermined
roll back.
- With default mode that does not remove old attributes, it should be easier.
- With opt-in "break things mode", there is no problem - put a new schema URL and modify telemetry as needed

if we cannot provide reasonable defaults, stability guarantees, performance, and observability into edge cases for schema transformation, it means it should just stay experimental and not be used for stable conventions. I assume the bar for attribute renames to be very high after stability is declared and don't quite understand why schema transformation will be necessary.

tigrannajaryan commented 1 year ago

I spent some time to get a better feel for what it would take to implement the "Rollback" option and what the impact will be.

I wrote a toy implementation and run some benchmarks with rollback-ability enabled and disabled.

Toy source code here: https://github.com/tigrannajaryan/telemetry-schema/blob/main/schema/compiled/common_actions.go

The schema converter creates a changelog that can be used for rolling the changes back (for whatever reason, including if conflicts are detected).

Here are the benchmarking results:

name                                    time/op
ConvertSchema/Spans/Non-rollbackable-8   101µs ± 4%
ConvertSchema/Spans/Rollbackable-8       143µs ±12%
Unmarshal/Spans-8                        518µs ± 4%
Marshal/Spans-8                          227µs ± 7%

The first too lines is schema conversion of 100 spans, with and without rollback capability enabled (but without doing any rollback, this just creates the changelog). It tests only the success case, when there are no conflicts and, there is no need to rollback. I believe this is the most important case to care about from performance perspective.

As you can see enabling changelog adds about 40µs (per 100 spans) and roughly 1.4 times slows down the conversion process. As a reference point I also included benchmarks for marshalling/unmarshall the exact same data, which is roughly what a Collector with otlp receiver, a schema processor and an otlp exporter would do.

As a rough number enabling changelog would add overall about 5% of CPU time, which is not insignificant, but I don't think it is a dealbreaker.

Furthermore, I think there is room to improve performance of the changelog if needed (at the cost of complexity and maintainability).

Also take the benchmark with a huge grain of salt. It is just one scenario, with one particular schema conversion. We can validate this further, but for now it is enough for me to get a very approximate sense of expected performance.

So, given all the above, in my opinion, performance is not going to be the decisive factor. At the moment I am more concerned with the complexity of implementation of the changelog and the opportunity to introduce bugs. IMO, it is not terribly complicated, but still relatively easy to make a mistake, and especially since it is in a code that is exercised rarely (rollbacks will be rare) the bugs introduced may be hard to notice.

To be clear: I would prefer the "Rollback" option since it results in the best functional outcome IMO: no corrupt data belonging to no schema, clear ability to see the errors if needed, etc.

I would like an opinion of @open-telemetry/collector-approvers @open-telemetry/collector-contrib-approvers @open-telemetry/collector-maintainers @open-telemetry/collector-contrib-maintainer on this.

MovieStoreGuy commented 1 year ago

To be clear: I would prefer the "Rollback" option since it results in the best functional outcome IMO: no corrupt data belonging to no schema, clear ability to see the errors if needed, etc.

I worry about rollbacks because it feels like silently failing but with more steps, I also worry about the cognitive burden on developing schemas since it could be possible to add in a conflict without realising it.

If this has been deployed on collector gateway so data mutations happen on egress to vendor, but a conflict caused rollbacks of the data then my expected visualisations and alerts are broken if I am expecting them to be of a fixed version.

Not sure of the likelihood of this scenario but I can see it becoming a higher chance of happening while transition from manual instrumentation to a hybrid of auto instrumentation.

MovieStoreGuy commented 1 year ago

@tigrannajaryan , let me bring this to the sig.

tigrannajaryan commented 1 year ago

@tigrannajaryan , let me bring this to the sig.

@MovieStoreGuy thanks, please do.

dmitryax commented 1 year ago

The level of complexity required to support the rollback mechanism seems excessive to me, considering that such conflicts should be infrequent occurrences. Additionally, if I configure the schema transform processor to generate a specific schema version, I would expect the processor to make its best attempt to translate the data to that version. Otherwise, I would opt to use a different processor and modify the data accordingly.

I believe we can produce the data as close to the required state as possible. Even if we encounter a conflict, we can utilize the value set by the user, which may not precisely adhere to the specification but will, at least, have the correct attribute name. We might encounter this situation regardless in the following scenario:

We have a rename_attributes of namespace.attribute1: namespace.attribute2 for version 1 ->1.1` translation
User sends namespace.attribute2=bar with the schema incorrectly identified as version 1

In that case, translation isn't triggered, and we simply update the schema from version 1 to 1.1. Therefore, I believe the option originally suggested by @lmolkova and @jsuereth (Ignore, Skip in the list) is reasonable, but I'm not sure if it should be the default behavior. It can be a boolean configuration option like override_existing.

open-telemetry / opentelemetry-specification