open-telemetry / oteps

OpenTelemetry Enhancement Proposals
https://opentelemetry.io
Apache License 2.0

Data Classifications for resources and attributes #187

Closed MovieStoreGuy closed 1 year ago

MovieStoreGuy commented 2 years ago

The idea behind this is that not all telemetry data is the same with regards to importance, how it is processed, and what is supported by the downstream vendors.

This approach would allow for efficient checking of data in order to apply whatever configuration is required by your organisation.

Some sections are raw and left as dot points pending discussion to help expand upon them.
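
For concreteness, here is a minimal sketch of what such a classification flag set could look like in Go. The type and names (Classification, PII, UGC, HighCardinality, Has) are hypothetical illustrations, not something defined by the OTEP:

```go
package main

import "fmt"

// Classification is a 64-bit flag set; each bit marks one class of data.
type Classification uint64

const (
	PII             Classification = 1 << iota // personal data (PD)
	UGC                                        // user-generated content
	HighCardinality                            // unbounded value sets
)

// Has reports whether every class in c is set on cl.
func (cl Classification) Has(c Classification) bool { return cl&c == c }

func main() {
	attr := PII | HighCardinality
	fmt.Println(attr.Has(PII)) // true
	fmt.Println(attr.Has(UGC)) // false
}
```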

anuraaga commented 2 years ago

This seems related to this spec issue https://github.com/open-telemetry/opentelemetry-specification/issues/2114

At a glance, classifications like sensitive, high-cardinality do seem more useful when creating attribute views than just optional and required.

/cc @lmolkova

MovieStoreGuy commented 2 years ago

the word "resource" has specific meaning in OTEL, you're overloading it by saying "resource classification". You are classifying attributes, not resources.

Both are being classified here; attributes have their own classification, along with the resource sending them. The reason is that if you can not allow resources with PD or UGC to be exported to a vendor, you can drop the resource or remove the attributes that contain PD or UGC.

Classifying the resource is important because it does not require searching / validating attributes or other fields of the resource.

I think the classifications can be part of the schema files that we already support, I don't see why std attributes need to be classified in user code.

I don't see an easy pathway to success by doing this, and I think maintaining it within the schema (semantic convention) alone would create a lot of friction when trying to release changes to it.

I'm fairly certain there will be big pushback on making this an SDK feature, why not do it in the collector where it can be implemented once?

Some organisations can not allow PD or UGC to leave the application, so implementing this only in the collector already restricts those users from adopting it.

before an OTEP like this is accepted there needs to be a working prototype with benchmarks

I currently have a rather stripped down version that I used to test my assumptions; the scenarios I replicated were:

  • Regex filter (removing attributes that matched a pattern)
  • Lookup based filter (have a list of known strings to exactly match)
  • Classification Filtering (searching for attributes that are either PD or UGC)
  • Resource Filtering (removing resources if they contain PD or UGC)

Using Go's benchmarking tooling, I got the following results:

BenchmarkBaselineFilter/Collection-Size-100-12             40220         28404 ns/op
BenchmarkLookupFilter/Collection-Size-100-12               65653         18251 ns/op
BenchmarkRegxFilter/Collection-Size-100-12                   835       1408659 ns/op
BenchmarkClassificationAttributeFilter/Collection-Size-100-12          50774         23447 ns/op
BenchmarkClassificationResourceFilter/Collection-Size-100-12        1000000000           0.0000052 ns/op

Simply ranging over an attribute collection of size 100 took 28404 ns/op; the fastest was the lookup filter containing exact strings to match at 18251 ns/op, with classification attribute filtering taking 23447 ns/op (approximately midway between the two benchmarks).

However, using resource classification filtering, it can filter out resources in sub-nanosecond time.
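
For intuition on that last result: a resource-level filter can reduce to a single bitwise AND against a pre-computed mask, independent of the attribute count, which is cheap enough that the compiler may optimise the benchmarked call away entirely. A rough sketch, reusing the hypothetical Classification type from above (this is not the prototype code):

```go
package main

import "fmt"

type Classification uint64

const (
	PII Classification = 1 << iota
	UGC
)

// Resource carries a pre-computed union of its attributes' classifications.
type Resource struct {
	Class Classification
	Attrs map[string]string // values elided for brevity
}

// DropResource is O(1): one AND and one compare, regardless of how many
// attributes the resource carries, so no attribute scan is needed.
func DropResource(r *Resource, denied Classification) bool {
	return r.Class&denied != 0
}

func main() {
	r := &Resource{Class: PII, Attrs: map[string]string{"user.email": "..."}}
	fmt.Println(DropResource(r, PII|UGC)) // true: dropped wholesale
}
```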

MovieStoreGuy commented 2 years ago

By the end of the day, I'll publish my mini prototype so others can play with it themselves.

tsloughter commented 2 years ago

I'm fairly certain there will be big pushback on making this an SDK feature, why not do it in the collector where it can be implemented once?

Are you sure? Not only does not everyone use the collector, but that would still mean sending the PII data to the collector.

tigrannajaryan commented 2 years ago
I think the classifications can be part of the schema files that we already support, I don't see why std attributes need to be classified in user code.

+1 to this. I think it is worth entertaining this alternative to see what the tradeoffs are. The SDKs already contain pre-built dictionaries of standard attributes (in the Go SDK it is in the semconv package). We can include the classification in these dictionaries and avoid the need to:
  • Send classification information over the network.

For non-standard attributes it may still be necessary to do what this OTEP suggests. I don't know how important it is though, perhaps standard attributes cover the majority of cases where classification matters (maybe not, I may be wrong).

I don't see an easy pathway to success by doing this, and I think maintaining it within the schema (semantic convention) alone would create a lot of friction when trying to release changes to it.

On the other hand by NOT making it part of the semantic conventions there is a big risk of having inconsistent behaviour in different SDKs, where each instrumentation author makes their own classification decisions.

jamesmoessis commented 2 years ago

I think the classifications can be part of the schema files that we already support, I don't see why std attributes need to be classified in user code.

Maybe there could be sensible defaults which can be overridden. The semantic convention, or the instrumentation author, can't know which attributes will be sensitive because it changes between applications. For example, net.peer.ip may or may not be PII depending on whether the peer is a user's device or another server. Only the application owner can know that information.
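
A hedged sketch of what that "convention defaults, application overrides" lookup could look like; defaults, overrides, and Classify are invented names for illustration, not an existing SDK surface:

```go
package main

import "fmt"

type Classification uint64

const (
	PII Classification = 1 << iota
	HighCardinality
)

// defaults would be generated from the semantic convention / schema files.
var defaults = map[string]Classification{
	"net.peer.ip": HighCardinality, // the convention alone can't know if this is PII
}

// overrides hold deployment-specific knowledge from the application owner.
var overrides = map[string]Classification{}

// Classify resolves an attribute's classification, preferring overrides.
func Classify(key string) Classification {
	if c, ok := overrides[key]; ok {
		return c
	}
	return defaults[key]
}

func main() {
	// An edge service whose peers are end-user devices marks the IP as PII too.
	overrides["net.peer.ip"] = HighCardinality | PII
	fmt.Printf("net.peer.ip -> %b\n", Classify("net.peer.ip"))
}
```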

MovieStoreGuy commented 2 years ago

As I mentioned earlier, here is the prototype: https://github.com/MovieStoreGuy/data-classifier. It is rather stripped down, only preserving attributes and resources (without any of the values or nested fields).

+1 to this. I think it is worth entertaining this alternative to see what the tradeoffs are. The SDKs already contain pre-built dictionaries of standard attributes (in the Go SDK it is in the semconv package). We can include the classification in these dictionaries and avoid the need to:

Right, I think you're referring to the Classification Values here? I am happy for them to be defined as part of the semantic convention since the values themselves should be rather static over the longevity of the project. (High Cardinality, PII, or UGC data will not stop being a problem).

To your point of:

Send classification information over the network.

In the scenario where you have mixed-mode deployments of compute nodes, lambdas, and users running experiments / tests locally, not sending the classification would mean that the internal gateway / proxy collector forwarding the data can not filter out known classifications that are unsupported by the system being exported to. And if application owners ignored classification filtering in their exporters, resources or attributes could not be classified downstream without impacting processing performance.

Reactive classification is extremely computationally expensive with a lot of operational overhead; proactive classification reduces the need to categorise the data in order to filter it.

Moreover, I mention within the OTEP that half of the requested space (64 bits) can be reserved for vendors to enable advanced features like smaller retention windows, or improved resource catalogs when filtering for telemetry (SLO resources would be more important than high-cardinality resources, for example). The vendor can extend their exporter to take additional actions when these classifications are set.
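
Assuming a single 64-bit word with the upper half reserved for vendors (one reading of the split described above), the partition could look roughly like this; all constant names here are invented for illustration:

```go
package main

import "fmt"

type Classification uint64

// Lower half: classifications defined by the semantic convention.
const (
	PII Classification = 1 << iota
	UGC
	HighCardinality
)

// Upper half: reserved for vendor-specific meanings.
const VendorMask Classification = 0xFFFFFFFF00000000

const (
	VendorShortRetention Classification = 1 << (32 + iota) // e.g. smaller retention window
	VendorSLOResource                                      // e.g. prioritised in resource catalogs
)

// IsVendorClass reports whether any vendor-reserved bit is set.
func IsVendorClass(c Classification) bool { return c&VendorMask != 0 }

func main() {
	c := PII | VendorShortRetention
	fmt.Println(IsVendorClass(c)) // true: the exporter can act on the vendor bits
}
```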

Classifications could be limited to the application context and not included as part of the wire datagram; however, I see a lot of value in including them, in order to help build a central means of ensuring resource data isn't sent to the wrong place.

I don't know how important it is though, perhaps standard attributes cover the majority of cases where classification matters (maybe not, I may be wrong).

Our current adoption of OpenTelemetry at Atlassian requires heavy stream processing in order to filter out the standard attributes that are high cardinality. We are also wanting to extend the semantic convention to support the idea of real user monitoring (RUM), which is currently not defined in the semantic convention, so being able to add classifications to those attributes and resources means we could experiment with the idea correctly without having to add additional processing requirements to our internal telemetry streaming service.

On the other hand by NOT making it part of the semantic conventions there is a big risk of having inconsistent behaviour in different SDKs, where each instrumentation author makes their own classification decisions.

I agree, the values of the classifications have to be consistent across languages, and the semantic convention is the best place to define them.

MovieStoreGuy commented 2 years ago

To @jamesmoessis point, classifications are coupled to the application deployment.

Meaning that an application sitting on the network edge would define net.peer.ip as PII, whereas an internal service behind the firewall would not consider it PII; both, however, may consider it high cardinality.

yurishkuro commented 2 years ago

Some organisations can not allow for PD, UGC to leave the application so only implementing in the collector already restricts those users from adopting.

Are those organizations prepared to commit significant engineering capacity to developing and maintaining this non-trivial capability in every language supported by OTel? On the other hand, a deployment of the main app with collector as a sidecar communicating over local network (i.e. in practice over local unix socket) - why would that be considered "leaving the application"?

MovieStoreGuy commented 2 years ago

Hey @yurishkuro ,

Are those organizations prepared to commit significant engineering capacity to developing and maintaining this non-trivial capability in every language supported by OTel?

I am not sure of your intent behind this comment and I want to understand more :) I fully understand that a lot of the language projects are struggling to reach feature completeness, and adding this would add more to those backlogs; however, the intent of this OTEP is to discuss its merits.

I wanted to understand which part of this OTEP is non-trivial. As I understand the problem, adding classification support into OTel means at most adding fewer than 10 required methods to the SDK, each with a small implementation, plus adding the classification values to the semantic convention, which is mostly auto-generated stubs per language.

I can see that managing the classifications within the semantic convention, and how the convention would adopt them, is non-trivial, and these points can be expanded upon further here :)

Per your commitment comment, if the change is as simple as I think, then it shouldn't require a significant timeline. I am happy to take on making a collector extension to support this idea, as the collector has been the main project I contribute to.

On the other hand, a deployment of the main app with collector as a sidecar communicating over local network (i.e. in practice over local unix socket)

I am not certain that communicating via unix socket is supported by the collector? Let me check that :)

Why would that be considered "leaving the application"?

Once it has been sent by the application, it is no longer the application that has control of the data, and that is where the concern is.

pirgeo commented 2 years ago

I like the idea. I am, however, worried a little bit about metric aggregations, since attributes are part of the metric identity. Dropping them can require re-aggregation of data (e.g. https://github.com/open-telemetry/opentelemetry-specification/issues/1297). I am not sure how easy or hard that is to add to existing metrics SDKs or if that would require a lot of additional processing on the collector side (@jmacd, @reyang, @jsuereth, do you have thoughts on this?)

MovieStoreGuy commented 2 years ago

@pirgeo , I think you can get away with this without the need to re-aggregate.

You can follow a similar principle to MapReduce to get around it, meaning:

In short, I think this could simply be a decorator on the exporter or considered an optional extension.
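
A minimal sketch of that MapReduce-style idea for sum-aggregated metrics, as something an exporter decorator could call: drop the denied attributes (map), then merge the data points whose identities collapse (reduce). The types here are simplified stand-ins, not the real SDK API:

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

type Classification uint64

const PII Classification = 1 << iota

type Attr struct {
	Key, Value string
	Class      Classification
}

// Point is a simplified counter data point: attributes plus a running sum.
type Point struct {
	Attrs []Attr
	Sum   int64
}

// identity builds a stable key from the surviving attributes.
func identity(attrs []Attr) string {
	parts := make([]string, 0, len(attrs))
	for _, a := range attrs {
		parts = append(parts, a.Key+"="+a.Value)
	}
	sort.Strings(parts)
	return strings.Join(parts, ",")
}

// FilterAndMerge drops denied attributes (map step), then sums points whose
// identities collapsed (reduce step). Only valid for sum aggregations.
func FilterAndMerge(points []Point, denied Classification) []Point {
	merged := map[string]*Point{}
	var order []string
	for _, p := range points {
		var kept []Attr
		for _, a := range p.Attrs {
			if a.Class&denied == 0 {
				kept = append(kept, a)
			}
		}
		id := identity(kept)
		if m, ok := merged[id]; ok {
			m.Sum += p.Sum
		} else {
			merged[id] = &Point{Attrs: kept, Sum: p.Sum}
			order = append(order, id)
		}
	}
	out := make([]Point, 0, len(order))
	for _, id := range order {
		out = append(out, *merged[id])
	}
	return out
}

func main() {
	pts := []Point{
		{Attrs: []Attr{{Key: "user.id", Value: "1", Class: PII}, {Key: "http.method", Value: "GET"}}, Sum: 2},
		{Attrs: []Attr{{Key: "user.id", Value: "2", Class: PII}, {Key: "http.method", Value: "GET"}}, Sum: 3},
	}
	for _, p := range FilterAndMerge(pts, PII) {
		fmt.Printf("%s -> %d\n", identity(p.Attrs), p.Sum) // http.method=GET -> 5
	}
}
```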

MovieStoreGuy commented 2 years ago

It looks like a simplified version of this has been started here: https://github.com/open-telemetry/oteps/pull/100, whereas this proposal encapsulates all data classifications.

Perhaps we should join these efforts?

tedsuo commented 1 year ago

@MovieStoreGuy we are marking OTEPs which are no longer being updated as stale. We will close this PR in one week. If you wish to resurrect this PR, that is fine.

MovieStoreGuy commented 1 year ago

No worries, I don't have much bandwidth to work on this now but I would love to pick this up at some point.