open-telemetry / semantic-conventions

Defines standards for generating consistent, accessible telemetry across a variety of domains
Apache License 2.0
266 stars 169 forks source link

Semantic conventions vs GDPR #128

Open pellared opened 1 year ago

pellared commented 1 year ago

Per https://github.com/open-telemetry/semantic-conventions/blob/9b455310519ec511656f91d1db0e30f5e32acd2a/specification/trace/semantic_conventions/http.md#http-client

url.full is currently Required.

However, the URL can contain sensitive data e.g. personal data (PII) in GDPR terminology (e.g. login, ID).

GDPR adds many rights to the subjects a lot of rights that may be very problematic (e.g. https://www.digitalguardian.com/blog/google-fined-57m-data-protection-watchdog-over-gdpr-violations).

Maybe it should be Recommended similarly to device.id which also has the following notice: https://github.com/open-telemetry/semantic-conventions/blob/9b455310519ec511656f91d1db0e30f5e32acd2a/specification/resource/semantic_conventions/device.md?plain=1#L17?

Maybe should add some notice that the URL can contain sensitive/personal data and one may consider to delete it using OTel Collector's attributesprocessor?

Maybe the collection of this attribute should be configurable?

The same concerns apply to db.statement.

PS. I am sorry that the issue has questions than answers.

pellared commented 1 year ago

What I suggest is:

  1. Identify attributes that has probability (which is not very low) to contain PII (Personally Identifiable Information) e.g. url.full, url.path, url.query client.address, client.socket.address, db.statement.
  2. For each of these: 2.1. Add caution note e.g. The value may store sensitive data (e.g. personal data, personally identifiable information). GDPR and data protection laws may apply, ensure you do your own due diligence. 2.2. Make sure that it is NOT Required. So that people who do not use collector may not collect them.

We could also consider adding some env var like OTEL_SEMCONV_HARDEN_OPT_IN to make it easier to disable such attributes.

Oberon00 commented 1 year ago

I see you had already contributed in that area in the past: https://github.com/open-telemetry/opentelemetry-specification/pull/1502 Could it be that these considerations got lost in the current rewrite? CC @lmolkova

pellared commented 1 year ago

@Oberon00 In https://github.com/open-telemetry/opentelemetry-specification/pull/1502 I was mostly concerned about leaking login+password (I find it a lot more critical).

I felt that mentioning GDPR/sensitive data would not bring a lot of value 2 years ago as semantic conventions were in a very early stage. I was afraid that it would cause more confusion and paralysis. Right now I see a lot of contributions in semantic conventions. I think that it is a good moment to start doing something with privacy/GDPR/data protection.

reyang commented 1 year ago

I want to add one point - exception callstacks are commonly considered as privacy data.

pellared commented 1 year ago

Today, I learned that for one of our customer uses client (mobile) instrumentation (Android, iOS). The PII data emitted via HTTP Client instrumentation is very problematic for them. Because the instrumentation works on "end-user" devices therefore it is not easy to get rid of them. Ideally they would prefer to not emit them from the devices at all (via network).

pellared commented 1 year ago

I want to add one point - exception callstacks are commonly considered as privacy data.

I think it usually leaks "internal details" of the instrumented system (most telemetry is causing leakage of some internal details). I do not think it would be easy that they leak "personal data" via exception callstack UNLESS they also contain parameter values. I am not aware of any language/ecosystem which dumps parameter values in exception callstacks.

reyang commented 1 year ago

@Oberon00 In open-telemetry/opentelemetry-specification#1502 I was mostly concerned about leaking login+password (I find it a lot more critical).

One extra point - login/password leak can happen anywhere. I'll give an example from Microsoft Azure - in many places we allow users to put "tags" (very generic, arbitrary strings that can be associated with some entities) which can be used to group or search things. And we know that some users could put sensitive information there (e.g. it could be their emails or even passwords), so we consider all of these privacy data and put lots of efforts on redaction, classification, isolation and access control.

Another typical issue is that people might put something wrong by mistake when they instrument, in a large system we do see developers making mistakes (e.g. putting user email address in an attribute named "ResourceType"). I feel schema cannot solve these problems, a centralized scanning/redaction system can provide consistent/reliable guarantee (which comes with perf cost for sure).

pellared commented 1 year ago

I feel schema cannot solve these problems

For sure it cannot solve it. It is more about Defense in Depth and making the hardening more straightforward and adding protection on more layers.

utezduyar commented 1 year ago

I really like the idea of env. variable or something similar to do best effort of anonymizing PII data. I think it is crucial to align on the best effort of not sending this data outside of client devices.

pellared commented 1 year ago

We could also consider adding some configuration/feature to the SDK that would allow attribute retraction (e.g. via exporter decorator). This could help user's were retraction using OTel Collector's attributesprocessor is problematic (like here). But without additional hints the user may still have trouble to find telemetry which has "not low" probability to contain PII or other sensitive data.

pellared commented 1 year ago

I had a conversation with @trask, and here is the summary:

Our current focus lies on the HTTP semantic convention as it's planned to be stable and can serve as an example for future semantic conventions.

During our discussion, we identified the following attributes as problematic:

  1. url.full
  2. url.path
  3. url.query

Regarding url.path and url.query, we both agreed that these attributes could be changed from Required to Recommended. We propose adding a note like: The value SHOULD be captured by default. The value may store sensitive data. GDPR and data protection laws may apply, ensure you do your own due diligence. Instrumentations SHOULD offer a way to not capture this attribute.

However, for url.full, we haven't reached a satisfactory solution yet. We discussed two possible options:

A) Apply the same approach as for url.path and url.query. Nevertheless, this would cause the telemetry to currently miss information, such as the URL scheme.

B) Implement an opt-in functionality to scrub/retract the path and query parts from the URL. However, this leads that most of the data is redundant (except for the missing URL scheme).

Personally, I lean towards option A for since I don't consider URL scheme to be critical telemetry. Moreover, we can always add url.scheme to HTTP Client later, for example, as Recommended or even Required. Lastly, adding additional scrub/retraction functionality would be more complex and bug-prone than simply not collecting an attribute.

On the other hand, we are both not sure if such attribute retraction should be at the instrumentation level, we both think that maybe the SDK should offer a more general way to suppress specific attributes. Maybe SpanProcessors (rather impossible) for spans and Views (it only contains "allow-list", no "deny-list") for metrics are the way to go? EDIT: I got a feedback from @MrAlias that doing filtering on higher level would lead the resource consumption overhead that may not be acceptable. Not collecting unwanted attributes on instrumentation level would be more efficient.

jsuereth commented 1 year ago

I'd like to have a broader OTEL-wide discussion on handling GDPR concerns. Specifically:

I'm going to raise the discussion in the next TC meeting to see if we can get alignment on a direction here and what thoughts exist from others. I suggest this is worth some good brainstorming.

pellared commented 1 year ago

Related issue https://github.com/open-telemetry/opentelemetry-go-contrib/issues/3895

pellared commented 1 year ago

@jsuereth Any update?

pellared commented 1 year ago

I just spotted

In some semantic conventions, the data collected as a span attribute could contain PII (Personally Identifiable Information). As a general guideline, do not collect this data by default.

in https://github.com/open-telemetry/opentelemetry.io/pull/3309

trask commented 1 year ago

related, we have had PII data reported a couple of times now in exception.message: https://github.com/open-telemetry/opentelemetry-java-instrumentation/issues/3039

jsuereth commented 1 year ago

I think, for o11y, it's impossible to NOT collect PII. At this point our approach should be the following:

The guidance to not generate PII by default is temporary, pending better enable/disable features in the SDK for full user control.

pellared commented 12 months ago

SIG meetings notes.

It appears that the general agreement on how to handle the problem is as follows:

  1. Include a marker for the attributes that might contain personally identifiable information (PII) or sensitive data. This way we clearly can document which attributes may need special care.
  2. Establish the _RETRACTED value, which can be employed by instrumentation libraries to remove attribute values. This way there is still a value for the Required attributes and moreover it is transparent what attributes are retracted.
pellared commented 12 months ago

After some thoughts, I find

are good enough for solving the issue.

It is also inline with my original proposal: https://github.com/open-telemetry/semantic-conventions/issues/128#issuecomment-1604073030