open-telemetry / opentelemetry-python

OpenTelemetry Python API and SDK
https://opentelemetry.io
Apache License 2.0

Basic protection against unintended cardinality explosion or similar problems #2700

Open srikanthccv opened 2 years ago

srikanthccv commented 2 years ago

We should have some basic checks in place to guard against unintended programming mistakes that might lead to high memory usage and hurt the host application's performance. We probably need to introduce a limit on the number of resulting metric streams, and, for pull exporters, a limit on how long measurements are kept in memory before we give up and drop them with a warning message. Related: #874
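Roughly, the kind of guard this asks for might look like the sketch below. Everything here is illustrative: `MAX_METRIC_STREAMS`, `MAX_RETENTION_SECONDS`, and `StreamStore` are hypothetical names, not SDK API, and the actual limits and eviction policy would need to follow whatever the spec decides.

```python
# Hypothetical sketch of the two checks described above; names and
# values are illustrative, not part of the opentelemetry-python SDK.
import logging
import time

logger = logging.getLogger(__name__)

MAX_METRIC_STREAMS = 2000       # cap on distinct attribute sets per instrument
MAX_RETENTION_SECONDS = 300.0   # how long to hold data waiting for a pull exporter


class StreamStore:
    def __init__(self):
        # frozenset of attribute items -> (value, last-update timestamp)
        self._streams = {}

    def record(self, attributes: dict, value: float) -> None:
        key = frozenset(attributes.items())
        if key not in self._streams and len(self._streams) >= MAX_METRIC_STREAMS:
            logger.warning("Metric stream limit reached; dropping measurement")
            return
        self._streams[key] = (value, time.monotonic())

    def expire(self) -> None:
        # For pull exporters: drop measurements nobody collected in time.
        now = time.monotonic()
        stale = [k for k, (_, ts) in self._streams.items()
                 if now - ts > MAX_RETENTION_SECONDS]
        for k in stale:
            logger.warning("Dropping stale metric stream after %.0fs",
                           MAX_RETENTION_SECONDS)
            del self._streams[k]
```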

nstawski commented 1 year ago

I would like to take this issue.

aabmass commented 1 year ago

This has been added to the spec (experimentally)! https://github.com/open-telemetry/opentelemetry-specification/blob/main/specification/metrics/sdk.md#cardinality-limits

@nstawski have you had a chance to look into this at all?
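For reference, the linked spec section says that once the cardinality limit is reached, further attribute sets are folded into a single overflow data point carrying the attribute `otel.metric.overflow=true`. A minimal sketch of that behavior (the `record` helper and dict-based store are illustrative, not the SDK's internals):

```python
# Sketch of the spec's experimental overflow behavior: past the limit,
# new attribute sets aggregate into one otel.metric.overflow data point.
OVERFLOW_KEY = frozenset({("otel.metric.overflow", True)})


def record(streams: dict, attributes: dict, value: float, limit: int = 2000) -> None:
    key = frozenset(attributes.items())
    if key not in streams and len(streams) >= limit:
        # Fold the excess measurement into the single overflow point.
        key = OVERFLOW_KEY
    streams[key] = streams.get(key, 0.0) + value
```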

lzchen commented 1 year ago

I have implemented this feature for OpenTelemetry Rust. I would suggest splitting this work into two parts:

  1. (Higher priority) Implement the default behavior with 2000 as the stream limit. As discussed in the Python SIG on 6/29, we decided to implement this by dropping the labels of measurements that exceed the limit and aggregating them into an overflow data point.
  2. Implement the ability to configure the stream limit at the MeterProvider level and the View level (see the sketch after this list).
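One possible shape for point (2)'s configuration surface is sketched below. Note the `cardinality_limit` keyword and these stand-in `MeterProvider`/`View` dataclasses are hypothetical, written only to show the precedence rule (View overrides provider default); they are not the SDK's actual classes.

```python
# Hypothetical configuration surface for a stream limit at the
# MeterProvider and View levels; not current opentelemetry-python API.
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class View:
    instrument_name: str
    cardinality_limit: Optional[int] = None   # per-View override


@dataclass
class MeterProvider:
    cardinality_limit: int = 2000             # provider-wide default
    views: List[View] = field(default_factory=list)

    def limit_for(self, instrument_name: str) -> int:
        # A View-level limit, when set, takes precedence over the default.
        for view in self.views:
            if (view.instrument_name == instrument_name
                    and view.cardinality_limit is not None):
                return view.cardinality_limit
        return self.cardinality_limit


provider = MeterProvider(views=[View("http.server.duration", cardinality_limit=500)])
assert provider.limit_for("http.server.duration") == 500
assert provider.limit_for("queue.depth") == 2000
```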

Just as a note, (1) is a potentially behavior-breaking change. Users who create enough data points per timeseries to exceed the limit will automatically have their labels dropped. Also, the spec does not define a specific algorithm for determining which labels to drop (one might be defined later). Therefore we should probably have some safeguards before we release this. We can:

  1. Inform users that this is an experimental feature (similar to what we do with other experimental features that have stable signals), so that whichever algorithm the spec eventually defines can be adopted without breaking anyone.
  2. Hide the default behavior behind a feature flag, or allow a configuration option for the algorithm (not ideal: I don't like the idea of additional configuration, and we also do not have anything behind feature flags currently). A rough sketch of such a flag follows this list.
  3. Possibly just release this and proactively handle any issues that come up.
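If we went with option (2), the flag could be as simple as an environment variable gate. The variable name below is made up for illustration; it is not an agreed-upon knob:

```python
# Illustrative feature-flag gate for the new default behavior; the
# environment variable name is hypothetical.
import os


def cardinality_limit_enabled() -> bool:
    value = os.environ.get("OTEL_PYTHON_EXPERIMENTAL_CARDINALITY_LIMIT", "false")
    return value.strip().lower() in ("1", "true", "yes")


# None would mean "unlimited", i.e. today's behavior.
limit = 2000 if cardinality_limit_enabled() else None
```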

nstawski commented 1 year ago

@aabmass @lzchen thank you, will start working on it today.

lzchen commented 1 year ago

@nstawski

Any updates on this?

nstawski commented 1 year ago

@lzchen I was a bit stuck; I reached out to Srikanth and recently got a response from him. Working on it, will ping you / open a pull request soon.