open-telemetry / opentelemetry-js

OpenTelemetry JavaScript Client
https://opentelemetry.io
Apache License 2.0
2.72k stars 792 forks source link

Ability to remove/flush unused attributes in metric instruments #2997

Open legendecas opened 2 years ago

legendecas commented 2 years ago

Removing/flushing unused dynamic attributes in metric instruments is very important to not leaking memories.

Spec issue: https://github.com/open-telemetry/opentelemetry-specification/issues/1297 Prometheus Suggestion: https://prometheus.io/docs/instrumenting/writing_clientlibs/#labels

Metrics with labels SHOULD support a remove() method with the same signature as labels() that will remove a Child from the metric no longer exporting it, and a clear() method that removes all Children from the metric. These invalidate caching of Children.

Opening this issue to keep track of this problem.

jmacd commented 2 years ago

@legendecas There are at least four issues being discussed here and in #3105.

If you're using delta temporality, you should be able to avoid memory buildup regardless of which attributes are in use. Memory buildup is a problem for the consumer downstream, in this case.

If you're using cumulative temporality and you find a need to delete() the way Prometheus suggersts, you may instead (in the current OTel spec) use an asynchronous instrument and unregister the callback producing data for that instrument when you want to stop reporting it. Support for callback unregister was explicitly for this purpose, as (IMO) we do not want to try to define the delete operation. This was debated in https://github.com/open-telemetry/opentelemetry-specification/pull/2317.

If you're using cumulative temporality and you simply want to strip attributes, so that less memory is needed in the long term, restarting (or reconfiguring) the SDK with a View that eliminates those attributes should give the correct results. That is the sense in which https://github.com/open-telemetry/opentelemetry-specification/issues/1297 defines safe attribute removal for SDKs--if you've implemented views you already have the machinery in place to safely remove attributes.

The final potential topic here is being discussed in https://github.com/open-telemetry/opentelemetry-specification/issues/1891, question is what we want to do when cardinality is high and there are stale entries that could be removed from memory. Please join that discussion. We're able to do this with per-series start timestamps; however this complicates other things, and I'd like official guidance from Prometheus on this.

legendecas commented 2 years ago

If you're using delta temporality, you should be able to avoid memory buildup regardless of which attributes are in use. Memory buildup is a problem for the consumer downstream, in this case.

When using delta temporality with an async instrument, the SDK needs to export delta values for the async instrument thus it has to remember the last reported metric streams: https://github.com/open-telemetry/opentelemetry-specification/blob/main/specification/metrics/supplementary-guidelines.md#asynchronous-example-delta-temporality.

jmacd commented 2 years ago

Thanks @legendecas. You are correct. My statement about delta temporality only applies to synchronous instruments. This is the sense in which Lightstep would like to support a temporality preference named "stateless" that would configure delta temporality for synchronous and cumulative temporality for asynchronous instruments.

As you point out, asynchronous instruments when configured with delta temporality have the same problem as synchronous instruments configured with cumulative temporality, making async/delta have roughly the same problem as discussed in https://github.com/open-telemetry/opentelemetry-specification/issues/1891. Please head to that discussion, I will continue there with a takeaway from this one.

dyladan commented 2 years ago

@legendecas what would you think about documenting this issue and handling handling attribute removal after GA? I would like to get the GA release out before kubecon and this is the last big topic holding it.

legendecas commented 2 years ago

Given that the specification of metrics api is already declared stable, I'm fine with tagging this issue as after-GA.

asafm commented 1 year ago

I've raised an issue for adding Delete() for an instrument in the API specifications: https://github.com/open-telemetry/opentelemetry-specification/issues/3062

github-actions[bot] commented 1 year ago

This issue is stale because it has been open 60 days with no activity. Remove stale label or comment or this will be closed in 14 days.

eplightning commented 1 year ago

Does this issue cover asynchronous instruments as well?

I have encountered the issue where SDK keeps exporting metrics which no longer get reported by my callbacks:

    this.objectState = capiMeter().createObservableGauge('capi.object.state', {
      valueType: ValueType.INT,
      unit: '{state}',
      description: 'The current state of an object'
    });

    this.objectState.addCallback(async (result) => {
      for (const obj of await this.objectStorage.getMetrics()) {
        result.observe(obj.active ? 1 : 0, {
          'capi.object.name': obj.name
        });
      }
    });

    // SDK will keep exporting all metrics for all attribute sets that ever got observed inside the callback, not just the recent call
pichlermarc commented 1 year ago

Does this issue cover asynchronous instruments as well?

I have encountered the issue where SDK keeps exporting metrics which no longer get reported by my callbacks:

    this.objectState = capiMeter().createObservableGauge('capi.object.state', {
      valueType: ValueType.INT,
      unit: '{state}',
      description: 'The current state of an object'
    });

    this.objectState.addCallback(async (result) => {
      for (const obj of await this.objectStorage.getMetrics()) {
        result.observe(obj.active ? 1 : 0, {
          'capi.object.name': obj.name
        });
      }
    });

    // SDK will keep exporting all metrics for all attribute sets that ever got observed inside the callback, not just the recent call

@eplightning Yep, that's also something that would be covered by it. Does it keep reporting also when un-registering the callback (calling .removeCallback() on objectState)? :thinking:

eplightning commented 1 year ago

@pichlermarc

Doesn't seem to change anything: old metrics remain. Tested on Prometheus and console exporters:

    const cb = async (result: ObservableResult<ShardMetricLabels>) => {
      for (const shard of await this.shardStorage.getMetrics()) {
        result.observe(shard.active ? 1 : 0, {
          'capi.shard.name': `${shard.name}`
        });
      }
    };

    this.metricShardState.addCallback(cb);
    console.log('registered');

    setTimeout(() => {
      this.metricShardState.removeCallback(cb);
      console.log('unregistered');
    }, 1000 * 120);
eplightning commented 1 year ago

Spec update was recently merged which I believe concerns this issue as well (at least the async instruments): https://github.com/open-telemetry/opentelemetry-specification/pull/3242

pichlermarc commented 1 year ago

Spec update was recently merged which I believe concerns this issue as well (at least the async instruments): open-telemetry/opentelemetry-specification#3242

@eplightning thanks for mentioning this and for double-checking! I created an issue to implement the spec for asynchronous instruments. #4096

I think we can keep this issue here at waiting-for-spec as we're waiting for https://github.com/open-telemetry/opentelemetry-specification/issues/1297

pkeuter commented 6 months ago

@pichlermarc I was wondering what the status is on this one. There seems to be no way (outside of restarting the application) to flush any old attribute sets that were not observed for a long time. Is there any workaround for this that you know of?

enzo-cappa commented 6 months ago

This is a problem for Kafka Consumer. Due to how they work, partitions are assigned dynamically. Also, metrics are commonly reported per partition.

Due to this issue, when a partition is unassigned OpenTelemetry will keep reporting the "last known value" for the metrics of partitions for which the consumer that is no longer assigned. This causes all kind of problems.

The only workaround we could find is to send "0" as the value for metrics corresponding to partitions that are not currently assigned. This is far from ideal, as 0 is actually a valid value. We end up having some logic when consuming the metrics so we can exclude the invalid values.

pichlermarc commented 6 months ago

@pichlermarc I was wondering what the status is on this one. There seems to be no way (outside of restarting the application) to flush any old attribute sets that were not observed for a long time. Is there any workaround for this that you know of?

@pkeuter there is currently no workaround for this.

The way forward here is to implement #4095 (which would limit memory use as old metrics age out) or drive the specification issue https://github.com/open-telemetry/opentelemetry-specification/issues/1297, then implement the outcome here.

Contributions are welcome, but keep in mind that Implementing #4095 is likely non-trivial.

asafm commented 4 months ago

How about using async instruments? Then you control which partition metrics is reported for.

pichlermarc commented 1 week ago

Removing up-for-grabs as this is not actionable without a specification to implement.