open-telemetry / opentelemetry-go-contrib

Collection of extensions for OpenTelemetry-Go.
https://opentelemetry.io/
Apache License 2.0

otelgrpc high metric cardinality from rpc.server.duration #3071

Closed rcrowe closed 1 year ago

rcrowe commented 1 year ago

Since upgrading to v1.12.0/0.37.0/0.6.0, and specifically https://github.com/open-telemetry/opentelemetry-go-contrib/pull/2700, we've seen an explosion in the cardinality of our metrics.

  1. Metric recorded: https://github.com/open-telemetry/opentelemetry-go-contrib/blob/ad04d9c71cb8974374ba8e3ee454fdaaddc3e345/instrumentation/google.golang.org/grpc/otelgrpc/interceptor.go#L346
  2. Attributes are taken from spanInfo: https://github.com/open-telemetry/opentelemetry-go-contrib/blob/ad04d9c71cb8974374ba8e3ee454fdaaddc3e345/instrumentation/google.golang.org/grpc/otelgrpc/interceptor.go#L331
  3. Peer attributes added: https://github.com/open-telemetry/opentelemetry-go-contrib/blob/ad04d9c71cb8974374ba8e3ee454fdaaddc3e345/instrumentation/google.golang.org/grpc/otelgrpc/interceptor.go#L476-L479

Is the intention that these high-cardinality issues are handled in userland, with us using a view (or blocking at the ingestion layer) to strip labels from certain metrics once we find them problematic? On one hand I can understand that being acceptable; on the other, it feels like pushing the problem downstream to consumers, who will suddenly be hit by a metric that is likely to grow very wide.

CC @fatsheep9146

rcrowe commented 1 year ago

Missed https://github.com/open-telemetry/opentelemetry-specification/blob/56b71774c7ea694e0e492c2ae1120e6df59e5dd6/specification/metrics/semantic_conventions/rpc-metrics.md#attributes which seems to suggest this is working as expected.

To avoid high cardinality, implementations should prefer the most stable of net.peer.name or net.sock.peer.addr, depending on expected deployment profile. For many cloud applications, this is likely net.peer.name as names can be recycled even across re-instantiation of a server with a different ip.

MrAlias commented 1 year ago

Missed https://github.com/open-telemetry/opentelemetry-specification/blob/56b71774c7ea694e0e492c2ae1120e6df59e5dd6/specification/metrics/semantic_conventions/rpc-metrics.md#attributes which seems to suggest this is working as expected.

To avoid high cardinality, implementations should prefer the most stable of net.peer.name or net.sock.peer.addr, depending on expected deployment profile. For many cloud applications, this is likely net.peer.name as names can be recycled even across re-instantiation of a server with a different ip.

Correct, this does look to be something required by the specification. It might be worth opening an issue in the specification repository if you think this shouldn't be a required attribute.

lucasoares commented 1 year ago

How do I remove this metric? It is exposing 7,000 series in the exporter. It should be disabled by default.

regeda commented 1 year ago

After the v0.38.0 release, I see the same issue with the otelhttp package. We are avoiding this upgrade because it drastically increases memory consumption.

MrAlias commented 1 year ago

You can use a View to drop the aggregation entirely or filter the attributes causing your cardinality explosion.

regeda commented 1 year ago

You can use a View to drop the aggregation entirely or filter the attributes causing your cardinality explosion.

Could you share an example how to drop certain attributes from the aggregation?

MrAlias commented 1 year ago

You can use a View to drop the aggregation entirely or filter the attributes causing your cardinality explosion.

Could you share an example how to drop certain attributes from the aggregation?

The linked documentation has an example that should help as a basis:

(screenshot: the attribute-filter View example from the linked documentation)

You'll need to update your match criteria to specifically match what you want though.

regeda commented 1 year ago

I've solved it with the following example (thanks @MrAlias):


import (
    client_prom "github.com/prometheus/client_golang/prometheus"

    "go.opentelemetry.io/otel/attribute"
    "go.opentelemetry.io/otel/exporters/prometheus"
    "go.opentelemetry.io/otel/metric/global"
    "go.opentelemetry.io/otel/sdk/instrumentation"
    metric "go.opentelemetry.io/otel/sdk/metric"
)

func allowedAttr(v ...string) attribute.Filter {
    m := make(map[string]struct{}, len(v))
    for _, s := range v {
        m[s] = struct{}{}
    }
    return func(kv attribute.KeyValue) bool {
        _, ok := m[string(kv.Key)]
        return ok
    }
}

func setGlobalMeterProvider() error {
    exporter, err := prometheus.New(
        prometheus.WithRegisterer(client_prom.DefaultRegisterer),
    )
    if err != nil {
        return err
    }
    global.SetMeterProvider(metric.NewMeterProvider(
        metric.WithReader(exporter),
        metric.WithView(
            metric.NewView(
                metric.Instrument{Scope: instrumentation.Scope{Name: "go.opentelemetry.io/contrib/instrumentation/net/http/otelhttp"}},
                metric.Stream{
                    AttributeFilter: allowedAttr(
                        "http.method",
                        "http.status_code",
                        "http.target",
                    ),
                },
            ),
        ),
    ))
    return nil
}

regeda commented 1 year ago

Another issue :(

A new version of otelhttp (v0.38.0) uses go.opentelemetry.io/otel/semconv/v1.17.0/httpconv.ServerRequest to generate attributes: https://github.com/open-telemetry/opentelemetry-go-contrib/blob/main/instrumentation/net/http/otelhttp/handler.go#L218.

This function writes the whole request URI (including the query string) into http.target. That inflates the metric's cardinality, and the server allocates a lot of memory when the query string is effectively random.

regeda commented 1 year ago

Also, the view's override doesn't help much because the internal aggregation still happens with the following attributes, which drive up memory consumption, and there is no option to prevent it:

  • peer port
  • user agent
  • request uri

MrAlias commented 1 year ago

Also, the view's override doesn't help much because the internal aggregation still happens with the following attributes, which drive up memory consumption, and there is no option to prevent it:

  • peer port
  • user agent
  • request uri

Can you expand on this? I'm not sure how an attribute filter wouldn't help prevent this. The filter wraps the aggregation so attributes filtered out should never reach the underlying aggregation storage.

regeda commented 1 year ago

Also, the view's override doesn't help much because the internal aggregation still happens with the following attributes, which drive up memory consumption, and there is no option to prevent it:

  • peer port
  • user agent
  • request uri

Can you expand on this? I'm not sure how an attribute filter wouldn't help prevent this. The filter wraps the aggregation so attributes filtered out should never reach the underlying aggregation storage.

I think the problem is here: https://github.com/open-telemetry/opentelemetry-go/blob/main/sdk/metric/internal/filter.go#L67

The filter uses the original, unfiltered attribute set as the key in the seen map.

func (f *filter[N]) Aggregate(measurement N, attr attribute.Set) {
    // TODO (#3006): drop stale attributes from seen. // <-- not yet implemented
    f.Lock()
    defer f.Unlock()
    fAttr, ok := f.seen[attr]
    fmt.Println(len(f.seen)) // <-- seen grows with every new attribute set
    if !ok {
        fAttr, _ = attr.Filter(f.filter)
        f.seen[attr] = fAttr // <-- memory leak: keyed by the unfiltered set
    }
    f.aggregator.Aggregate(measurement, fAttr)
}

See https://github.com/open-telemetry/opentelemetry-go/issues/3006

This is my test case:

import (
    ...
    "math/rand"
    ...
)

func TestRandomFilterAggregate(t *testing.T) {
    f := NewFilter[int64](&testStableAggregator[int64]{}, testAttributeFilter)
    for i := 0; i < 1000; i++ {
        set := attribute.NewSet(
            attribute.Int("foo", rand.Int()),
        )
        f.Aggregate(1, set)
    }
}

Another problem is with seen map[attribute.Set]attribute.Set: because attribute.Set is not a comparable type, the following test will consume 2x the memory even though the attribute sets are identical:


func TestRandomFilterAggregate(t *testing.T) {
    f := NewFilter[int64](&testStableAggregator[int64]{}, testAttributeFilter)
    for i := 0; i < 1000; i++ {
        attr := attribute.Int("foo", rand.Int())
        set1 := attribute.NewSet(attr)
        f.Aggregate(1, set1)
        set2 := attribute.NewSet(attr)
        f.Aggregate(1, set2)
    }
}

That is my proposal https://github.com/open-telemetry/opentelemetry-go/pull/3695

lucasoares commented 1 year ago

I just want to be able to disable 100% of the gRPC and HTTP metrics so they are never created and have no impact on my application.

Is there any way to do that? I don't want to have to run benchmarks because of these third-party metrics...

MrAlias commented 1 year ago

I just want to be able to disable 100% of the gRPC and HTTP metrics so they are never created and have no impact on my application.

Is there any way to do that? I don't want to have to run benchmarks because of these third-party metrics...

https://github.com/open-telemetry/opentelemetry-go-contrib/issues/3071#issuecomment-1416137206

liufuyang commented 1 year ago

@MrAlias I think I've encountered the same memory issue. Has https://github.com/open-telemetry/opentelemetry-go/pull/3695 fixed it, and does the current release include it? Thank you :)

Ah, sorry, I just saw https://github.com/open-telemetry/opentelemetry-go/milestone/40; it looks like a fix is very close to being released.

Thank you very much, and I will test again when 0.37 is out 👍

ericwenn commented 1 year ago

Hey, I'm seeing this issue in our services, for both otelhttp and otelgrpc. Filtering attributes with views and using the unreleased fix for memory consumption in this issue alleviates the immediate problems.

However, based on reading the specification I see two, possibly, incorrect things.

  1. For HTTP (otelhttp) some attributes are required only on the client while optional on the server. Specifically, net.peer.port and http.client_id are not required for servers, but are collected on servers here.

  2. For gRPC (otelgrpc) the specification makes no distinction between client and server attributes, but using the HTTP metrics as ground truth, the net.peer.port attribute should not be collected for servers in gRPC either.

@MrAlias I can submit an issue in the specification repo for 2, but I think 1 should be fixed by the Go implementation. Am I missing something?

jaronoff97 commented 1 year ago

Hello all, we're running a collector right now, and we're seeing this issue cause a massive memory leak when converting the metrics to Prometheus. This stems from the collector's new usage of the meter provider here.

IMO this is a massive issue because it means someone could cause OOMs or dropped data by flooding the collector with bad RPCs. Would it be possible to make the inclusion of net.sock.peer.addr, net.sock.peer.port, etc. optional? These attributes are useful for traces, but they cause a massive cardinality explosion in metrics. Because the collector also exports Prometheus histograms, the cardinality we observe is |net.sock.peer.addr| x |net.sock.peer.port| x |buckets|, which results in millions of series.

Users of the collector are unable to implement a custom view as has been recommended on this thread, because the code is already built with this. Given other users are experiencing this issue, it felt prudent to upstream this request to the source.

mackjmr commented 1 year ago

I experienced this when running opentelemetry-demo.

Here is a graph showing the memory usage of the collector pod before and after implementing the filter from https://github.com/open-telemetry/opentelemetry-go-contrib/issues/3071#issuecomment-1419366500 in productcatalogservice (fix pushed at 17:25):

IMO, net.sock.peer.port should be disabled by default, given the cardinality and the significant collector memory usage it leads to when enabled.

cc @pellared as it seems like you are working on a fix to this issue in #3730.