Closed rcrowe closed 1 year ago
Missed https://github.com/open-telemetry/opentelemetry-specification/blob/56b71774c7ea694e0e492c2ae1120e6df59e5dd6/specification/metrics/semantic_conventions/rpc-metrics.md#attributes which seems to suggest this is working as expected:

> To avoid high cardinality, implementations should prefer the most stable of `net.peer.name` or `net.sock.peer.addr`, depending on expected deployment profile. For many cloud applications, this is likely `net.peer.name` as names can be recycled even across re-instantiation of a server with a different ip.
Correct, this does look to be something required by the specification. It might be worth opening an issue in the specification repository if you think this shouldn't be a required attribute.
How can this metric be removed? It is exposing 7000 metrics in the exporter. It should be disabled by default.
After the v0.38.0 release, I see the same issue with the otelhttp package. We are avoiding this upgrade because it blows up memory consumption drastically.
You can use a View to drop the aggregation entirely or filter the attributes causing your cardinality explosion.
Could you share an example how to drop certain attributes from the aggregation?
The linked documentation has an example that should help as a basis:
You'll need to update your match criteria to specifically match what you want though.
I've done it with the following example (thanks @MrAlias):
```go
// Import paths reflect the SDK layout at the time of this comment
// (global meter provider, views in go.opentelemetry.io/otel/sdk/metric).
import (
	"go.opentelemetry.io/otel/attribute"
	"go.opentelemetry.io/otel/exporters/prometheus"
	"go.opentelemetry.io/otel/metric/global"
	"go.opentelemetry.io/otel/sdk/instrumentation"
	metric "go.opentelemetry.io/otel/sdk/metric"

	client_prom "github.com/prometheus/client_golang/prometheus"
)

// allowedAttr returns a filter that keeps only the listed attribute keys.
func allowedAttr(v ...string) attribute.Filter {
	m := make(map[string]struct{}, len(v))
	for _, s := range v {
		m[s] = struct{}{}
	}
	return func(kv attribute.KeyValue) bool {
		_, ok := m[string(kv.Key)]
		return ok
	}
}

func setGlobalMeterProvider() error {
	exporter, err := prometheus.New(
		prometheus.WithRegisterer(client_prom.DefaultRegisterer),
	)
	if err != nil {
		return err
	}
	global.SetMeterProvider(metric.NewMeterProvider(
		metric.WithReader(exporter),
		metric.WithView(
			metric.NewView(
				metric.Instrument{Scope: instrumentation.Scope{Name: "go.opentelemetry.io/contrib/instrumentation/net/http/otelhttp"}},
				metric.Stream{
					AttributeFilter: allowedAttr(
						"http.method",
						"http.status_code",
						"http.target",
					),
				},
			),
		),
	))
	return nil
}
```
Another issue :(

A new version of otelhttp (v0.38.0) uses go.opentelemetry.io/otel/semconv/v1.17.0/httpconv.ServerRequest for generating attributes: https://github.com/open-telemetry/opentelemetry-go-contrib/blob/main/instrumentation/net/http/otelhttp/handler.go#L218. This function writes the whole request URI (including the query string) into http.target. It affects the cardinality of the metric, and the server allocates a lot of memory if the query string is constantly random.
Also, the view's override doesn't help too much because the internal aggregation still happens with the following attributes, which impact memory consumption, and there is no option to prevent it:
- peer port
- user agent
- request uri
Can you expand on this? I'm not sure how an attribute filter wouldn't help prevent this. The filter wraps the aggregation so attributes filtered out should never reach the underlying aggregation storage.
I think the problem is here: https://github.com/open-telemetry/opentelemetry-go/blob/main/sdk/metric/internal/filter.go#L67. The filter uses the original attribute set as the key in the seen map.
```go
func (f *filter[N]) Aggregate(measurement N, attr attribute.Set) {
	// TODO (#3006): drop stale attributes from seen. // <-- not yet implemented
	f.Lock()
	defer f.Unlock()
	fAttr, ok := f.seen[attr]
	fmt.Println(len(f.seen)) // <-- seen is growing with a random attribute set
	if !ok {
		fAttr, _ = attr.Filter(f.filter)
		f.seen[attr] = fAttr // <-- memory leak
	}
	f.aggregator.Aggregate(measurement, fAttr)
}
```
See https://github.com/open-telemetry/opentelemetry-go/issues/3006
this is my test case:
```go
import (
	...
	"math/rand"
	...
)

func TestRandomFilterAggregate(t *testing.T) {
	f := NewFilter[int64](&testStableAggregator[int64]{}, testAttributeFilter)
	for i := 0; i < 1000; i++ {
		set := attribute.NewSet(
			attribute.Int("foo", rand.Int()),
		)
		f.Aggregate(1, set)
	}
}
```
Another problem is with `seen map[attribute.Set]attribute.Set`: because `attribute.Set` is not a comparable type, the following test will consume 2x memory even if the attribute set is the same:
```go
func TestRandomFilterAggregate(t *testing.T) {
	f := NewFilter[int64](&testStableAggregator[int64]{}, testAttributeFilter)
	for i := 0; i < 1000; i++ {
		attr := attribute.Int("foo", rand.Int())
		set1 := attribute.NewSet(attr)
		f.Aggregate(1, set1)
		set2 := attribute.NewSet(attr)
		f.Aggregate(1, set2)
	}
}
```
That is my proposal https://github.com/open-telemetry/opentelemetry-go/pull/3695
I just want to be able to disable 100% of the gRPC and HTTP metrics from ever being created, so they have no impact at all on my application.
Is there any way to do that? I don't want to have to run bench tests because of these third-party metrics...
https://github.com/open-telemetry/opentelemetry-go-contrib/issues/3071#issuecomment-1416137206
@MrAlias I think I've encountered the same memory issue. Does https://github.com/open-telemetry/opentelemetry-go/pull/3695 fix it, and does the current release include it? Thank you :)
Ah, sorry, I just saw https://github.com/open-telemetry/opentelemetry-go/milestone/40; it looks like we are very close to releasing a fix for this.
Thank you very much, and I will test again when 0.37 is out 👍
Hey, I'm seeing this issue in services, both for otelhttp and otelgrpc.
Filtering attributes with views and using the unreleased fix for memory consumption in this issue alleviates the immediate problems.
However, based on reading the specification, I see two possibly incorrect things:
1. For HTTP (otelhttp), some attributes are required only on the client while optional on the server. Specifically, net.peer.port and http.client_id are not required for servers, but are collected on servers here.
2. For gRPC (otelgrpc), the specification makes no distinction between client and server attributes, but using the HTTP metrics as ground truth, the net.peer.port attribute should not be collected for servers in gRPC either.
@MrAlias I can submit an issue in the specification repo for 2, but I think 1 should be fixed by the Go implementation. Am I missing something?
Hello all, we're running a collector right now and we're seeing this issue cause a massive memory leak when converting the metric to prometheus. This stems from the new usage of the meter provider here in the collector.
IMO this is a massive issue because it means someone could cause OOMs/dropped data by flooding the collector with bad RPCs. Would it be possible to make the inclusion of the net.sock.peer.addr, net.sock.peer.port, etc. attributes optional? These attributes are useful for traces, but cause massive cardinality explosion in metrics. Because the collector also exports Prometheus histograms, the cardinality we observe is |net.sock.peer.addr| x |net.sock.peer.port| x |buckets|, which results in millions of metrics being created.
Users of the collector are unable to implement a custom view as has been recommended on this thread, because the code is already built with this. Given other users are experiencing this issue, it felt prudent to upstream this request to the source.
I experienced this when running opentelemetry-demo.
Here is a graph showing the memory usage of the collector pod before / after implementing the filter from https://github.com/open-telemetry/opentelemetry-go-contrib/issues/3071#issuecomment-1419366500 in productcatalogservice (fix pushed at 17h25):
IMO, net.sock.peer.port should be disabled by default, given the cardinality and the significant memory usage in the collector that it leads to when enabled.
cc @pellared as it seems like you are working on a fix to this issue in #3730.
Since upgrading to v1.12.0/0.37.0/0.6.0 & specifically https://github.com/open-telemetry/opentelemetry-go-contrib/pull/2700 we've seen an explosion in the cardinality of our metrics.
spanInfo: https://github.com/open-telemetry/opentelemetry-go-contrib/blob/ad04d9c71cb8974374ba8e3ee454fdaaddc3e345/instrumentation/google.golang.org/grpc/otelgrpc/interceptor.go#L331

Is the intention that these types of high-cardinality issues are handled in userland, by us using a view (or blocking at the ingestion layer) to strip labels from certain metrics after we find them problematic? On one hand I can understand that being acceptable; on the other, this feels like pushing a problem downstream to consumers who will suddenly get hit by a metric that is most likely going to grow very wide.
CC @fatsheep9146