linkerd / linkerd2

Ultralight, security-first service mesh for Kubernetes. Main repo for Linkerd 2.x.
https://linkerd.io
Apache License 2.0

The authority label in the proxy metrics has high cardinality in certain situations #5746

Open fpetkovski opened 3 years ago

fpetkovski commented 3 years ago

Bug Report

I was not sure whether to report this as a bug or a feature request. Please let me know whether I need to migrate the issue elsewhere.

What is the issue?

The authority label in the proxy metrics has high cardinality values in certain situations.

How can it be reproduced?

We have a SaaS product which handles customer traffic by using a subdomain for each tenant. We have 3000 customers at the moment, which means the authority label in the proxy metrics has 3K distinct values. Combined with the target_addr label, the cross product of the two easily reaches 300K distinct time series. In addition, the latency metric, which is one of the golden signals, is a histogram with more than 10 buckets. This combination produces a total cardinality of >3M for just that one metric.
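The arithmetic behind these figures can be sketched as follows. This is only an illustration: the count of 100 target_addr values is an assumption inferred from the 3K and 300K figures above, and 10 buckets is a lower bound since the histogram has more than 10.

```python
def series_count(authorities: int, target_addrs: int, buckets: int = 1) -> int:
    """Worst-case number of time series when every label combination occurs."""
    return authorities * target_addrs * buckets


AUTHORITIES = 3000   # one subdomain per tenant (from the report)
TARGET_ADDRS = 100   # assumption: implied by 300K / 3K, not stated in the report
BUCKETS = 10         # lower bound; the histogram has more than 10 buckets

# Series for a plain counter or gauge carrying both labels.
print(series_count(AUTHORITIES, TARGET_ADDRS))           # 300000

# Series for the latency histogram: one per bucket per label combination.
print(series_count(AUTHORITIES, TARGET_ADDRS, BUCKETS))  # 3000000

# With a single fixed authority value, the same histogram shrinks to:
print(series_count(1, TARGET_ADDRS, BUCKETS))            # 1000
```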

Environment

Possible solution

In the ideal case, the metric labels from the proxy would be configurable, allowing the user to decide how granular they want to go.

Another solution would be to completely exclude the authority label from the proxy metrics.

Additional context

For the time being, we decided to run a custom build of the proxy with a hardcoded authority value in order to move forward with our Linkerd rollout. However, we are hoping that this issue can be resolved so that we can avoid the overhead of maintaining a permanent fork.

In our situation, the authority label is really an application-specific piece of information that should be instrumented using a different system. It is fundamentally incompatible with Prometheus, which is good at answering questions about the operational health of a particular service as a whole.

I also found an issue which raises a similar concern, but was closed at the time due to a lack of matching use cases: https://github.com/linkerd/linkerd2/issues/1378

adleong commented 3 years ago

@fpetkovski this is a valid concern and something we need to address specifically for ingress, since traffic will be coming from an untrusted source.

@olix0r do you think it would make sense to omit the authority label on the inbound proxy when LINKERD2_PROXY_INGRESS_MODE is set? I would expect the ingress controller to set a service authority (or dst-override) so we probably want to keep the authority label for the outbound proxy regardless.

@fpetkovski in the meantime, an easier workaround than maintaining a fork of the proxy would be to update your Prometheus scrape config to drop the authority label by adding a labeldrop action.

fpetkovski commented 3 years ago

Hey @adleong, we tried dropping the label, but unfortunately this approach did not work. Removing the authority label leads to duplicated time series because the label sets within a metric family are no longer distinct. Prometheus deals with this case by simply ignoring every duplicate time series except the last one. As a result, the RPS metric drops significantly and the error rate also ends up being incorrect.
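The effect described here can be modeled with a small sketch. It is not Prometheus code: it only imitates the "last duplicate wins" behavior reported in this comment, and the hostnames, address, and request counts are made up for illustration. How a real Prometheus server handles duplicates may differ by version.

```python
from collections import defaultdict

# Hypothetical scraped samples: (labels, request count) for one backend.
samples = [
    ({"authority": "a.example.com", "target_addr": "10.0.0.1:8080"}, 120),
    ({"authority": "b.example.com", "target_addr": "10.0.0.1:8080"}, 80),
    ({"authority": "c.example.com", "target_addr": "10.0.0.1:8080"}, 50),
]


def drop_label(labels, name):
    """Return the label set without `name`, as a hashable series key."""
    return tuple(sorted((k, v) for k, v in labels.items() if k != name))


def last_wins(samples, name):
    """Model of the reported behavior: colliding series overwrite each other."""
    out = {}
    for labels, value in samples:
        out[drop_label(labels, name)] = value
    return out


def summed(samples, name):
    """What a correct aggregation over the dropped label would produce."""
    out = defaultdict(int)
    for labels, value in samples:
        out[drop_label(labels, name)] += value
    return dict(out)


key = drop_label(samples[0][0], "authority")
print(last_wins(samples, "authority")[key])  # 50, most traffic is lost
print(summed(samples, "authority")[key])     # 250, the true total
```

Dropping a label at scrape time only renames series; it does not add their values together, which is why the totals come out too low.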

olix0r commented 3 years ago

Given the constraints around label rewriting, I agree we should support a way of limiting the cardinality of authority values in metric names.

A few more questions:

marcelovcpereira commented 3 years ago

Hi @olix0r, Filip is away for a few days, but I can answer those questions for you.

mmiller1 commented 3 years ago

I'll chime in, given that we ended up dropping these labels in the included Prometheus scrape configurations quite some time ago for the same reasons mentioned here (we have not, however, experienced the issue with duplicate series that @fpetkovski mentioned). We saw these labels exploding the number of head time series that Prometheus stores in memory, for both ingress traffic and outbound traffic talking to other services. The latter had the more serious impact on our deployments, so removing the labels only when the proxy is in ingress mode would not help us much. We would love to see an option to disable the labels per workload, retaining the flexibility to enable them for individual services down the road.

stale[bot] commented 3 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed in 14 days if no further activity occurs. Thank you for your contributions.

xdvpser commented 1 year ago

Hi! Is there any work going on for this? It caused a major headache for us. Forking the linkerd-proxy project was not an option, so we ended up optimizing Linkerd metrics based on our needs. Ideally, there would be a native option to set the authority label to some default value or to disable it completely, like the --metrics-per-host flag in the NGINX Ingress Controller project.

mmiller1 commented 10 months ago

I wanted to chime back in here. The workaround of dropping the label in Prometheus that we had used back in 2021 either never worked and was misrepresenting our statistics, or stopped working with newer Linkerd versions (I'm leaning towards the former, now armed with a better understanding of how the label is used). We're still in a situation where several of our deployments cannot reasonably be monitored with Linkerd metrics because of the high cardinality caused by these labels. It would be awesome if we could add labels to the Service objects that would allow specifying either a static authority override or a regular expression that extracts portions of the Host header to be applied as the authority.
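A minimal sketch of what such an override could look like is below. This is not an existing Linkerd feature: the function name, its parameters, and the example pattern are all hypothetical, and the sketch only shows how a static value or a regular expression could collapse per-tenant hosts into one authority value.

```python
import re
from typing import Optional


def normalize_authority(
    host: str,
    static_override: Optional[str] = None,
    pattern: Optional[re.Pattern] = None,
) -> str:
    """Hypothetical helper: pick the value to record as the authority label.

    A static override takes precedence. Otherwise the first capture group of
    `pattern` is used if it matches. Otherwise the host is returned unchanged.
    """
    if static_override is not None:
        return static_override
    if pattern is not None:
        match = pattern.search(host)
        if match:
            return match.group(1)
    return host


# Example pattern (an assumption): strip the leading tenant subdomain.
strip_tenant = re.compile(r"^[^.]+\.(.+)$")

print(normalize_authority("tenant42.example.com", pattern=strip_tenant))
# example.com
print(normalize_authority("tenant42.example.com", static_override="web"))
# web
print(normalize_authority("localhost", pattern=strip_tenant))
# localhost
```

Either mode maps thousands of tenant hostnames to a handful of label values, so the series count no longer grows with the number of tenants.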