linkerd / linkerd2

Ultralight, security-first service mesh for Kubernetes. Main repo for Linkerd 2.x.
https://linkerd.io
Apache License 2.0

Provide prometheus metrics for inbound|outbound router current and maximum capacity #4048

Closed dmitri-lerko closed 4 years ago

dmitri-lerko commented 4 years ago

Feature Request

Provide prometheus metrics for inbound|outbound router current and maximum capacity.

LINKERD2_PROXY_INBOUND_ROUTER_CAPACITY and LINKERD2_PROXY_OUTBOUND_ROUTER_CAPACITY allow the defaults to be overridden; however, as an operator I have no visibility into what the current and maximum values are.

https://github.com/linkerd/linkerd2-proxy/blob/2199e65ae076b6fa55a973fc4e385c4f166c9389/linkerd/app/src/env.rs#L66

What problem are you trying to solve?

When running Linkerd in production, we started getting the following logs from the linkerd-web proxy:

inbound:accept{peer.addr=x.x.x.x:63058}:source{peer.id=None(NoPeerName(NotProvidedByRemote)) target.addr=x.x.x.x:8084}: linkerd2_app_core::errors: Failed to proxy request: router capacity reached (100)

I am worried that this is not the only place where we are hitting the maximum number of routes, but I have no visibility into how close we are to the limit.

How should the problem be solved?

Add four new time series: the maximum for the inbound and outbound routers, and gauges for the current number of inbound and outbound routes. This will allow users to monitor and alert on current vs. maximum.
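
To make the proposal concrete, the four series could look roughly like this on the proxy's metrics endpoint (the metric names are purely illustrative; the proxy does not expose these today):

# TYPE inbound_router_capacity gauge
inbound_router_capacity 100
# TYPE inbound_router_size gauge
inbound_router_size 42
# TYPE outbound_router_capacity gauge
outbound_router_capacity 100
# TYPE outbound_router_size gauge
outbound_router_size 17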

Any alternatives you've considered?

The alternative is log-based monitoring; however, it is reactive, since the proxy only logs once the problem is already occurring and it is too late.

How would users interact with this feature?

Prometheus scrapers, service monitors, linkerd metrics

adleong commented 4 years ago

Hi @dmitri-lerko! There's something strange going on here: the inbound router to linkerd-web should definitely not have 100 entries. Could you please provide us with the metrics from the linkerd-web proxy? They can be retrieved with this command:

linkerd metrics -n linkerd deploy/linkerd-web

dmitri-lerko commented 4 years ago

Hi @adleong

We are running linkerd-web with config.linkerd.io/skip-inbound-ports: "8084" for now to avoid this problem. I will be able to remove this workaround and gather the required metrics tomorrow.

We've hit this issue in production today on our own application and it severely degraded one of our services.

[257813.984474062s]  WARN inbound:accept{peer.addr=addr2:port2}:source{peer.id=None(NoPeerName(NotProvidedByRemote)) target.addr=addr1:port1}: linkerd2_app_core::errors: Failed to proxy request: router capacity reached (100)
[257814.186403358s]  WARN inbound:accept{peer.addr=addr3:port3}:source{peer.id=None(NoPeerName(NotProvidedByRemote)) target.addr=addr1:port1}: linkerd2_app_core::errors: Failed to proxy request: router capacity reached (100)
[257815.90409615s]  WARN inbound:accept{peer.addr=addr4:port4}:source{peer.id=None(NoPeerName(NotProvidedByRemote)) target.addr=addr1:port1}: linkerd2_app_core::errors: Failed to proxy request: router capacity reached (100)

We had to un-mesh this service for the time being until a real solution can be found.

This is a bit sad, as I am absolutely thrilled about Linkerd, so I'll be keen to help resolve this issue to the best of my ability.

adleong commented 4 years ago

We're keen to help! Definitely let us know when you get those metrics and we'll do our best to get to the bottom of this.

dmitri-lerko commented 4 years ago

Hi @adleong

Thank you for your kind support. I've removed config.linkerd.io/skip-inbound-ports: "8084" and navigated through a couple of pages in the dashboard until the problem re-appeared.

Please see the output of linkerd metrics -n linkerd deploy/linkerd-web attached: https://gist.github.com/dmitri-lerko/2f85d97899da54b72657c85fdd1c0dc8

Please note that I've replaced the first two octets with 255.255.

Thank you!

adleong commented 4 years ago

Thanks @dmitri-lerko, these metrics are fascinating. It looks like your linkerd-web pod is being accessed by requests which don't have a Host or Authority header and which use a variety of different IP addresses to reach the pod, hence the large number of distinct values for the dst label in the metrics. Each of these distinct dsts uses a separate slot in Linkerd's router cache, which is why you're hitting the limit of 100.

To answer your original question, you can monitor the number of entries by simply counting the number of distinct values for the dst label.

But to dig a little deeper into the underlying problem, can you tell me a bit more about how your linkerd-web pod is exposed? Is it expected that it is reached via many different IP addresses?

dmitri-lerko commented 4 years ago

Hi @adleong ,

Thank you for explaining what the metrics imply. Would the query below be correct for finding instances where we are running hot on the router cache?

count(count by (dst,pod) (route_request_total{direction="inbound", pod=~".*"})) by (pod) > 75

As we use Prometheus federation, we have metrics from the period when we experienced various issues with the router cache. One of the curious things I can observe is that this metric seems to exceed 100. Does that sound right to you?

Digging deeper, I've analysed the linkerd-web dst labels, and at the moment it seems that the majority (90 out of 92) of unique dst values are IPs of nodes in our cluster. Our cluster is larger than 90 nodes, but not by much. I suspect this could be due to the way Ingress in GKE sets up load balancers: I don't think it targets the node on which the pod is running when sending inbound traffic; instead, it load-balances across all nodes in the cluster, letting each node forward packets to the pod via its iptables entries. I could be off track here, but this seems to align with our observations.

I'd appreciate any feedback on my findings so far while I investigate further. In particular, I am curious about Google's HTTP(S) LB health checks: I know that there are a lot of them and that they all come from different IPs. However, those IPs are typically 35.x.x.x, whereas our dst labels show internal traffic rather than external IPs.

Update: After further investigation, I can see that only pods exposed via Ingress show a high number of internal routes.

adleong commented 4 years ago

My PromQL isn't the strongest, but I think the query is something like this:

count (sum by (pod, dst) (route_request_total{direction="inbound"})) by (pod)

It sounds like these services have been exposed as NodePort services and your LB is load balancing across all the nodes in your cluster, as you said. I don't know the details of how configurable this is in GKE, but it would be ideal if your exposed services were of type ClusterIP or LoadBalancer. I believe this would avoid an extra network hop as well.
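
If it helps for alerting, a Prometheus rule wrapping that query might look roughly like this (the group and alert names and the 75-entry threshold are just illustrative; adjust the threshold to your configured capacity):

groups:
- name: linkerd-router-capacity
  rules:
  - alert: LinkerdInboundRouterCacheNearCapacity
    expr: 'count (sum by (pod, dst) (route_request_total{direction="inbound"})) by (pod) > 75'
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: 'Pod {{ $labels.pod }} is approaching the inbound router cache limit'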

dmitri-lerko commented 4 years ago

Hi @adleong ,

Thank you for this.

Per the GKE Ingress documentation:

"notice that the type is NodePort. This is the required type for an Ingress that is used to configure an HTTP(S) load balancer"

It looks like I am stuck with the NodePort networking.

Question: Is manually modifying LINKERD2_PROXY_INBOUND_ROUTER_CAPACITY and LINKERD2_PROXY_OUTBOUND_ROUTER_CAPACITY the only solution to this?

I could be mistaken, but 100 nodes + GKE should still be a common enough scenario - is it sensible to raise the defaults?

Update: I've applied the following env vars to the deployments exposed via Ingress:

        - name: LINKERD2_PROXY_INBOUND_ROUTER_CAPACITY
          value: "250"
        - name: LINKERD2_PROXY_OUTBOUND_ROUTER_CAPACITY
          value: "250"

It seems to be working. Now I am hitting a new issue: as I would like to alert on saturation of the router capacity, I need to know what the current settings of LINKERD2_PROXY_INBOUND_ROUTER_CAPACITY and LINKERD2_PROXY_OUTBOUND_ROUTER_CAPACITY are.

Is this a reasonable Gauge metric to add to the linkerd-proxy?

adleong commented 4 years ago

I've learned a lot about the GKE Ingress controller today. Fascinating!

Adding such a gauge doesn't seem unreasonable. We're also considering increasing the default inbound router capacity.
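
In the meantime, one way to see what is currently configured is to read it off the injected proxy container's spec, e.g. something along these lines (namespace and pod name are placeholders):

kubectl -n <namespace> get pod <pod-name> -o yaml | grep -A1 ROUTER_CAPACITY

If nothing comes back, no override is set and the proxy is running with its built-in defaults.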

A potentially more robust solution would be to use a dedicated ingress controller like nginx and have linkerd-web sit behind it. The advantage of this is that you can use skip-inbound-ports on nginx to avoid the inbound router cache issue while still keeping linkerd-web fully meshed.
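
As a rough sketch, that would mean annotating the nginx controller's pod template along these lines (the ports are placeholders for whatever ports nginx serves on):

spec:
  template:
    metadata:
      annotations:
        linkerd.io/inject: enabled
        config.linkerd.io/skip-inbound-ports: "80,443"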

dmitri-lerko commented 4 years ago

Thank you for your advice @adleong. I will evaluate the nginx ingress controller; however, I will also keep in mind that GKE Ingress is a popular choice.

Are you also considering increasing the default outbound router capacity? I've seen some of our services hover around the 80 mark, which is quite close to the maximum. What is the actual downside to a higher default value? Does it increase the initial memory allocation?

adleong commented 4 years ago

It doesn't increase initial memory allocation, but it does increase the maximum potential amount of memory that could be consumed.

We've increased the default to 10k in a recent PR which should be part of a Linkerd edge release soon.

adleong commented 4 years ago

The above change is now available in the latest edge release.

dmitri-lerko commented 4 years ago

@adleong thank you for all your help. I tried looking into adding the metrics, but I feel I'd need to familiarise myself with Rust further for my contribution to be of any value. With the above change, the risk of hitting this limit will be very low.

adleong commented 4 years ago

👍 I'm going to close this issue now. Please re-open or get in touch if you have further problems or questions.