linkerd / linkerd2

Ultralight, security-first service mesh for Kubernetes. Main repo for Linkerd 2.x.
https://linkerd.io
Apache License 2.0

Healthchecks/livenessProbe using gRPC in `all-authenticated` environment with `Server` #9595

Open AlexGoris-KasparSolutions opened 1 year ago

AlexGoris-KasparSolutions commented 1 year ago

What is the issue?

I closely followed #7050 and was happy to see it was solved in 2.12.0, so we took the time this week to upgrade our Linkerd installation on our dev cluster (AKS using Azure CNI, if that matters). HTTP/1 health checks indeed worked flawlessly out of the box. Unfortunately, I couldn't get HTTP/2 gRPC health checks working.

How can it be reproduced?

I have a basic .NET gRPC service (created as documented here) extended with a basic health check service (the GrpcGreeter app extended as documented here). Locally I can call the health check service without problems.

I then deploy the Docker image of this build into a namespace which has the following annotations:

config.linkerd.io/default-inbound-policy: deny
linkerd.io/inject: enabled
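
For reference, a minimal Namespace manifest carrying these annotations would look something like the sketch below (the namespace name is just a placeholder, not from our actual setup):

apiVersion: v1
kind: Namespace
metadata:
  name: grpc-greeter   # placeholder name
  annotations:
    config.linkerd.io/default-inbound-policy: deny
    linkerd.io/inject: enabled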

The Deployment resource has the following livenessProbe configured on spec.template.spec.containers[0]:

livenessProbe:
  grpc:
    port: 80
  initialDelaySeconds: 30
  periodSeconds: 10
  timeoutSeconds: 5
  failureThreshold: 3

Then I added the following Service:

apiVersion: v1
kind: Service
metadata:
  labels:
    app: grpc-greeter
  name: grpc-greeter
spec:
  ports:
    - port: 80
      targetPort: 80
  selector:
    app: grpc-greeter

And defined the following Server resource for it:

apiVersion: policy.linkerd.io/v1beta1
kind: Server
metadata:
  name: grpc-greeter
  labels:
    app: grpc-greeter
spec:
  podSelector:
    matchLabels:
      app: grpc-greeter
  port: 80

Logs, error output, etc

I can see the linkerd proxy is blocking the livenessProbe connections in the proxy's log:

[   148.375774s]  INFO ThreadId(01) inbound:accept{client.addr=172.16.36.207:39750}:server{port=80}:http{v=h2}:http{client.addr=172.16.36.207:39750 client.id="-" timestamp=2022-10-11T18:28:14.042295539Z method="POST" uri=http://172.16.36.231:80/grpc.health.v1.Health/Check version=HTTP/2.0 trace_id="" request_bytes="" user_agent="kube-probe/1.24 grpc-go/1.40.0" host=""}:rescue{client.addr=172.16.36.207:39750}: linkerd_app_core::errors::respond: Request failed error=unauthorized request on route
[   148.375780s] DEBUG ThreadId(01) inbound:accept{client.addr=172.16.36.207:39750}:server{port=80}:http{v=h2}:http{client.addr=172.16.36.207:39750 client.id="-" timestamp=2022-10-11T18:28:14.042295539Z method="POST" uri=http://172.16.36.231:80/grpc.health.v1.Health/Check version=HTTP/2.0 trace_id="" request_bytes="" user_agent="kube-probe/1.24 grpc-go/1.40.0" host=""}: linkerd_app_core::errors::respond: Handling error on gRPC connection code=The caller does not have permission to execute the specified operation

output of linkerd check -o short

Linkerd core checks
===================

linkerd-version
---------------
‼ cli is up-to-date
    is running version 2.11.1 but the latest stable version is 2.12.1
    see https://linkerd.io/2.11/checks/#l5d-version-cli for hints

control-plane-version
---------------------
‼ viz extension proxies are up-to-date
    some proxies are not running the current version:
        * metrics-api-595c7b564-7ls6t (stable-2.11.4)
        * prometheus-77b9558b4b-4nqjm (stable-2.11.4)
        * tap-7f8f67546f-x624j (stable-2.11.4)
        * tap-injector-6b6c5c86d4-cqsv5 (stable-2.11.4)
        * web-6756f5956c-z4kdl (stable-2.11.4)
    see https://linkerd.io/2.11/checks/#l5d-viz-proxy-cp-version for hints
‼ viz extension proxies and cli versions match
    grafana-db56d7cb4-qm44p running  but cli running stable-2.11.1
    see https://linkerd.io/2.11/checks/#l5d-viz-proxy-cli-version for hints

Status check results are √

Environment

Possible solution

No response

Additional context

It's important to note that removing the Server resource allows the livenessProbe checks to go through.

I also wanted to mention that my configuration for a working HTTP livenessProbe is exactly the same as the 'How can it be reproduced?' description above, except that the service is an ASP.NET Web API with health checks enabled on the /healthz route, and the Deployment's livenessProbe is configured like so:

livenessProbe:
  httpGet:
    path: /healthz
    port: 80
  initialDelaySeconds: 30
  periodSeconds: 10
  timeoutSeconds: 5
  failureThreshold: 3

Would you like to work on fixing this bug?

No response

adleong commented 1 year ago

Hi @AlexGoris-KasparSolutions. While Linkerd doesn't automatically authorize traffic for gRPC probes yet, you should be able to explicitly authorize these probes by creating an AuthorizationPolicy resource for them. We'll look at adding support for automatic authorization of gRPC probes in the future.
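
For example, one possible shape for such a policy (a sketch only, with placeholder names and a placeholder CIDR; adjust to your cluster) is an AuthorizationPolicy targeting your Server that references a NetworkAuthentication covering the addresses the kubelet's probes come from:

apiVersion: policy.linkerd.io/v1alpha1
kind: NetworkAuthentication
metadata:
  name: kubelet
spec:
  networks:
    # placeholder CIDR: replace with the range kubelet probes actually originate from
    - cidr: 10.0.0.0/8
---
apiVersion: policy.linkerd.io/v1alpha1
kind: AuthorizationPolicy
metadata:
  name: grpc-greeter-probes
spec:
  targetRef:
    group: policy.linkerd.io
    kind: Server
    name: grpc-greeter
  requiredAuthenticationRefs:
    - group: policy.linkerd.io
      kind: NetworkAuthentication
      name: kubelet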

AlexGoris-KasparSolutions commented 1 year ago

@adleong thanks for the suggestion! I read through the docs of the new AuthorizationPolicy resource, and I may be missing something, but I don't see an obvious way to create a policy that identifies and permits the kubelet's probe requests. It's more or less the same issue as described in #7050 for HTTP requests, before those were automatically allowed.

I am also on AKS with Azure CNI networking, meaning that pods and nodes share the same address range, so I can't make an exception based on source IP.

Is there a better way I'm missing? The goal is that no other pods (besides ones specifically allowed through policies) can communicate with this gRPC service, while still letting the livenessProbe requests through.

AlexGoris-KasparSolutions commented 1 year ago

So I was missing something. The new HTTPRoute and AuthorizationPolicy resources allow matching POST requests (which all gRPC requests are) to the health check route and then authorizing those requests for any client. Since I'm not concerned with rogue pods querying the application's health check endpoint, this works for me.

For reference, adding the following on top of what I described in my initial post got things working:

apiVersion: policy.linkerd.io/v1beta1
kind: HTTPRoute
metadata:
  name: grpc-greeter-health-check
spec:
  parentRefs:
    - name: grpc-greeter
      kind: Server
      group: policy.linkerd.io
  rules:
    - matches:
        - path:
            value: "/grpc.health.v1.Health/Check"
          method: POST
---
apiVersion: policy.linkerd.io/v1alpha1
kind: AuthorizationPolicy
metadata:
  name: grpc-greeter-health-check
spec:
  targetRef:
    group: policy.linkerd.io
    kind: HTTPRoute
    name: grpc-greeter-health-check
  requiredAuthenticationRefs: []

stale[bot] commented 1 year ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed in 14 days if no further activity occurs. Thank you for your contributions.

pranoyk commented 1 year ago

Hey @adleong, can you give me a brief overview of what is expected from this issue? Then I can start working on it. Also, it looks like @AlexGoris-KasparSolutions's problem was solved.

AlexGoris-KasparSolutions commented 1 year ago

> Hey @adleong, can you give me a brief overview of what is expected from this issue? Then I can start working on it. Also, it looks like @AlexGoris-KasparSolutions's problem was solved.

Well, I found a workaround, but I wouldn't say the issue is fixed. As mentioned in my previous post, the workaround opens the pod up to health-check requests from any other pod, which in theory could be exploited by rogue pods. Besides that concern, it adds a considerable amount of configuration to our k8s infrastructure, whereas non-gRPC health checks are detected by the linkerd-proxy and allowed automatically (as described in #7050).

pranoyk commented 1 year ago

> Well, I found a workaround, but I wouldn't say the issue is fixed. As mentioned in my previous post, the workaround opens the pod up to health-check requests from any other pod, which in theory could be exploited by rogue pods. Besides that concern, it adds a considerable amount of configuration to our k8s infrastructure, whereas non-gRPC health checks are detected by the linkerd-proxy and allowed automatically (as described in #7050).

Cool, I will have a look at it then.