Hi @mgs255! I've been trying to reproduce this but I haven't seen the behavior you described. Of course, I don't have your full cms-api environment, but I've set up some cms-api and cms-api-canary services which point to emojivoto and bb in order to test your HTTPRoute.
More specifically, I installed emojivoto:
linkerd inject https://run.linkerd.io/emojivoto.yml | kubectl apply -f -
and then applied the following manifests in the emojivoto namespace:
---
apiVersion: v1
kind: Service
metadata:
  name: cms-api
spec:
  ports:
  - name: http
    port: 80
    protocol: TCP
    targetPort: 8080
  selector:
    app: web-svc
---
apiVersion: v1
kind: Service
metadata:
  name: cms-api-canary
spec:
  ports:
  - name: http
    port: 80
    protocol: TCP
    targetPort: 8080
  selector:
    app: bb
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: bb
spec:
  replicas: 1
  selector:
    matchLabels:
      app: bb
  template:
    metadata:
      labels:
        app: bb
      annotations:
        linkerd.io/inject: "enabled"
    spec:
      containers:
      - name: app
        image: buoyantio/bb:v0.0.6
        args:
        - terminus
        - "--h1-server-port=8080"
        - "--response-text=hello"
        ports:
        - containerPort: 8080
---
apiVersion: gateway.networking.k8s.io/v1beta1
kind: HTTPRoute
metadata:
  name: cms-api
spec:
  parentRefs:
  - group: core
    kind: Service
    name: cms-api
    port: 80
  rules:
  - backendRefs:
    - group: ""
      kind: Service
      name: cms-api
      port: 80
      weight: 100
    - group: ""
      kind: Service
      name: cms-api-canary
      port: 8080
      weight: 100
    matches:
    - path:
        type: PathPrefix
        value: /
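Not strictly part of the repro, but when comparing environments it may help to confirm that the route was accepted by its Service parent. One way to inspect the route's status conditions, assuming the manifests above were applied to the emojivoto namespace:

# Show the HTTPRoute's status, including the Accepted/ResolvedRefs
# conditions reported by the controller.
kubectl describe httproute cms-api -n emojivoto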
Then, by exec'ing into a shell in an injected pod in the emojivoto namespace, I can curl:
curl cms-api:80
As expected, this will return a response from emojivoto half the time and bb half the time, but the response is always a 200.
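If it helps anyone reproducing this, a quick way to tally status codes over a batch of requests, run from the same shell in an injected pod (the service name matches the manifests above):

# Send 20 requests and count the HTTP status codes; with the route
# healthy this prints a single line: "     20 200".
for i in $(seq 1 20); do
  curl -s -o /dev/null -w "%{http_code}\n" cms-api:80
done | sort | uniq -c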
I think we have 3 potential avenues to explore next:
1. enabling debug logging on the calling proxy and sharing the logs
2. what do your cms-api and cms-api-canary services look like?
3. are all of these resources (services, httproute, and the calling service) all in the same namespace? or do they cross namespace boundaries?

Hi @adleong
I've managed to reproduce the issue with debug logging enabled in our environment. As I mentioned previously, it is very intermittent: it was running for around 30 hours before the failure occurred. I've shared the gzipped file with you via the Linkerd slack as a DM attachment.
All 3 services and the HTTPRoute are located in the same namespace.
Hopefully this will help you get to the root cause. If there is anything else I can do to help, please let me know.
Thanks!
Hi @mgs255! Thanks so much for those logs, they were super helpful. After some investigation, I think this may be due to a race condition related to how backend services are processed by the policy controller. See: https://github.com/linkerd/linkerd2/pull/12635
Ideally, this fix will be released in this week's edge release for you to test.
@adleong that is great news, speedy work! We will keep an eye out for the next edge release.
I notice that as part of that change you added some new debug logs in this area of the proxy. If we were to keep debug logging enabled while we evaluate the release, could you give me a steer as to the log levels we should be targeting for this policy code? I'm not very familiar with Rust's log level specification, and globally setting debug produces very verbose output.
I typically run with a log level of linkerd=debug,info when I'm debugging, which will cause linkerd modules to log at debug level but all other modules to log at info level. It's possible to refine and narrow this further, but that should be a good starting point.
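For anyone else following along: one way to apply that level to a single workload, assuming the standard config.linkerd.io/proxy-log-level annotation and using the cms-api-gateway deployment from this thread as a stand-in, is a patch on the pod template (the annotation lives on the template, not the Deployment metadata, so kubectl annotate on the Deployment won't reach the proxy):

# Set the proxy log level for one deployment; this triggers a rollout
# of its pods with the new proxy log setting.
kubectl patch deployment cms-api-gateway --type merge -p \
  '{"spec":{"template":{"metadata":{"annotations":{"config.linkerd.io/proxy-log-level":"linkerd=debug,info"}}}}}'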
@mgs255 did you get a chance to try the latest edge release with this fix? edge-24.5.4 or later should have this.
We have been seeing this issue as well and will try the fix.
@wmorgan @adleong Yes, indeed! We have been running this version in all of our production environments for at least a week now. We have alerts set up to monitor for the backend "default.fail: HTTP request configured to fail with 500 Internal Server Error: Service not found" error, and we have not seen any recurrences of it so far. 🤞
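For others watching for the same failure without out-of-cluster alerting, a minimal spot-check (assuming the calling workload is the cms-api-gateway deployment from the report below, and the default linkerd-proxy container name):

# Scan the calling proxy's recent logs for the failing-backend error.
kubectl logs deploy/cms-api-gateway -c linkerd-proxy --since=1h \
  | grep "Service not found"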
Thank you both for the speedy turnaround on this. Happy to keep this open for a bit longer to let it soak or close this issue now. I'll let you decide.
Awesome, thank you for the report @mgs255!
Updating to edge-24.5.5 seems to be working for us as well; we haven't run into this issue since.
I will say we have noticed a surge in proxies being unable to connect to other proxies, receiving 503s (when the proxy is in fail-fast mode) or 504s when it fails to connect, and this started shortly after upgrading. I don't know whether it's related to the upgrade; it could just be a coincidence.
Closing, as we have now been running with edge-24.5.5 and haven't had this recur since.
What is the issue?
We are currently running linkerd edge-24.5.2 in our dev clusters and are using HTTPRoute objects to perform traffic splitting. We have been seeing some intermittent failures with request routing when HTTPRoute objects are active: all requests to the parent service fail with 500 errors. We first noticed this behaviour when we upgraded from edge-24.3.2 to edge-24.3.4.
In this case we have an HTTPRoute set up as follows:
When this occurs, all requests to the parent service fail with 500 errors. Deleting or editing the HTTPRoute object temporarily restores service to a working state.
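For anyone hitting this before picking up the fix, the temporary workaround described above amounts to removing (or touching) the route; assuming a route named cms-api as in the reproduction earlier in the thread:

# Deleting the HTTPRoute reverts traffic to normal routing to the
# parent service; recreate the route once you've upgraded.
kubectl delete httproute cms-api -n <namespace>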
How can it be reproduced?
As mentioned, this has occurred only intermittently, but we started noticing it when we upgraded from edge-24.3.2 to edge-24.3.4.
Logs, error output, etc
When this starts to fail, we see the following error messages logged by the linkerd-proxy in the calling service (in this case, a cms-api-gateway pod):
Output of linkerd check -o short:
We do have Prometheus metrics, but Prometheus runs outside the cluster.
Environment
Possible solution
No response
Additional context
I've currently enabled debug logging on the service which was previously failing and will attach additional context when/if I have it.
Would you like to work on fixing this bug?
None