Hi @mgs255! I've been trying to reproduce this but I haven't seen the behavior you described. Of course, I don't have your full cms-api environment, but I've set up some cms-api and cms-api-canary services which point to emojivoto and bb in order to test your HTTPRoute.
More specifically, I installed emojivoto:
linkerd inject https://run.linkerd.io/emojivoto.yml | kubectl apply -f -
and then applied the following manifests in the emojivoto namespace:
---
apiVersion: v1
kind: Service
metadata:
  name: cms-api
spec:
  ports:
  - name: http
    port: 80
    protocol: TCP
    targetPort: 8080
  selector:
    app: web-svc
---
apiVersion: v1
kind: Service
metadata:
  name: cms-api-canary
spec:
  ports:
  - name: http
    port: 80
    protocol: TCP
    targetPort: 8080
  selector:
    app: bb
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: bb
spec:
  replicas: 1
  selector:
    matchLabels:
      app: bb
  template:
    metadata:
      labels:
        app: bb
      annotations:
        linkerd.io/inject: "enabled"
    spec:
      containers:
      - name: app
        image: buoyantio/bb:v0.0.6
        args:
        - terminus
        - "--h1-server-port=8080"
        - "--response-text=hello"
        ports:
        - containerPort: 8080
---
apiVersion: gateway.networking.k8s.io/v1beta1
kind: HTTPRoute
metadata:
  name: cms-api
spec:
  parentRefs:
  - group: core
    kind: Service
    name: cms-api
    port: 80
  rules:
  - backendRefs:
    - group: ""
      kind: Service
      name: cms-api
      port: 80
      weight: 100
    - group: ""
      kind: Service
      name: cms-api-canary
      port: 8080
      weight: 100
    matches:
    - path:
        type: PathPrefix
        value: /
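Not strictly part of the repro, but when comparing environments it may help to confirm that the route was accepted by its Service parent. One way to inspect the route's status conditions, assuming the manifests above were applied to the emojivoto namespace:

# Show the HTTPRoute's status, including the Accepted/ResolvedRefs
# conditions reported by the controller.
kubectl describe httproute cms-api -n emojivoto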
Then, by exec'ing into a shell in an injected pod in the emojivoto namespace, I can curl:
curl cms-api:80
As expected, this will return a response from emojivoto half the time and bb half the time, but the response is always a 200.
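If it helps anyone reproducing this, a quick way to tally status codes over a batch of requests, run from the same shell in an injected pod (the service name matches the manifests above):

# Send 20 requests and count the HTTP status codes; with the route
# healthy this prints a single line: "     20 200".
for i in $(seq 1 20); do
  curl -s -o /dev/null -w "%{http_code}\n" cms-api:80
done | sort | uniq -c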
I think we have 3 potential avenues to explore next:
1. enabling debug logging on the calling proxy and sharing the logs
2. what do your cms-api and cms-api-canary services look like?
3. are all of these resources (services, httproute, and the calling service) all in the same namespace? or do they cross namespace boundaries?

Hi @adleong
I've managed to reproduce the issue with debug logging enabled in our environment. As I mentioned previously, it is very intermittent: it was running for around 30 hours before the failure occurred. I've shared the gzipped file with you via the Linkerd slack as a DM attachment.
All 3 services and the HTTPRoute are located in the same namespace.
Hopefully this will help you get to the root cause. If there is anything else I can do to help, please let me know.
Thanks!
Hi @mgs255! Thanks so much for those logs, they were super helpful. After some investigation, I think this may be due to a race condition related to how backend services are processed by the policy controller. See: https://github.com/linkerd/linkerd2/pull/12635
Ideally, this fix will be released in this week's edge release for you to test.
@adleong that is great news, speedy work! We will keep an eye out for the next edge release.
I notice that as part of that change you added some new debug logs in this area of the proxy. If we were to keep debug logging enabled while we evaluate the release, could you give me a steer as to the log levels we should be targeting for this policy code? I'm not very familiar with Rust's log level specification, and globally setting debug produces very verbose output.
I typically run with a log level of linkerd=debug,info when I'm debugging, which will cause linkerd modules to log at debug level but all other modules to log at info level. It's possible to refine and narrow this further, but that should be a good starting point.
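For anyone else following along: one way to apply that level to a single workload, assuming the standard config.linkerd.io/proxy-log-level annotation and using the cms-api-gateway deployment from this thread as a stand-in, is a patch on the pod template (the annotation lives on the template, not the Deployment metadata, so kubectl annotate on the Deployment won't reach the proxy):

# Set the proxy log level for one deployment; this triggers a rollout
# of its pods with the new proxy log setting.
kubectl patch deployment cms-api-gateway --type merge -p \
  '{"spec":{"template":{"metadata":{"annotations":{"config.linkerd.io/proxy-log-level":"linkerd=debug,info"}}}}}'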
@mgs255 did you get a chance to try the latest edge release with this fix? edge-24.5.4 or later should have this.
We have been seeing this issue as well and will try the fix.
@wmorgan @adleong Yes, indeed! We have been running this version in all of our production environments for at least a week now. We have alerts set up to monitor for the backend "default.fail: HTTP request configured to fail with 500 Internal Server Error: Service not found" error, and we have not seen any recurrences of it so far. 🤞
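For others watching for the same failure without out-of-cluster alerting, a minimal spot-check (assuming the calling workload is the cms-api-gateway deployment from the report below, and the default linkerd-proxy container name):

# Scan the calling proxy's recent logs for the failing-backend error.
kubectl logs deploy/cms-api-gateway -c linkerd-proxy --since=1h \
  | grep "Service not found"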
Thank you both for the speedy turnaround on this. Happy to keep this open for a bit longer to let it soak or close this issue now. I'll let you decide.
Awesome, thank you for the report @mgs255!
Updating to edge-24.5.5 seems to be working for us as well; we haven't run into this issue since.
I will say we have noticed a surge in proxies being unable to connect to other proxies, receiving 503s (when the proxy is in fail-fast mode) or 504s when it fails to connect, and this started shortly after upgrading. I don't know whether it's related to the upgrade; it could just be a coincidence.
Closing, as we have now been running with edge-24.5.5 and haven't had this recur since.
What is the issue?
We are currently running linkerd edge-24.5.2 in our dev clusters and are using HTTPRoute objects to perform traffic splitting. We have been seeing some intermittent failures with request routing when HTTPRoute objects are active: all requests to the parent service fail with 500 errors. We first noticed this behaviour when we upgraded from edge-24.3.2 to edge-24.3.4.
In this case we have an HTTPRoute set up as follows:
When this occurs, all requests to the parent service fail with 500 errors. Deleting or editing the HTTPRoute object temporarily restores service to a working state.
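For anyone hitting this before picking up the fix, the temporary workaround described above amounts to removing (or touching) the route; assuming a route named cms-api as in the reproduction earlier in the thread:

# Deleting the HTTPRoute reverts traffic to normal routing to the
# parent service; recreate the route once you've upgraded.
kubectl delete httproute cms-api -n <namespace>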
How can it be reproduced?
As mentioned, this has occurred only intermittently, but we started noticing it when we upgraded from edge-24.3.2 to edge-24.3.4.
Logs, error output, etc
When this starts to fail, we see the following error messages logged by the linkerd-proxy in the calling service (in this case, a cms-api-gateway pod):
Output of linkerd check -o short:
We do have Prometheus metrics, but Prometheus runs outside the cluster.
Environment
Possible solution
No response
Additional context
I've currently enabled debug logging on the service which was previously failing and will attach additional context when/if I have it.
Would you like to work on fixing this bug?
None