Closed: Sierra1011 closed this issue 1 month ago.
Folding this into #12610
Hi @Sierra1011. That error log from the proxy indicates that it doesn't have any endpoints in the Service.APP.APP:80 backend service to route to. Can you confirm that the service exists and that it has endpoints? You can use kubectl get service and kubectl get endpoints to confirm this. You can also use the linkerd diagnostics endpoints command to see Linkerd's view of what endpoints the service has, if any.
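Concretely, those checks might look something like the following (the <namespace> and <service> placeholders are stand-ins, since the real service name is redacted above):

# Does the Service exist, and does it have ready endpoints?
kubectl -n <namespace> get service <service>
kubectl -n <namespace> get endpoints <service>

# Linkerd's own view of the endpoints for that service and port
linkerd diagnostics endpoints <service>.<namespace>.svc.cluster.local:80

If the diagnostics command returns nothing while kubectl shows endpoints, that points at the destination controller rather than the workload itself.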
Hi @adleong, I'll set up a test similar to the one described in #12610 to troubleshoot this exactly, hopefully today if nothing is on fire :crossed_fingers:
So, I deployed a full stack of emojivoto (emoji, voting, vote-bot, web) in cluster 1, and a deployment of emoji to cluster 2, with a service mirrored to cluster 1. You're right; there are no endpoints shown for the mirrored emoji service, and if I scale down the original emoji deployment, no endpoints are shown at all.
Playing around with curl requests to emoji while running linkerd viz tap on the respective deployments showed that traffic was at least hitting the relevant deployments (a rough sketch of those commands is below).
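For reference, that kind of check might look roughly like this; the namespace, the cluster-2 context name, and the mirrored service name are assumptions based on the setup described above, not the exact commands used:

# Watch live traffic hitting the emoji deployment in each cluster
linkerd viz tap -n emojivoto deploy/emoji
linkerd --context=cluster-2 viz tap -n emojivoto deploy/emoji

# ...while sending test requests from a meshed pod at the mirrored service
# (both the "-cluster-2" suffix and the port are guesses)
curl -sv http://emoji-svc-cluster-2.emojivoto.svc.cluster.local:8080/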
So that seems to be working fine, but I'm not in a position to go back and reimplement our app as it was when I raised this issue (when it was receiving a ton of 5xx errors), so I'll try it elsewhere and come back with some more info.
OK, so this has been a fairly slow chase, I'm afraid.
So, I'm going to talk in real terms rather than about the emojivoto services I'm deploying for funsies. I have some deployments with services on one cluster; let's call them monolith and legacy-assets, and they live in the monolith namespace. monolith depends on legacy-assets being reachable in order to start up.
I'm migrating the deployment of services from one cluster to a new cluster called eks-non-prod-primary. Standard A-to-B stuff.
My intention is to use pod-to-pod multicluster from Linkerd and HTTPRoutes to avoid changing config in the actual app; I can just create the HTTPRoute and dynamically move traffic from the in-cluster service to the new cluster.
So I deploy legacy-assets to the new cluster. It's got remote-discovery enabled, so it creates a Service called legacy-assets-eks-non-prod-primary in the monolith namespace. I make my HTTPRoute:
apiVersion: policy.linkerd.io/v1beta3
kind: HTTPRoute
metadata:
  name: legacy-assets
  namespace: monolith
spec:
  parentRefs:
    - group: core
      kind: Service
      name: legacy-assets
      port: 80
  rules:
    - backendRefs:
        - group: ""
          kind: Service
          name: legacy-assets-eks-non-prod-primary
          port: 80
          weight: 100
        - group: ""
          kind: Service
          name: legacy-assets
          port: 80
          weight: 0
      matches:
        - path:
            type: PathPrefix
            value: /
What should happen is all traffic goes to the other cluster. But what actually happens is I get HTTP 500 responses.
I got this from the linkerd-proxy container (adding line breaks for legibility purposes):
outbound:proxy{addr=10.100.127.47:80}:rescue{client.addr=172.27.198.240:49658}:
linkerd_app_core::errors::respond:
HTTP/1.1 request failed error=logical service 10.100.127.47:80:
route HTTPRoute.monolith.legacy-assets: backend default.fail:
HTTP request configured to fail with 500 Internal Server Error:
Service not found legacy-assets-eks-non-prod-primary
error.sources=[route HTTPRoute.monolith.legacy-assets:
backend default.fail: HTTP request configured to fail with 500 Internal Server Error:
Service not found legacy-assets-eks-non-prod-primary, backend default.fail:
HTTP request configured to fail with 500 Internal Server Error:
Service not found legacy-assets-eks-non-prod-primary,
HTTP request configured to fail with 500 Internal Server Error:
Service not found legacy-assets-eks-non-prod-primary]
(and in one line to preserve the full error from logs)
outbound:proxy{addr=10.100.127.47:80}:rescue{client.addr=172.27.198.240:49658}: linkerd_app_core::errors::respond: HTTP/1.1 request failed error=logical service 10.100.127.47:80: route HTTPRoute.monolith.legacy-assets: backend default.fail: HTTP request configured to fail with 500 Internal Server Error: Service not found legacy-assets-eks-non-prod-primary error.sources=[route HTTPRoute.monolith.legacy-assets: backend default.fail: HTTP request configured to fail with 500 Internal Server Error: Service not found legacy-assets-eks-non-prod-primary, backend default.fail: HTTP request configured to fail with 500 Internal Server Error: Service not found legacy-assets-eks-non-prod-primary, HTTP request configured to fail with 500 Internal Server Error: Service not found legacy-assets-eks-non-prod-primary]
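For what it's worth, a quick cross-check on that "Service not found" (run against the old cluster, where the HTTPRoute and the mirrored Service live; this is a suggested sketch rather than something run above) could be:

# Is the mirrored Service actually present where the HTTPRoute expects it?
kubectl -n monolith get svc legacy-assets-eks-non-prod-primary

# Does Linkerd's destination controller resolve it to any endpoints?
linkerd diagnostics endpoints \
  legacy-assets-eks-non-prod-primary.monolith.svc.cluster.local:80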
The only thing I really have to go on is that we don't have nativeSidecar enabled on these old clusters, and the new ones do. As the pod starts, the container immediately queries the service, but if the proxy isn't ready yet it fails with generic networking errors.
Any suggestions for getting more info out of it?
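One way to coax more detail out of the proxy here, as a sketch: bump the proxy log level on the calling workload and re-trigger the failure. This assumes the standard config.linkerd.io/proxy-log-level annotation, and that monolith is the calling deployment (a guess based on the description above):

# Turn up outbound proxy logging on the calling workload's pod template
kubectl -n monolith patch deploy monolith --type merge -p \
  '{"spec":{"template":{"metadata":{"annotations":{"config.linkerd.io/proxy-log-level":"linkerd=debug,warn"}}}}}'

# Then tail the proxy while reproducing the 500s
kubectl -n monolith logs deploy/monolith -c linkerd-proxy -f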
Alright, I'll hold my hands up here and say there may be a big old "but" here: I upgraded to 24.5.5 a few days ago and saw that it made its way to the top environment without issue. However, it actually got stuck on that particular cluster.
Having fixed that so we're now running a later edge version (I saw a fix mentioned in #12610), we are no longer seeing this error. Please ignore me while I continue testing this on the actual latest version; if I have any issues I'll come back to it.
What is the issue?
When using an HTTPRoute to dynamically redistribute load from one Service to a multicluster mirrored Service, traffic only intermittently transmits correctly.
How can it be reproduced?
- Two clusters, east and west, joined by a multicluster link that mirrors appropriately labelled services deployed in west into east (a rough sketch of this setup follows the list).
- A Service foo in cluster east (but no deployment to receive traffic).
- A Service foo deployed in cluster west and mirrored into east, called foo-west. This should pass traffic to a deployment of something that will return basic acks, e.g. curls.
- An HTTPRoute moving traffic from foo to backendRef foo-east.
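For anyone reproducing, the link setup for the first step might look roughly like this; the kubeconfig context names (east, west) and the namespace for foo are assumptions:

# Link the clusters so that services labelled in west are mirrored into east
linkerd --context=west multicluster link --cluster-name west | \
  kubectl --context=east apply -f -

# Export foo from west (namespace is a guess); use "true" instead of
# "remote-discovery" if routing through the gateway rather than pod-to-pod
kubectl --context=west -n default label svc/foo \
  mirror.linkerd.io/exported=remote-discovery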
Logs, error output, etc
Application curl logs:
Proxy sidecar:
Output of linkerd check -o short:
Environment
Possible solution
No response
Additional context
No response
Would you like to work on fixing this bug?
maybe