linkerd-failover extension should support HTTPRoute for pod-to-pod cluster linking

pbaranow commented 9 months ago

What problem are you trying to solve?

Linkerd 2.14 introduced pod-to-pod cluster linking, which simplifies setup and removes the need for additional load balancers. With this change, routing of traffic to pods running in a different cluster is configured using HTTPRoute objects. TrafficSplit is no longer supported in this setup, so it's impossible to set up a configuration where all requests are handled locally and routed to a different cluster only when local pods are unavailable (i.e. failover to a different cluster).

How should the problem be solved?

Please update the linkerd-failover extension to monitor and update HTTPRoute objects in a manner similar to how it works with TrafficSplit:

monitor HTTPRoute objects that carry the appropriate label
allow setting one backend with weight 1 and the other with weight 0
let the linkerd-failover extension change the weights when the local backend becomes unavailable
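
The behavior requested in the list above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the extension's actual code: the `Backend` type, the `apply_failover` function, and the backend names are hypothetical, and the route's backends are modeled as a plain list rather than a real HTTPRoute object.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Backend:
    """Hypothetical stand-in for one backend entry of an HTTPRoute rule."""
    name: str
    weight: int


def apply_failover(backends, primary, ready):
    """Return a new backend list with failover weights applied.

    `primary` is the name of the local backend; `ready` is the set of
    backend names that currently have ready pods (both are assumptions
    made for this sketch).
    """
    if primary in ready:
        # Normal operation: all traffic stays on the local backend.
        return [replace(b, weight=1 if b.name == primary else 0) for b in backends]
    # Primary is down: shift weight to every secondary that is ready.
    fallback = [replace(b, weight=1 if b.name in ready else 0) for b in backends]
    # If nothing is ready, leave the weights unchanged instead of zeroing all.
    if all(b.weight == 0 for b in fallback):
        return list(backends)
    return fallback
```

For example, with backends `web` (weight 1) and `web-remote` (weight 0), calling `apply_failover(backends, "web", {"web-remote"})` would flip the weights to 0 and 1, which is the true failover that a static weighted split cannot provide.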

Any alternatives you've considered?

It is possible to set highly disproportionate backend weights, e.g. 99 and 1, but this would not be a true failover, as 1% of the traffic would always be routed to a different cluster.

How would users interact with this feature?

This would be an in-place replacement for the current setup, which uses TrafficSplit, so there would be no change in how users work with this functionality.

Would you like to work on this feature?

no

wmorgan commented 9 months ago

TrafficSplit should work as before with pod-to-pod cluster linking, and the failover extension should continue to work as well. Are you seeing something different?

aaguilartablada commented 9 months ago

Hello! Let me participate in this discussion, please.

@wmorgan, even though TrafficSplit is expected to keep working the same way, are there plans to add HTTPRoute support to the Linkerd failover extension?

wmorgan commented 9 months ago

Yes, eventually we will have parity between the two implementations, and we'll likely move the failover extension to use HTTPRoute, which is a bit more flexible. But there is no timeline for that currently.

pbaranow commented 8 months ago

@wmorgan I'm sorry for the late response. I'd need to set up my test environment again to collect more details. For now I can only rely on my memory and a short description of the setup.

We were looking into setting up a service mesh with pod-to-pod networking, with EKS clusters running in different regions. Each cluster would handle local traffic and switch requests over to the other region if the local service stopped working. In our setup we use the Apisix API Gateway as the entry point to the cluster (with an ALB in front of it). Apisix is inside the mesh and forwards requests to application services that are also inside the mesh.

When we set up this environment with via-gateway communication (i.e. as it was in 2.13), we could configure TrafficSplit and have the linkerd-failover extension switch requests, as described in the Linkerd documentation: https://linkerd.io/2.14/tasks/automatic-failover/

Then we reconfigured the link to use pod-to-pod communication and configured TrafficSplit. When pods on the 'local' cluster were stopped, every request coming to Apisix failed with a 502 (I believe) and a message that endpoints were unavailable, even though linkerd diagnostics listed endpoints from the other cluster.

Then I removed the TrafficSplit object and configured an HTTPRoute with a 50/50 split. This way requests were split 50/50 between the clusters. Then I stopped the pods on the 'local' cluster, and requests were correctly routed to the other cluster.

It seems to me, then, that something is wrong when TrafficSplit and failover are set up on clusters linked with pod-to-pod communication.

Majkel1999 commented 8 months ago

Hi! Adding to the answer above:

I've set up the same environment, i.e. 2 clusters linked with pod-to-pod communication. During my testing I found the following:

Requests made to the mirrored service work fine.
Requests made to the local service, with a TrafficSplit configured 50/50, work fine.
Requests made to the service controlled by linkerd-failover stop working when the local service becomes unavailable.

Reading through the documentation, it seems that this extension relies on the Endpoints object to check pod readiness. This can't work with pod-to-pod linking, as neither an Endpoints nor an EndpointSlice object is created for the mirrored service.
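
To illustrate why this would break failover, here is a minimal sketch, assuming readiness is derived purely from the Endpoints object as described above. The function name and the dict-based input are hypothetical and are not the extension's real code; only the `subsets` / `addresses` / `notReadyAddresses` field names follow the layout of a Kubernetes Endpoints object.

```python
def has_ready_endpoints(endpoints):
    """Return True if an Endpoints-shaped dict lists at least one ready address.

    `endpoints` is None when no Endpoints object exists for the Service,
    which is the pod-to-pod (remote-discovery) case described above.
    """
    if endpoints is None:
        return False
    # Only `addresses` holds ready pods; `notReadyAddresses` is ignored.
    return any(subset.get("addresses") for subset in endpoints.get("subsets") or [])
```

Under that assumption, a mirrored Service without an Endpoints object would always look unavailable to such a check, even while the remote pods are healthy, which would match the behavior observed in this thread.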

stale[bot] commented 5 months ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed in 14 days if no further activity occurs. Thank you for your contributions.

bruno-notifi commented 3 months ago

The same thing is happening to me. I am managing 4 clusters.

Clusters A and B are linked using the gateway.
Clusters C and D are linked using pod-to-pod.

In summary, failover works between A and B, but not between C and D.

The Service from D is annotated with mirror.linkerd.io/exported: "remote-discovery", and everything else works fine: I can hit the exported service directly. Unfortunately, failover does not work when the Service from C is not available.

Note: the documentation says to annotate the exported Service with mirror.linkerd.io/exported: "true".
