brendanalexdr opened 1 year ago
RoundRobin operates on Destinations, and you've only supplied one. It sounds like another component is doing DNS or TCP load balancing underneath?
```json
"LoadBalancingPolicy": "RoundRobin",
"Destinations": {
  "destination1": {
    "Address": "http://mytestwebapp:5023/"
  }
}
```
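RoundRobin only rotates when the cluster lists more than one destination. A minimal sketch of what a multi-destination cluster could look like (the cluster name and the second address here are placeholders, not taken from the reporter's setup):

```json
"cluster1": {
  "LoadBalancingPolicy": "RoundRobin",
  "Destinations": {
    "destination1": { "Address": "http://mytestwebapp-1:5023/" },
    "destination2": { "Address": "http://mytestwebapp-2:5023/" }
  }
}
```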
Ok, this is precisely why I was testing. But... in a typical clustered environment, across many nodes and with changing deployment replica counts, how do you configure destinations? So YARP can't do load balancing in a dynamic clustered environment?
FYI, DNS is being handled by Windows 11 on my dev box. There's no underlying load balancing going on under the hood. I was thinking YARP would handle this.
You need a mechanism to resolve the destinations by talking to whatever is doing the dynamic clustering - such as Kubernetes, for which there is a YARP ingress controller. One of the reasons we have the extensibility in YARP is to enable customers to write configuration management that pulls the data from their backend systems.
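That extensibility is easiest to see with YARP's in-memory config provider: an external watcher (whatever polls your orchestrator) can push new destinations at runtime. A sketch, assuming YARP's `LoadFromMemory` and `InMemoryConfigProvider.Update` APIs; the addresses and IDs below are made-up placeholders:

```csharp
// Sketch: seed YARP from in-memory config; a background component that
// watches the orchestrator can later call Update() with fresh destinations.
var routes = new[]
{
    new RouteConfig
    {
        RouteId = "route1",
        ClusterId = "cluster1",
        Match = new RouteMatch { Path = "{**catch-all}" }
    }
};

var clusters = new[]
{
    new ClusterConfig
    {
        ClusterId = "cluster1",
        LoadBalancingPolicy = "RoundRobin",
        Destinations = new Dictionary<string, DestinationConfig>
        {
            // Placeholder addresses - replace with whatever your
            // backend system reports for the current replicas.
            ["d1"] = new DestinationConfig { Address = "http://10.0.0.10:5023/" },
            ["d2"] = new DestinationConfig { Address = "http://10.0.0.11:5023/" }
        }
    }
};

builder.Services.AddReverseProxy().LoadFromMemory(routes, clusters);

// Later, when replica counts change:
// app.Services.GetRequiredService<InMemoryConfigProvider>()
//     .Update(newRoutes, newClusters);
```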
Ok, I got it. So basically, if I understand correctly, in the case of my Docker Swarm test environment, I would need to use something like HAProxy to mediate the round robin with the microservices.
Docker Swarm is similar to Kubernetes in that it manages where the service instances live and how to route to them. You can either use its built-in routing or configure it to export that data via DNS.
The part that is missing from YARP is a DNS provider that will resolve a DNS name to its addresses and regularly re-poll DNS to check them. YARP's config is a little confusing in that you can specify a destination via a hostname, but we expect that to resolve to a single host.
We need a DNS provider similar to HAProxy's, where you can configure the DNS server and the names to be resolved. YARP would then actively poll DNS to update the host list. AFAIK there is no notification system for DNS, so you need to poll, which means the list will always be a little out of date, depending on how often instances are created and destroyed.
Keep this open in case #2154 doesn't resolve all the issues
Similar issue here.
Sketch of the environment:
When skipping YARP by adding an nginx ingress to go directly onto the service, it works just as expected! Due to architectural reasons, this sadly does not suffice as a workaround in my case.
Maybe some sort of keep-alive that is added by YARP and makes the k8s service forward the request to the same pod all the time? Sadly I'm currently not able to properly capture traffic between YARP and the k8s service.
The k8s Service load balancing is TCP connection based, not HTTP request based, right?
YARP will reuse connections as much as possible, so you'll only get new connections when there is high concurrency. Once there are multiple connections, I assume it still prefers the first one when it's available. This can't really be fixed without moving the load balancing to YARP. The other way is to disable connection re-use but that would cause a number of issues.
Hey @Tratcher, we're running into this exact issue using YARP as our API Gateway with destinations pointed to k8s services.
We're about to test disabling the connection re-use in YARP, and I was hoping you could expand on what types of issues we may encounter. Thanks for your attention to this issue.
Disabling connection reuse will cause higher latency, resource usage, and potentially port exhaustion when under heavy load.
Thanks a lot for the response @Tratcher, much appreciated. We will try to test with the connection re-use disabled, but as you previously said, this does not seem like a viable option for a prod environment under heavy load.
If we're unable to access the k8s pods directly from YARP to make use of YARP's load balancing, it appears we may be out of options to resolve this k8s service load balancing issue.
Do you know if there are ongoing plans/efforts to release the Yarp.Kubernetes.Controller project or has this been abandoned? https://github.com/microsoft/reverse-proxy/blob/main/docs/docfx/articles/kubernetes-ingress.md
Thanks again!
That's a question for @MihaZupan.
Have you tried using the new destination resolvers feature?
```diff
  services.AddReverseProxy()
+     .AddDnsDestinationResolver() // You may have to lower the frequency - default is 5 min
```
This would expand the list of destinations YARP sees from the hostnames (service names) to all the addresses returned by DNS. If that returned multiple available pods, YARP's round-robin load balancing should rotate between those.
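For completeness, a hedged sketch of wiring this up with a shorter refresh period (this assumes a `RefreshPeriod` option on the resolver's options type; verify the exact option name against your YARP version):

```csharp
builder.Services.AddReverseProxy()
    .LoadFromConfig(builder.Configuration.GetSection("ReverseProxy"))
    .AddDnsDestinationResolver(options =>
    {
        // Default is 5 minutes; poll more often so scale events
        // and pod restarts are picked up sooner.
        options.RefreshPeriod = TimeSpan.FromSeconds(30);
    });
```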
@MihaZupan Thanks! That's getting us very close. I've added the DestinationResolver, and we also had to add a k8s headless service instead of using our "normal" service as a destination. Now the pod IPs are discoverable and getting set as destinations (as seen from logs added to the DnsDestinationResolver).
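For anyone else hitting this: a headless Service is one with `clusterIP: None`, which makes the service's DNS name resolve to the individual pod IPs rather than a single virtual IP. A minimal sketch (names and ports are placeholders):

```yaml
apiVersion: v1
kind: Service
metadata:
  name: mytestwebapp-headless
spec:
  clusterIP: None      # headless: DNS returns the pod IPs directly
  selector:
    app: mytestwebapp
  ports:
    - port: 5001
      targetPort: 5001
```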
It looks like the last hurdle is that the requests are being routed to "PodIpAddress:443" instead of "PodIpAddress:5001". I'm working on resolving this if you have any advice, and then I think we'll have a complete solution. Thank you for the help!
Update: We'll most likely move forward with simply updating the port to 5001 for the pod IP discovered from the k8s headless service hosts. More testing needed, but so far this solution is working.
@MihaZupan @Tratcher Thanks for the help. Using the DnsDestinationResolver and k8s headless services as our destinations, we're successfully able to discover and load balance traffic to our k8s pods.
However, when the gateway is under load and a pod is restarting, we receive many 502/504 errors. In testing, we sent 2 requests per 100ms and received roughly 20-30 502/504 errors during a rolling pod restart. The system will be under a much heavier load in production.
We've tried configuring the health checks using a combination of passive and active checks, and we also tried a "FirstFail" health check policy (as outlined in the documentation). No matter how tight we make these policies, it seems we can't handle pods cycling as gracefully as our original k8s service setup without YARP (which only produces one or two 504 errors during a pod cycle).
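For reference, a sketch of the cluster-level health check config being described, using YARP's documented `HealthCheck` schema; the intervals and threshold here are illustrative values to tune, not recommendations:

```json
"HealthCheck": {
  "Active": {
    "Enabled": "true",
    "Interval": "00:00:02",
    "Timeout": "00:00:01",
    "Policy": "ConsecutiveFailures",
    "Path": "/health"
  },
  "Passive": {
    "Enabled": "true",
    "Policy": "TransportFailureRate",
    "ReactivationPeriod": "00:00:10"
  }
},
"Metadata": {
  "ConsecutiveFailuresHealthPolicy.Threshold": "1"
}
```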
Do you have any recommendations on how to resolve or improve the problem we're seeing? Also, are there any plans to continue working on or release the Yarp.Kubernetes.Controller package? Thanks for the help, much appreciated.
The 5xx errors are not related to YARP itself, but rather to the workloads behind the Service. There is some delay from when the workload is stopped to when the DNS or k8s Service is updated with that change. Then there's some time before YARP would poll DNS again (or get k8s Service changes) to get the new endpoint collection.
The workload needs to stay running but report an unhealthy Readiness check for a sufficiently long time to be comfortable that YARP, or any other LB fronting the service, has removed that endpoint as a destination.
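Concretely, that usually means a readiness probe plus a preStop delay so the endpoint is withdrawn from DNS before the process exits. A sketch of the relevant pod spec fragment (paths, ports, and timings are placeholders to tune against your DNS refresh period):

```yaml
spec:
  terminationGracePeriodSeconds: 30
  containers:
    - name: mytestwebapp
      readinessProbe:
        httpGet:
          path: /health
          port: 5001
        periodSeconds: 2
      lifecycle:
        preStop:
          exec:
            # Keep serving while the endpoint list and YARP's
            # DNS resolver catch up to the terminating pod.
            command: ["sleep", "15"]
```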
This is a pretty good intro to the topic: https://learnk8s.io/graceful-shutdown
Background
I am attempting to deploy 2 replicas of a simple MyTestWebApp in a Docker Swarm environment using the Round Robin config. The purpose is to gain experience with the Round Robin config before deploying to production. (Each deployment of MyTestWebApp generates a unique app ID, and when a request hits the controller it is logged in the console.)
Expected Behavior
For each request to the endpoint, my YARP implementation will hit one instantiation of MyTestWebApp, then the second instantiation, then back to the first, and so on, in a Round Robin fashion.
The (Possible) Bug
For each request to the endpoint, my YARP implementation hits only one instantiation of MyTestWebApp; no requests hit the second instantiation. If I pause making requests for a period of time (maybe 5 minutes or so), the second instantiation may be hit, but then the first will not be.
My Config
Here is my docker compose file:
Console Logs from each instantiation
From tempwebapp in Container 1:
From tempwebapp in Container 2: