kubernetes / kubernetes

Production-Grade Container Scheduling and Management
https://kubernetes.io
Apache License 2.0

Federated Ingress sends unnecessary cross-region requests due to low maxRPS setting #46645

Closed dgpc closed 6 years ago

dgpc commented 7 years ago

Kubernetes version (use kubectl version):

$ kubectl version --context=dev
Client Version: version.Info{Major:"1", Minor:"6", GitVersion:"v1.6.3", GitCommit:"0480917b552be33e2dba47386e51decb1a211df6", GitTreeState:"clean", BuildDate:"2017-05-10T15:48:59Z", GoVersion:"go1.7.5", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"6", GitVersion:"v1.6.4", GitCommit:"d6f433224538d4f9ca2f7ae19b252e6fcb66a3ae", GitTreeState:"clean", BuildDate:"2017-05-19T18:33:17Z", GoVersion:"go1.7.5", Compiler:"gc", Platform:"linux/amd64"}

Environment:

What happened:

I create a Federated Ingress with clusters in two GCE Regions (us-central1 & europe-west1).

Both clusters are created with balancingMode=rate, maxRPS=1

When more than (1 RPS * num_nodes) is being sent from clients in Europe, traffic appears to be round-robined to both the cluster in the US and the cluster in Europe.
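The spill-over point can be sketched with a simplified model (this is only an illustration of the arithmetic described above, not the actual GCE load-balancer algorithm; the 3-node cluster size is assumed for the example):

```go
package main

import "fmt"

// assumedCapacity models how a RATE-mode GCE backend's capacity is
// estimated: maxRPS per instance times the number of instances.
// Simplified model for illustration only.
func assumedCapacity(maxRPS float64, numNodes int) float64 {
	return maxRPS * float64(numNodes)
}

// spillsOver reports whether offered traffic exceeds the assumed
// capacity, at which point the LB routes the excess to other regions.
func spillsOver(offeredRPS, maxRPS float64, numNodes int) bool {
	return offeredRPS > assumedCapacity(maxRPS, numNodes)
}

func main() {
	// With balancingMode=RATE and maxRPS=1 on an assumed 3-node cluster,
	// anything above 3 RPS from Europe starts round-robining to the US.
	fmt.Println(assumedCapacity(1, 3)) // 3
	fmt.Println(spillsOver(10, 1, 3))  // true
}
```

This is why the threshold is hit almost immediately in practice: with maxRPS=1, a cluster's assumed capacity equals its node count.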

What you expected to happen:

I expect to be able to send most traffic from European clients to the cluster in Europe, and only spill-over to the US when Europe is overloaded. To accomplish this, maxRPS needs to be set to an appropriately large value (based on our load testing), or we need to be able to use the utilization-based mode.

Right now, our clients unnecessarily have to pay a latency penalty due to traffic being sent cross-region, even when we are not close to being overloaded in any region.

Perhaps the Ingress Controller could adopt a maxRPS from an annotation on the deployment, similar to the way it adopts a Health Check from the readinessProbe?
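Something along these lines, for example (the annotation key here is purely hypothetical; no such annotation exists today):

```yaml
apiVersion: extensions/v1beta1
kind: Ingress
metadata:
  name: my-federated-ingress
  annotations:
    # Hypothetical: a per-service capacity hint the controller could
    # copy into the GCE backend's maxRatePerInstance, analogous to how
    # it derives the Health Check from the readinessProbe.
    ingress.kubernetes.io/max-rps-per-instance: "500"
spec:
  backend:
    serviceName: my-service
    servicePort: 80
```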

/kind bug /sig federation

dgpc commented 7 years ago

I think a 1-line code change would make the situation considerably better; there is even a TODO suggesting it: https://github.com/kubernetes/ingress/blob/master/controllers/gce/backends/backends.go#L68:L73

Changing maxRPS from 1 to math.MaxInt64 would mean that under non-overload conditions, traffic will always go to the correct region.

That would have the downside that when overloaded, it wouldn't spill over to other regions, but I think we'd prefer this in the short term.

To have our cake and eat it too, maxRPS has to be set to whatever the actual capacity of the service pods is, which needs to be configured on a per-service or per-deployment level.
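A sketch of what that one-line change would look like, using a stripped-down stand-in for the GCE compute API's Backend type (the real type lives in the Google API client; field names mirror the compute API, but this is illustrative only):

```go
package main

import (
	"fmt"
	"math"
)

// Backend is a stand-in for the GCE compute API's Backend type,
// keeping only the fields relevant to rate-based balancing.
type Backend struct {
	BalancingMode      string
	MaxRatePerInstance float64
}

// newBackend mirrors the shape of the ingress controller's backend
// construction. The controller currently hardcodes a rate of 1; the
// proposed change swaps that for math.MaxInt64 so a region is never
// considered "full" under normal load.
func newBackend() *Backend {
	return &Backend{
		BalancingMode:      "RATE",
		MaxRatePerInstance: math.MaxInt64, // was: 1
	}
}

func main() {
	b := newBackend()
	fmt.Println(b.BalancingMode, b.MaxRatePerInstance >= 1e18)
}
```

The trade-off is exactly as described above: with an effectively infinite rate, the LB will never judge a region to be at capacity, so cross-region spill-over is lost.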

nikhiljindal commented 7 years ago

cc @kubernetes/sig-federation-feature-requests

cc @csbell @nicksardo

nicksardo commented 7 years ago

Hmm, I don't think it's that easy. From my understanding, setting MaxRate or MaxRatePerInstance to MaxInt64 would result in overloading a zone or specific instance, respectively. @dgpc, what do your clusters look like? Are they running in multiple zones in the same region?

Agreed, the ideal solution would be to have a meaningful RPS value. I'm not sure if the long term goal is to have this automatically controlled via K8s or configurable.

Thoughts @thockin?

/assign

nicksardo commented 7 years ago

@dgpc, it would be subtle because it's only ingress traffic overloading a zone/backend. Kube-proxy round-robins traffic to all the pods in the cluster. I haven't tried this, but theoretically, if you set the service's externalTrafficPolicy flag to OnlyLocal, and you're running a few pods on several machines across a few zones, you should see a load disparity between the pods.
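The setting mentioned above looks roughly like this on a Service (the name and selector are placeholders; on Kubernetes 1.6 this was expressed via the beta annotation shown, later promoted to the `spec.externalTrafficPolicy: Local` field):

```yaml
apiVersion: v1
kind: Service
metadata:
  name: my-service
  annotations:
    # Pre-1.7 beta annotation form; later versions express this as
    # spec.externalTrafficPolicy: Local.
    service.beta.kubernetes.io/external-traffic: OnlyLocal
spec:
  type: LoadBalancer
  selector:
    app: my-app
  ports:
  - port: 80
```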

dgpc commented 7 years ago

@nicksardo at the moment, for the service where we are having this issue, we are running clusters in a single zone in each of multiple regions, but at some point we might move to 2 zones per region.

My understanding from experimenting with the maxRPS setting is that setting it to a large value causes traffic to go to the cluster in the correct region, and round-robin across nodes within the cluster. Indeed there is some imbalance because sometimes multiple pods for the service will get scheduled on the same node, though we might be able to avoid this using podAntiAffinity, and in any case this is still better than sending traffic cross-ocean.

thockin commented 7 years ago

As per LB experts, setting it too low results in what you're seeing (the LB thinks your region is swamped and spills over into the next). Setting it too high means you will never spill over unless you have a total outage of a region.

If we want this to be automagic, we need to get to true utilization-based balancing, which is tricky with containers (how do we report per-container utilization?) and breaks the use of ILB. We're working on better solutions, but they are not ready yet.


nicksardo commented 7 years ago

@dgpc, we discussed the options at hand: changing the default RPS value or designing a way to plumb a value to the ingress controller. Changing the default isn't viable without providing a knob to users. While you may be okay with overloading a certain region, some users may want the opposite. Designing a knob quickly becomes complex when considering k8s autoscaling...

We decided to keep the implementation as-is because users in your case are able to manually change the value without the controller overwriting it. As Tim points out, we're working on better solutions for the future.

dgpc commented 7 years ago

The decision not to change the default value makes sense. I hope that at some point you can add a knob for users to make Federated Ingress helpful in a variety of use cases (or do something automagic, but indeed that seems complicated).

In the meantime we'll manage an HTTP(S) Load Balancer manually rather than use Federated Ingress, and add/remove node pools through our rollout automation.

Thanks for looking into this, I'll keep watching this ticket and hope we can migrate back to Federated Ingress at some point in the future.

irfanurrehman commented 6 years ago

This issue was labelled only for sig/multicluster and is thus moved over to kubernetes/federation#17. If this does not seem to be right, please reopen this and notify us @kubernetes/sig-multicluster-misc. /close