heptio / aws-quickstart

AWS Kubernetes cluster via CloudFormation and kubeadm
Apache License 2.0

Unexpectedly bad performance through ELB #147

Closed: smcquay closed this issue 6 years ago

smcquay commented 6 years ago

I was following along with the expose-a-service tutorial and am seeing bad performance through the ELB. I'm ready to admit (and hopefully find) that this is a configuration issue with my AWS account (though I've looked and haven't found anything obviously wrong), but I'm seeing incredibly bad latency getting HTTP responses from a deployed HTTP server.

I've placed the configuration here for validation. It's composed of a single deployment and a service of type LoadBalancer.
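
For anyone who doesn't want to click through, the Service half is essentially the sketch below (reconstructed to match the kubectl describe output further down; the linked files remain the source of truth):

$ cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: Service
metadata:
  name: hw
  namespace: default
spec:
  type: LoadBalancer
  selector:
    app: hw
  ports:
  - port: 8080
    protocol: TCP
EOF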

When I apply those files to my cluster I get the expected number of pods and a single service, which is eventually populated as:

$  kubectl describe svc hw
Name:                     hw
Namespace:                default
Labels:                   <none>
Annotations:              kubectl.kubernetes.io/last-applied-configuration={"apiVersion":"v1","kind":"Service","metadata":{"annotations":{},"name":"hw","namespace":"default"},"spec":{"ports":[{"port":8080,"protocol":"TCP"}],"s...
Selector:                 app=hw
Type:                     LoadBalancer
IP:                       10.110.59.76
LoadBalancer Ingress:     a2dd0c0ce087211e8a67d0266c82d580-1423280686.us-west-1.elb.amazonaws.com
Port:                     <unset>  8080/TCP
TargetPort:               8080/TCP
NodePort:                 <unset>  30342/TCP
Endpoints:                192.168.100.133:8080,192.168.158.176:8080,192.168.26.252:8080 + 7 more...
Session Affinity:         None
External Traffic Policy:  Cluster
Events:
  Type    Reason                Age   From                Message
  ----    ------                ----  ----                -------
  Normal  EnsuringLoadBalancer  11s   service-controller  Ensuring load balancer
  Normal  EnsuredLoadBalancer   9s    service-controller  Ensured load balancer

When I try to hit that URL, though, I get one good, fast response:

$ curl http://a2dd0c0ce087211e8a67d0266c82d580-1423280686.us-west-1.elb.amazonaws.com:8080
{"hostname":"hw-699966676b-fgcrn","version":"v0.0.6"}

And then the next request hangs, with curl eventually getting an empty reply after a minute of waiting:

$ time curl http://a2dd0c0ce087211e8a67d0266c82d580-1423280686.us-west-1.elb.amazonaws.com:8080
curl: (52) Empty reply from server
curl -sS   0.01s user 0.01s system 0% cpu 1:00.18 total
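
For what it's worth, running the same request with verbose output and explicit timeouts makes it easier to see whether it is the TCP connect or the wait for response headers that stalls (timeout values here are arbitrary):

$ curl -v --connect-timeout 5 --max-time 15 http://a2dd0c0ce087211e8a67d0266c82d580-1423280686.us-west-1.elb.amazonaws.com:8080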

When I hit the service from within the cluster I get expected (great) performance:

ubuntu@ip-10-0-13-101:~$ echo "GET http://10.110.59.76:8080" | ./vegeta attack -rate 4096 -duration 1m | ./vegeta report
Requests      [total, rate]            245760, 4096.03
Duration      [total, attack, wait]    1m0.000642423s, 59.999601462s, 1.040961ms
Latencies     [mean, 50, 95, 99, max]  407.573µs, 393.722µs, 473.528µs, 950.884µs, 7.665651ms
Bytes In      [total, mean]            13271040, 54.00
Bytes Out     [total, mean]            0, 0.00
Success       [ratio]                  100.00%
Status Codes  [code:count]             200:245760
Error Set:

What's odd is that if I scale down to 1 pod:

$ kc scale deploy hw --replicas=1

# wait a bit

$ echo "GET http://a2dd0c0ce087211e8a67d0266c82d580-1423280686.us-west-1.elb.amazonaws.com:8080" | vegeta attack -duration 1m | vegeta report
Requests      [total, rate]            3000, 50.02
Duration      [total, attack, wait]    1m0.025564982s, 59.979999s, 45.565982ms
Latencies     [mean, 50, 95, 99, max]  158.258428ms, 10.710434ms, 1.067676745s, 3.45891931s, 6.038674738s
Bytes In      [total, mean]            162000, 54.00
Bytes Out     [total, mean]            0, 0.00
Success       [ratio]                  100.00%
Status Codes  [code:count]             200:3000  
Error Set:

$ kc scale deploy hw --replicas=2                                                                                                                                            

# wait a bit

$ echo "GET http://a2dd0c0ce087211e8a67d0266c82d580-1423280686.us-west-1.elb.amazonaws.com:8080" | vegeta attack -duration 1m | vegeta report
Requests      [total, rate]            3000, 50.02
Duration      [total, attack, wait]    1m18.381401943s, 59.979999s, 18.401402943s
Latencies     [mean, 50, 95, 99, max]  370.025649ms, 11.25912ms, 1.168016052s, 6.360804811s, 31.152443366s
Bytes In      [total, mean]            161676, 53.89
Bytes Out     [total, mean]            0, 0.00
Success       [ratio]                  99.80%
Status Codes  [code:count]             200:2994  0:6
Error Set:
Get http://a2dd0c0ce087211e8a67d0266c82d580-1423280686.us-west-1.elb.amazonaws.com:8080: net/http: timeout awaiting response headers

$ kc scale deploy hw --replicas=30

# wait a bit

$ echo "GET http://a2dd0c0ce087211e8a67d0266c82d580-1423280686.us-west-1.elb.amazonaws.com:8080" | vegeta attack -duration 1m | vegeta report
Requests      [total, rate]            3000, 50.02
Duration      [total, attack, wait]    1m10.062395567s, 59.979997s, 10.082398567s
Latencies     [mean, 50, 95, 99, max]  3.033458969s, 12.112792ms, 30.002734422s, 30.603028408s, 31.016907916s
Bytes In      [total, mean]            147204, 49.07
Bytes Out     [total, mean]            0, 0.00
Success       [ratio]                  90.87%
Status Codes  [code:count]             200:2726  0:274
Error Set:
Get http://a2dd0c0ce087211e8a67d0266c82d580-1423280686.us-west-1.elb.amazonaws.com:8080: net/http: timeout awaiting response headers
jer commented 6 years ago

I had this exact behavior, and it is perfectly reproducible. I eventually noticed that the NodePort was acting funny as well; it is in fact the same behavior described in https://github.com/kubernetes/kubernetes/issues/58908

Running sudo iptables -P FORWARD ACCEPT on every node fixes that issue, and it also resolves the problem at the ELB. Should this be added to the CloudFormation template to make sure that Services work as expected after the Quick Start?
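
If it helps anyone else, checking for and applying the workaround looks roughly like this (the node list is a placeholder; substitute your own hosts):

# A default policy of DROP on the FORWARD chain indicates the problem.
$ sudo iptables -S FORWARD | head -1

# Apply the workaround on every node (hostnames below are placeholders).
$ for host in node-1 node-2 node-3; do
    ssh ubuntu@"$host" sudo iptables -P FORWARD ACCEPT
  done

Note that a policy set this way does not survive a reboot.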

timothysc commented 6 years ago

This one may belong in wardroom.

/cc @craigtracey

jbeda commented 6 years ago

This is really bad -- this hit me during a demo today.

timothysc commented 6 years ago

@jbeda poked all the right people on the upstream issue; it looks like it's rooted there.

timothysc commented 6 years ago

OK, this comment leads me to think a change in the networking setup may have occurred: https://github.com/kubernetes/kubernetes/issues/58908#issuecomment-364302823

detiber commented 6 years ago

While digging through the scripts, config, and manifests that we are applying, I started to think this might be related to a mismatch in the default pod network configuration between kubeadm and Calico. I'm working on validating this now.
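
If anyone wants to check their own cluster in the meantime, something along these lines should show both sides of the comparison (assuming a kubeadm version that uploads its config to the kubeadm-config ConfigMap, and a Calico install from the standard hosted manifest where the pool is set via the CALICO_IPV4POOL_CIDR env var):

# Pod subnet kubeadm was configured with:
$ kubectl -n kube-system get cm kubeadm-config -o yaml | grep -i podSubnet

# Pool CIDR calico-node is using:
$ kubectl -n kube-system get ds calico-node -o yaml | grep -A1 CALICO_IPV4POOL_CIDR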

smcquay commented 6 years ago

@detiber: how do I verify this change? I just tried building a cluster using the template here:

https://s3.amazonaws.com/quickstart-reference/heptio/latest/templates/kubernetes-cluster-with-new-vpc.template

and I'm still seeing the same pathology.

How would I know when this merge has hit "latest"?
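
In the meantime, I assume one way to watch for the republish is to poll the template object's Last-Modified header:

$ curl -sI https://s3.amazonaws.com/quickstart-reference/heptio/latest/templates/kubernetes-cluster-with-new-vpc.template | grep -i last-modified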

detiber commented 6 years ago

@smcquay The PR to update the quickstart was merged yesterday; we are just waiting on Amazon to finish their validation before it is pushed live.