kubernetes / cloud-provider-gcp

cloud-provider-gcp contains several projects used to run Kubernetes in Google Cloud
Apache License 2.0

CCM LoadBalancer flake in 5,000 node job #753

Open BenTheElder opened 1 month ago

BenTheElder commented 1 month ago

In a 5k-node CI job we have a test failure that seems to be caused by the load balancer controller in CCM failing to handle an unexpected GCP API error. (Thanks @danwinship for digging into this here: https://kubernetes.slack.com/archives/CN0K3TE2C/p1723560482393589?thread_ts=1723493683.959229&cid=CN0K3TE2C)

https://prow.k8s.io/view/gs/kubernetes-jenkins/logs/ci-kubernetes-e2e-gce-scale-correctness/1823042209046335488

-> https://storage.googleapis.com/kubernetes-jenkins/logs/ci-kubernetes-e2e-gce-scale-correctness/1823042209046335488/artifacts/master-and-node-logs.link.txt -> https://gcsweb.k8s.io/gcs/k8s-infra-scalability-tests-logs/ci-kubernetes-e2e-gce-scale-correctness/1823042209046335488/ -> https://storage.googleapis.com/k8s-infra-scalability-tests-logs/ci-kubernetes-e2e-gce-scale-correctness/1823042209046335488/gce-scale-cluster-master/cloud-controller-manager.log

Per @danwinship :

The CCM log shows a 502 error from a cloud API at 17:42:42.656992, and then shows

E0812 18:42:37.300236 11 gce_loadbalancer.go:206] Failed to EnsureLoadBalancer(gce-scale-cluster, loadbalancers-9672, affinity-lb-esipp-transition, a1b2bc4622d1041aeabe57d2c40cd9bd, us-east1), err: failed to create forwarding rule for load balancer (a1b2bc4622d1041aeabe57d2c40cd9bd(loadbalancers-9672/affinity-lb-esipp-transition)): context deadline exceeded

an hour later (not clear if that's triggered by the e2e test doing cleanup or a separate identical timeout).

So this looks like cloud-provider-gcp failing to handle an unexpected Google Cloud API error.

/sig scalability
/sig cloud-provider

k8s-ci-robot commented 1 month ago

This issue is currently awaiting triage.

If the repository maintainers determine this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes-sigs/prow](https://github.com/kubernetes-sigs/prow/issues/new?title=Prow%20issue:) repository.

danwinship commented 1 month ago

The CCM log shows a 502 error from a cloud API at 17:42:42.656992

I didn't want to paste the whole thing into slack, but:

E0812 17:42:42.656992      11 gce_loadbalancer.go:206] Failed to EnsureLoadBalancer(gce-scale-cluster, loadbalancers-8109, lb-finalizer, a6fd83aa051064545afde22320536931, us-east1), err: failed to create forwarding rule for load balancer (a6fd83aa051064545afde22320536931(loadbalancers-8109/lb-finalizer)): googleapi: got HTTP response code 502 with body: <!DOCTYPE html>
<html lang=en>
  <meta charset=utf-8>
  <meta name=viewport content="initial-scale=1, minimum-scale=1, width=device-width">
  <title>Error 502 (Server Error)!!1</title>
  <style>
    *{margin:0;padding:0}html,code{font:15px/22px arial,sans-serif}html{background:#fff;color:#222;padding:15px}body{margin:7% auto 0;max-width:390px;min-height:180px;padding:30px 0 15px}* > body{background:url(//www.google.com/images/errors/robot.png) 100% 5px no-repeat;padding-right:205px}p{margin:11px 0 22px;overflow:hidden}ins{color:#777;text-decoration:none}a img{border:0}@media screen and (max-width:772px){body{background:none;margin-top:0;max-width:none;padding-right:0}}#logo{background:url(//www.google.com/images/branding/googlelogo/1x/googlelogo_color_150x54dp.png) no-repeat;margin-left:-5px}@media only screen and (min-resolution:192dpi){#logo{background:url(//www.google.com/images/branding/googlelogo/2x/googlelogo_color_150x54dp.png) no-repeat 0% 0%/100% 100%;-moz-border-image:url(//www.google.com/images/branding/googlelogo/2x/googlelogo_color_150x54dp.png) 0}}@media only screen and (-webkit-min-device-pixel-ratio:2){#logo{background:url(//www.google.com/images/branding/googlelogo/2x/googlelogo_color_150x54dp.png) no-repeat;-webkit-background-size:100% 100%}}#logo{display:inline-block;height:54px;width:150px}
  </style>
  <a href=//www.google.com/><span id=logo aria-label=Google></span></a>
  <p><b>502.</b> <ins>That’s an error.</ins>
  <p>The server encountered a temporary error and could not complete your request.<p>Please try again in 30 seconds.  <ins>That’s all we know.</ins>
E0812 17:42:42.657030      11 controller.go:298] error processing service loadbalancers-8109/lb-finalizer (retrying with exponential backoff): failed to ensure load balancer: failed to create forwarding rule for load balancer (a6fd83aa051064545afde22320536931(loadbalancers-8109/lb-finalizer)): googleapi: got HTTP response code 502 with body: <!DOCTYPE html>
<html lang=en>
  <meta charset=utf-8>
  <meta name=viewport content="initial-scale=1, minimum-scale=1, width=device-width">
  <title>Error 502 (Server Error)!!1</title>
  <style>
    *{margin:0;padding:0}html,code{font:15px/22px arial,sans-serif}html{background:#fff;color:#222;padding:15px}body{margin:7% auto 0;max-width:390px;min-height:180px;padding:30px 0 15px}* > body{background:url(//www.google.com/images/errors/robot.png) 100% 5px no-repeat;padding-right:205px}p{margin:11px 0 22px;overflow:hidden}ins{color:#777;text-decoration:none}a img{border:0}@media screen and (max-width:772px){body{background:none;margin-top:0;max-width:none;padding-right:0}}#logo{background:url(//www.google.com/images/branding/googlelogo/1x/googlelogo_color_150x54dp.png) no-repeat;margin-left:-5px}@media only screen and (min-resolution:192dpi){#logo{background:url(//www.google.com/images/branding/googlelogo/2x/googlelogo_color_150x54dp.png) no-repeat 0% 0%/100% 100%;-moz-border-image:url(//www.google.com/images/branding/googlelogo/2x/googlelogo_color_150x54dp.png) 0}}@media only screen and (-webkit-min-device-pixel-ratio:2){#logo{background:url(//www.google.com/images/branding/googlelogo/2x/googlelogo_color_150x54dp.png) no-repeat;-webkit-background-size:100% 100%}}#logo{display:inline-block;height:54px;width:150px}
  </style>
  <a href=//www.google.com/><span id=logo aria-label=Google></span></a>
  <p><b>502.</b> <ins>That’s an error.</ins>
  <p>The server encountered a temporary error and could not complete your request.<p>Please try again in 30 seconds.  <ins>That’s all we know.</ins>
I0812 17:42:42.657074      11 event.go:389] "Event occurred" object="loadbalancers-8109/lb-finalizer" fieldPath="" kind="Service" apiVersion="v1" type="Warning" reason="SyncLoadBalancerFailed" message=<
    Error syncing load balancer: failed to ensure load balancer: failed to create forwarding rule for load balancer (a6fd83aa051064545afde22320536931(loadbalancers-8109/lb-finalizer)): googleapi: got HTTP response code 502 with body: <!DOCTYPE html>
    <html lang=en>
      <meta charset=utf-8>
      <meta name=viewport content="initial-scale=1, minimum-scale=1, width=device-width">
      <title>Error 502 (Server Error)!!1</title>
      <style>
        *{margin:0;padding:0}html,code{font:15px/22px arial,sans-serif}html{background:#fff;color:#222;padding:15px}body{margin:7% auto 0;max-width:390px;min-height:180px;padding:30px 0 15px}* > body{background:url(//www.google.com/images/errors/robot.png) 100% 5px no-repeat;padding-right:205px}p{margin:11px 0 22px;overflow:hidden}ins{color:#777;text-decoration:none}a img{border:0}@media screen and (max-width:772px){body{background:none;margin-top:0;max-width:none;padding-right:0}}#logo{background:url(//www.google.com/images/branding/googlelogo/1x/googlelogo_color_150x54dp.png) no-repeat;margin-left:-5px}@media only screen and (min-resolution:192dpi){#logo{background:url(//www.google.com/images/branding/googlelogo/2x/googlelogo_color_150x54dp.png) no-repeat 0% 0%/100% 100%;-moz-border-image:url(//www.google.com/images/branding/googlelogo/2x/googlelogo_color_150x54dp.png) 0}}@media only screen and (-webkit-min-device-pixel-ratio:2){#logo{background:url(//www.google.com/images/branding/googlelogo/2x/googlelogo_color_150x54dp.png) no-repeat;-webkit-background-size:100% 100%}}#logo{display:inline-block;height:54px;width:150px}
      </style>
      <a href=//www.google.com/><span id=logo aria-label=Google></span></a>
      <p><b>502.</b> <ins>That’s an error.</ins>
      <p>The server encountered a temporary error and could not complete your request.<p>Please try again in 30 seconds.  <ins>That’s all we know.</ins>
 >

aojea commented 1 month ago

The test has a one-hour timeout for large clusters. It was not able to provision the load balancer for loadbalancers-9672 within that hour, so it failed.

The 502 error from a cloud API at 17:42:42.656992 is from a different load balancer, loadbalancers-8109:

E0812 17:42:42.656992      11 gce_loadbalancer.go:206] Failed to EnsureLoadBalancer(gce-scale-cluster, loadbalancers-8109, lb-finalizer, a6fd83aa051064545afde22320536931, us-east1), err: failed to create forwarding rule for load balancer (a6fd83aa051064545afde22320536931(loadbalancers-8109/lb-finalizer)): googleapi: got HTTP response code 502 with body: <!DOCTYPE html>

At 18:42 the contexts start to be cancelled:

E0812 18:42:37.300236      11 gce_loadbalancer.go:206] Failed to EnsureLoadBalancer(gce-scale-cluster, loadbalancers-9672, affinity-lb-esipp-transition, a1b2bc4622d1041aeabe57d2c40cd9bd, us-east1), err: failed to create forwarding rule for load balancer (a1b2bc4622d1041aeabe57d2c40cd9bd(loadbalancers-9672/affinity-lb-esipp-transition)): context deadline exceeded
E0812 18:42:37.300274      11 controller.go:298] error processing service loadbalancers-9672/affinity-lb-esipp-transition (retrying with exponential backoff): failed to ensure load balancer: failed to create forwarding rule for load balancer (a1b2bc4622d1041aeabe57d2c40cd9bd(loadbalancers-9672/affinity-lb-esipp-transition)): context deadline exceeded

I see someone internally is analysing it. At first sight it seems something got stuck in GCE, i.e. an infra issue.

bowei commented 1 month ago

I'm checking with some people on the internal infra to see if there is anything that is happening that is out of the ordinary.

aojea commented 2 weeks ago

@bowei, independently, can we make the controller more resilient by retrying, or make the failure more evident?

A 1-hour timeout seems very long for a single operation.