kubernetes / kubernetes

Production-Grade Container Scheduling and Management
https://kubernetes.io
Apache License 2.0

[job failure] gci-gke-ingress #55195

Closed spiffxp closed 6 years ago

spiffxp commented 6 years ago

/priority critical-urgent
/priority failing-test
/area platform/gke
/kind bug
/status approved-for-milestone
@kubernetes/sig-network-test-failures FYI @kubernetes/sig-gcp-test-failures

This job has been failing since 2017-11-02. It's on the sig-release-master-blocking dashboard, and prevents us from cutting v1.9.0-alpha.3 (kubernetes/sig-release#27). Is there work ongoing to bring this job back to green?

https://k8s-testgrid.appspot.com/sig-release-master-blocking#gci-gke-ingress

These seem to be the same failures as #55189.

spiffxp commented 6 years ago

ERROR: (gcloud.container.clusters.create) ResponseError: code=403, message=You cannot create more than 3 clusters in zone us-central1. To create more than 3, you must request an increase of your Google Compute Engine quota for region us-central1 to 25 CPUS or more.

Thanks to the fixes for #55189, the job is now oscillating between passing and this failure mode.

abgworrall commented 6 years ago

@krzyzacy, is this a simple quota bump, or a leak, or what?

abgworrall commented 6 years ago

We have a theory about this (we're leaking cluster resources, which are a thing in GKE but not GCE). @krzyzacy will take a look later, and hopefully this job will become stable.

krzyzacy commented 6 years ago

(That doesn't seem to be the case: `gcloud container clusters list` gives me nothing, unless I'm doing this wrong.)

dims commented 6 years ago

@abgworrall @krzyzacy could one of you be the assignee for this bug, please?

abgworrall commented 6 years ago

/assign @krzyzacy ... until the theory is proved false

krzyzacy commented 6 years ago

After https://github.com/kubernetes/test-infra/pull/5548, the boskos pool should gradually clean up ancient leaked clusters. I'll let it soak and check back tomorrow.

krzyzacy commented 6 years ago

/unassign
/sig network

Now it looks like the test is panicking. cc @kubernetes/sig-network-bugs

spiffxp commented 6 years ago

The "test panicked" issue is also affecting gci-gce-ingress. I'll open a separate issue for it shortly and keep this one open.

spiffxp commented 6 years ago

Now tracking this against v1.9.0-beta.1 (https://github.com/kubernetes/sig-release/issues/34)

k8s-github-robot commented 6 years ago

[MILESTONENOTIFIER] Milestone Issue Needs Attention

@spiffxp @kubernetes/sig-gcp-misc @kubernetes/sig-network-misc

Action required: During code slush, issues in the milestone should be in progress. If this issue is not being actively worked on, please remove it from the milestone. If it is being worked on, please add the status/in-progress label so it can be tracked with other in-flight issues.

Note: This issue is marked as priority/critical-urgent, and must be updated every 3 days during code slush.

Example update:

ACK.  In progress
ETA: DD/MM/YYYY
Risks: Complicated fix required
Issue Labels
- `sig/gcp` `sig/network`: Issue will be escalated to these SIGs if needed.
- `priority/critical-urgent`: Never automatically move issue out of a release milestone; continually escalate to contributor and SIG through all available channels.
- `kind/bug`: Fixes a bug discovered during the current release.

MrHohn commented 6 years ago

https://k8s-testgrid.appspot.com/google-gke#gci-gke-ingress

The test is no longer failing; it seems #56128 fixed it.

/close