kubernetes / kubernetes

Production-Grade Container Scheduling and Management
https://kubernetes.io
Apache License 2.0

Namespace controller cannot keep up with e2e namespace deletion rate #86417

Open liggitt opened 4 years ago

liggitt commented 4 years ago

What happened: e2e tests that delete namespaces timed out waiting for namespace cleanup. This was the root cause of https://github.com/kubernetes/kubernetes/issues/86181.

I graphed namespace controller lag times:

https://docs.google.com/spreadsheets/d/1hYxDyvZ9o-3T0WrOJ7LgW-sxQn62fU8RWfDjm02OoLc/edit#gid=1101829529

The namespace controller, as configured, can't keep up with the parallelism of the e2e jobs. For most tests, this doesn't fail the test because the e2e job waits at the end for namespace deletion to complete. For the GCEPD test, namespace deletion is a synchronous part of the test. Depending on where the GCEPD test fell, the controller was sometimes too backed up to finish removing the namespace in time.
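To illustrate why the lag keeps growing rather than levelling off, here is a minimal sketch of a fixed worker pool fed faster than it can drain. It is a toy model, not the namespace controller's code: `simulate_lag` and all of its parameters are hypothetical, and the numbers are made up rather than taken from the spreadsheet above.

```python
from collections import deque


def simulate_lag(arrivals_per_tick, workers, ticks_per_deletion, total_ticks):
    """Toy model: return how long each deletion waited in the queue (in ticks).

    All parameters are illustrative assumptions, not real controller settings.
    """
    queue = deque()               # enqueue timestamps of pending deletions
    busy_until = [0] * workers    # tick at which each worker becomes free
    lags = []
    for tick in range(total_ticks):
        # Parallel e2e tests request namespace deletions.
        for _ in range(arrivals_per_tick):
            queue.append(tick)
        # Each free worker picks up the oldest pending deletion.
        for i in range(workers):
            if busy_until[i] <= tick and queue:
                enqueued_at = queue.popleft()
                lags.append(tick - enqueued_at)
                busy_until[i] = tick + ticks_per_deletion
    return lags


if __name__ == "__main__":
    # Workers keep up: every deletion starts immediately.
    print(simulate_lag(2, 2, 1, 10))
    # Workers fall behind: each later deletion waits longer than the last.
    print(simulate_lag(2, 2, 5, 20))
```

In this model, once arrivals exceed `workers / ticks_per_deletion` per tick, the wait seen by each new deletion grows for as long as the load continues, which matches the pattern of a test that deletes its namespace late in the run being the one that times out.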

Things we could do:

/sig api-machinery /cc @deads2k @msau42

Note that this blocks moving https://github.com/kubernetes/kubernetes/issues/86181 back into the main e2e (at least as written).

fedebongio commented 4 years ago

/cc @logicalhan

jtslear commented 4 years ago

Hello @liggitt and @fedebongio, Bug Triage team here for the 1.18 release. This is a friendly reminder that code freeze is scheduled for 5 March. Is this issue still intended for milestone 1.18?

jtslear commented 4 years ago

Hello @kubernetes/sig-api-machinery-bugs, this issue appears frozen, with no movement since December. Should we push this to the next milestone?

liggitt commented 4 years ago

/milestone clear