/kind flake
/area test
/sig testing
It seems the rc is not garbage collected after calling delete. I do not know why yet; can anyone help?
I1124 06:13:02.907755 19653 garbagecollector.go:408] object [v1/ReplicationController, namespace: gc-cascading-deletion, name: test.rc.1, uid: 8715d1e5-d0de-11e7-8b29-0242ac110002]'s doesn't have an owner, continue on next item
...
--- FAIL: TestCascadingDeletion (67.05s)
testserver.go:100: Starting kube-apiserver on port 39557...
testserver.go:112: Waiting for /healthz to be ok...
garbage_collector_test.go:366: timed out waiting for the condition
/help-wanted
still seeing TestCascadingDeletion failures after #56466 merged
--- FAIL: TestCascadingDeletion (69.08s)
testserver.go:100: Starting kube-apiserver on port 40501...
testserver.go:112: Waiting for /healthz to be ok...
garbage_collector_test.go:366: timed out waiting for the condition
@kubernetes/sig-api-machinery-test-failures
Pretty sure this is a long-standing flake, but I'm not yet sure which issue to consider it a duplicate of. Some previous reports are https://github.com/kubernetes/kubernetes/issues/28713, https://github.com/kubernetes/kubernetes/issues/27460, and https://github.com/kubernetes/kubernetes/issues/25890. The timeout for the failing condition was at some point reduced from 120 to 60 seconds in https://github.com/kubernetes/kubernetes/commit/4d2350632c4c025e1a42495ddbf4bc804bc82607 without any explanation in the commit message.
I'm guessing @caesarxuchao has some more of the history of this one to share with us. A search of prior issues reveals multiple attempted fixes.
P.S., the title and description of this issue seem inaccurate. The failure related to TestCascadingDeletion is at
garbage_collector_test.go:366: timed out waiting for the condition
Not at customresource_discovery_controller as reported.
I do not think the timeout reduction led to more flakiness. I have a thought: the flake happens when the Sync routine runs after the rc is deleted and the Run routine also misses the rc delete event. We could add an explicit sleep after setup starts the two routines, as in the sketch below, to see if the flakes are reduced.
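Roughly what I have in mind (a hypothetical sketch; the worker count, sleep duration, and signatures are from memory, not the actual test code):

```go
// Hypothetical experiment in the integration test setup: give the two
// GC goroutines a head start so their watches are established before
// the test driver starts creating and deleting objects.
go gc.Run(5, stopCh)                                      // the Run routine
go gc.Sync(clientSet.Discovery(), 30*time.Second, stopCh) // the Sync routine
time.Sleep(2 * time.Second) // let both routines catch up before the test proceeds
```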
I think @caesarxuchao has been chasing this one for a long time; I want to take a look with fresh eyes and see if I can find a root cause. I'll let @caesarxuchao decide which prior issue to re-open so we can cut down on duplicate issues.
Good news is I was able to reproduce the issue with stress (took ~300 iterations to trigger).
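For anyone who wants to try it, the reproduction was along these lines (assuming the stress tool from golang.org/x/tools/cmd/stress; the test binary path is illustrative):

```
go test -c -o gc.test ./test/integration/garbagecollector
stress ./gc.test -test.run TestCascadingDeletion
```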
@ironcladlou Previously I found two root causes and fixed them, so I closed the previous issues. I'm not sure what is causing this recent flake.
There’s a reflector backing the shared informer for replicationcontrollers, and sometimes (when this bug manifests), the reflector’s watch loop (ListWatch) is returning due to:
watch error: watch event(ok=%!s(bool=true), expectedType: *v1.ReplicationController): {ERROR &Status{ListMeta:ListMeta{SelfLink:,ResourceVersion:,Continue:,},Status:Failure,Message:too old resource version: 123872 (124014),Reason:Gone,Details:nil,Code:410,}}
and
k8s.io/kubernetes/vendor/k8s.io/client-go/informers/factory.go:86: watch of *v1.ReplicationController ended with: too old resource version: 129521 (129614)
When this happens, the reflector doesn’t restart ListWatch again until the full resync period hits. The net effect is the reflector isn't dispatching the rc events to the informer (and thus the GC), although given enough time (the resync period) I suspect things would self-repair.
This is perhaps somehow related to the watch cache. I first theorized that the replicationcontroller watch cache capacity was being reached sometimes, breaking the watch, but my latest debugging doesn't seem to prove it. The default capacity for replicationcontrollers in the test server setup is 100.
When this happens, the reflector doesn’t restart ListWatch again until the full resync period hits.
It seems it restarts the ListWatch after r.period, which is always set to time.Second (it's not the resync period).
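For context, the relevant loop looks roughly like this (paraphrased from the vendored k8s.io/client-go/tools/cache/reflector.go of that era, not quoted exactly):

```go
// Paraphrase of the Reflector's retry loop: when ListAndWatch returns
// with an error such as 410 "too old resource version", wait.Until
// re-invokes it after r.period, which NewNamedReflector hard-codes to
// time.Second. resyncPeriod only drives store resyncs inside
// ListAndWatch; it does not control how quickly a broken watch is
// re-established.
func (r *Reflector) Run(stopCh <-chan struct{}) {
	wait.Until(func() {
		if err := r.ListAndWatch(stopCh); err != nil {
			utilruntime.HandleError(err)
		}
	}, r.period, stopCh)
}
```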
Good catch! That leads me down another path...
This is the flakiest test in the repo:
Edit: pull-kubernetes-unit is the flakiest job; this is tied as the flakiest test within that job.
this pattern seems common in the failed jobs:
W1219 16:44:47.799199 23324 reflector.go:341] k8s.io/kubernetes/vendor/k8s.io/client-go/informers/factory.go:86: watch of *v1.ReplicationController ended with: too old resource version: 4291 (4860)
followed by no "graph_builder.go:588] GraphBuilder process object" log message for the test.rc.1 replication controller. The only three events we have for test.rc.1 in failed runs:
I1219 16:44:48.199297 23324 graph_builder.go:396] add virtual node.identity: [v1/ReplicationController, namespace: gc-cascading-deletion, name: test.rc.1, uid: ecaed94f-e4db-11e7-88ab-0242ac110002]
I1219 16:44:48.199391 23324 garbagecollector.go:374] processing item [v1/ReplicationController, namespace: gc-cascading-deletion, name: test.rc.1, uid: ecaed94f-e4db-11e7-88ab-0242ac110002]
I1219 16:44:48.209573 23324 garbagecollector.go:411] object [v1/ReplicationController, namespace: gc-cascading-deletion, name: test.rc.1, uid: ecaed94f-e4db-11e7-88ab-0242ac110002]'s doesn't have an owner, continue on next item
Interestingly, the virtual node addition of test.rc.1 is usually within 1 second of the watch dropping, which would be within the window where the watch was down. Opened https://github.com/kubernetes/kubernetes/pull/57419 with debugging logs.
Finally tracked it down. I can consistently recreate this by artificially triggering the "too old resource version" error (which is what we were seeing in all the failed test job logs) on first replication controller watch establishment. That results in the following flow:
time | pod informer | rc informer | test driver | graph builder | attemptToDelete
---- | ------------ | ----------- | ----------- | ------------- | ---------------
0 | list pods, get [] at rv=10 | list rcs, get [] at rv=10 | | |
1 | mark synced | mark synced | | |
2 | | | create rc.test.1 at rv=11 | |
3 | | | create pod.test at rv=12 | |
4 | watch at rv=10 | watch at rv=10, get "too old" error | | |
5 | observe pod.test create at rv=12 | | | |
6 | | | | process pod.test create at rv=12 |
7 | | | | add virtual node for rc.test.1 |
8 | | | | add node for pod.test |
9 | | | | | lookup rc.test.1
10 | | | | | find rc.test.1 at rv=11
11 | | | delete rc at rv=13 | |
12 | | list rcs, get [] at rv=13 | | |
13 | | watch at rv=13 | | |
That leaves us with a node in the GC graph for rc.test.1 that was confirmed by direct lookup in attemptToDelete, but never observed via informer events. Because rc.test.1 was created AND deleted during the one second watch outage, the informer will never deliver a deletion event for that object.
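To trigger the error artificially I wrapped the watch; a sketch of the idea (the helper below is illustrative, not the actual instrumentation used):

```go
import (
	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/fields"
	"k8s.io/apimachinery/pkg/watch"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/cache"
)

// flakyRCListWatch wraps the replicationcontroller ListWatch handed to
// the shared informer so that the first Watch call fails with 410 Gone,
// mimicking the "too old resource version" error and opening the
// reflector's one-second relist window.
func flakyRCListWatch(client kubernetes.Interface, ns string) cache.ListerWatcher {
	inner := cache.NewListWatchFromClient(
		client.CoreV1().RESTClient(), "replicationcontrollers", ns, fields.Everything())
	failedOnce := false
	return &cache.ListWatch{
		ListFunc: inner.List,
		WatchFunc: func(options metav1.ListOptions) (watch.Interface, error) {
			if !failedOnce {
				failedOnce = true
				return nil, apierrors.NewGone("too old resource version: 10 (13)")
			}
			return inner.Watch(options)
		},
	}
}
```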
@liggitt If we explicitly sleep more than 1 second between create and delete, this probability can be reduced.
Where are you thinking we would add the sleep? I am more interested in tracking the virtual nodes that have not been observed via informer more rigorously to avoid them getting orphaned in the graph, rather than trying to time our way out of this bug. Watch re-establishment can take longer than one second (under some circumstances, significantly longer).
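The shape of what I have in mind, as a toy model (a sketch of the direction only, not the actual change):

```go
package main

import "fmt"

// Toy model of the "track unobserved virtual nodes" idea: nodes added
// to the GC graph speculatively stay "virtual" until a real informer
// event observes the object; a virtual node whose object is already
// gone gets an explicitly synthesized delete instead of waiting for a
// watch event that will never arrive.
type node struct {
	identity string
	virtual  bool
}

type graph struct {
	nodes map[string]*node
}

// observe is driven by real informer add/update events.
func (g *graph) observe(id string) {
	if n, ok := g.nodes[id]; ok {
		n.virtual = false
	}
}

// attemptToDelete models the GC worker; exists stands in for a direct
// GET against the API server.
func (g *graph) attemptToDelete(id string, exists bool) {
	n, ok := g.nodes[id]
	if !ok {
		return
	}
	if !exists && n.virtual {
		// rc.test.1's case: created and deleted inside the watch
		// outage, so no informer event will ever remove this node.
		delete(g.nodes, id)
		fmt.Printf("synthesized delete for unobserved node %s\n", id)
	}
}

func main() {
	g := &graph{nodes: map[string]*node{
		"rc.test.1": {identity: "rc.test.1", virtual: true},
	}}
	g.attemptToDelete("rc.test.1", false) // object already gone
}
```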
Opened https://github.com/kubernetes/kubernetes/pull/57503 with fix
Is this a BUG REPORT or FEATURE REQUEST?:
/kind bug
What happened: https://k8s-gubernator.appspot.com/build/kubernetes-jenkins/pr-logs/pull/56097/pull-kubernetes-unit/67068/
What you expected to happen:
How to reproduce it (as minimally and precisely as possible):
Anything else we need to know?:
E1120 23:43:24.614641 19542 customresource_discovery_controller.go:155] timed out waiting for caches to sync
This should not occur unless stopCh is closed, but there are no further log lines. And from the CI log, we can see the kube-apiserver went up and down many times, so I do not know why.
Environment:
- Kubernetes version (use `kubectl version`):
- Kernel (e.g. `uname -a`):