Closed by mm4tt 4 years ago
/assign
Looking further at this, it seems that the number of lease LIST calls from kube-controller-manager increased significantly about a week ago in large-scale tests (both GCE and Kubemark).
GCE: [graph]
Kubemark: [graph]
This was caused by https://github.com/kubernetes/kubernetes/pull/83520, which as a side effect enabled pagination of LIST calls when starting/restarting watchers. In some cases we're getting 400 errors from the apiserver for these calls, which results in the symptoms described above, i.e. nodes becoming not-ready and pods stuck in a running-but-not-ready state for some time.
GCE: [graph]
Kubemark: [graph]
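For illustration, here is a minimal sketch (not the reflector code itself) of the request pattern behind the logs below, written against a recent client-go (method signatures differ slightly in the 1.17-era code); it assumes a reachable cluster with more than one Lease object, so that the first page returns a continue token:

```go
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(config)
	leases := client.CoordinationV1().Leases(metav1.NamespaceAll)

	// First page succeeds (the 200s in the logs below).
	first, err := leases.List(context.TODO(), metav1.ListOptions{Limit: 1})
	if err != nil {
		panic(err)
	}

	// The buggy pager carried a resourceVersion over to the next page
	// together with the continue token; the apiserver rejects that
	// combination (the 400s in the logs below).
	_, err = leases.List(context.TODO(), metav1.ListOptions{
		Limit:           1,
		Continue:        first.Continue,
		ResourceVersion: first.ResourceVersion,
	})
	fmt.Println(err) // "specifying resource version is not allowed when using continue"
}
```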
There are actually two issues here:
This will be addressed in https://github.com/kubernetes/kubernetes/pull/85219
/milestone v1.17
Logs from when the 400 errors occurred
api-server
I1108 06:04:52.293354 1 httplog.go:90] GET /apis/coordination.k8s.io/v1/leases?limit=500&resourceVersion=15934785: (36.83185ms) 200 [kube-controller-manager/v1.18.0 (linux/amd64) kubernetes/62f66ea/shared-informers [::1]:40694]
I1108 06:04:52.297772 1 httplog.go:90] GET /apis/coordination.k8s.io/v1/leases?continue=eyJ2IjoibWV0YS5rOHMuaW8vdjEiLCJydiI6MTU5MzQ3ODUsInN0YXJ0Ijoia3ViZS1ub2RlLWxlYXNlL2hvbGxvdy1ub2RlLTV2OW45XHUwMDAwIn0&limit=500&resourceVersion=15934785: (722.845µs) 400 [kube-controller-manager/v1.18.0 (linux/amd64) kubernetes/62f66ea/shared-informers [::1]:40694]
I1108 06:04:53.326987 1 httplog.go:90] GET /apis/coordination.k8s.io/v1/leases?limit=500&resourceVersion=15934785: (24.957232ms) 200 [kube-controller-manager/v1.18.0 (linux/amd64) kubernetes/62f66ea/shared-informers [::1]:40694]
I1108 06:04:53.330431 1 httplog.go:90] GET /apis/coordination.k8s.io/v1/leases?continue=eyJ2IjoibWV0YS5rOHMuaW8vdjEiLCJydiI6MTU5MzQ3ODUsInN0YXJ0Ijoia3ViZS1ub2RlLWxlYXNlL2hvbGxvdy1ub2RlLTV2OW45XHUwMDAwIn0&limit=500&resourceVersion=15934785: (652.565µs) 400 [kube-controller-manager/v1.18.0 (linux/amd64) kubernetes/62f66ea/shared-informers [::1]:40694]
I1108 06:04:54.345296 1 httplog.go:90] GET /apis/coordination.k8s.io/v1/leases?limit=500&resourceVersion=15934785: (12.418055ms) 200 [kube-controller-manager/v1.18.0 (linux/amd64) kubernetes/62f66ea/shared-informers [::1]:40694]
I1108 06:04:54.347654 1 httplog.go:90] GET /apis/coordination.k8s.io/v1/leases?continue=eyJ2IjoibWV0YS5rOHMuaW8vdjEiLCJydiI6MTU5MzQ3ODUsInN0YXJ0Ijoia3ViZS1ub2RlLWxlYXNlL2hvbGxvdy1ub2RlLTV2OW45XHUwMDAwIn0&limit=500&resourceVersion=15934785: (327.314µs) 400 [kube-controller-manager/v1.18.0 (linux/amd64) kubernetes/62f66ea/shared-informers [::1]:40694]
I1108 06:04:55.358410 1 httplog.go:90] GET /apis/coordination.k8s.io/v1/leases?limit=500&resourceVersion=15934785: (9.529971ms) 200 [kube-controller-manager/v1.18.0 (linux/amd64) kubernetes/62f66ea/shared-informers [::1]:40694]
I1108 06:04:55.360219 1 httplog.go:90] GET /apis/coordination.k8s.io/v1/leases?continue=eyJ2IjoibWV0YS5rOHMuaW8vdjEiLCJydiI6MTU5MzQ3ODUsInN0YXJ0Ijoia3ViZS1ub2RlLWxlYXNlL2hvbGxvdy1ub2RlLTV2OW45XHUwMDAwIn0&limit=500&resourceVersion=15934785: (309.114µs) 400 [kube-controller-manager/v1.18.0 (linux/amd64) kubernetes/62f66ea/shared-informers [::1]:40694]
I1108 06:04:56.396085 1 httplog.go:90] GET /apis/coordination.k8s.io/v1/leases?limit=500&resourceVersion=15934785: (29.282981ms) 200 [kube-controller-manager/v1.18.0 (linux/amd64) kubernetes/62f66ea/shared-informers [::1]:40694]
I1108 06:04:56.417998 1 httplog.go:90] GET /apis/coordination.k8s.io/v1/leases?continue=eyJ2IjoibWV0YS5rOHMuaW8vdjEiLCJydiI6MTU5MzQ3ODUsInN0YXJ0Ijoia3ViZS1ub2RlLWxlYXNlL2hvbGxvdy1ub2RlLTV2OW45XHUwMDAwIn0&limit=500&resourceVersion=15934785: (852.91µs) 400 [kube-controller-manager/v1.18.0 (linux/amd64) kubernetes/62f66ea/shared-informers [::1]:40694]
I1108 06:04:57.430573 1 httplog.go:90] GET /apis/coordination.k8s.io/v1/leases?limit=500&resourceVersion=15934785: (10.74887ms) 200 [kube-controller-manager/v1.18.0 (linux/amd64) kubernetes/62f66ea/shared-informers [::1]:40694]
I1108 06:04:57.432807 1 httplog.go:90] GET /apis/coordination.k8s.io/v1/leases?continue=eyJ2IjoibWV0YS5rOHMuaW8vdjEiLCJydiI6MTU5MzQ3ODUsInN0YXJ0Ijoia3ViZS1ub2RlLWxlYXNlL2hvbGxvdy1ub2RlLTV2OW45XHUwMDAwIn0&limit=500&resourceVersion=15934785: (393.761µs) 400 [kube-controller-manager/v1.18.0 (linux/amd64) kubernetes/62f66ea/shared-informers [::1]:40694]
I1108 06:04:58.444753 1 httplog.go:90] GET /apis/coordination.k8s.io/v1/leases?limit=500&resourceVersion=15934785: (10.885615ms) 200 [kube-controller-manager/v1.18.0 (linux/amd64) kubernetes/62f66ea/shared-informers [::1]:40694]
I1108 06:04:58.446985 1 httplog.go:90] GET /apis/coordination.k8s.io/v1/leases?continue=eyJ2IjoibWV0YS5rOHMuaW8vdjEiLCJydiI6MTU5MzQ3ODUsInN0YXJ0Ijoia3ViZS1ub2RlLWxlYXNlL2hvbGxvdy1ub2RlLTV2OW45XHUwMDAwIn0&limit=500&resourceVersion=15934785: (477.9µs) 400 [kube-controller-manager/v1.18.0 (linux/amd64) kubernetes/62f66ea/shared-informers [::1]:40694]
I1108 06:04:59.461802 1 httplog.go:90] GET /apis/coordination.k8s.io/v1/leases?limit=500&resourceVersion=15934785: (12.716812ms) 200 [kube-controller-manager/v1.18.0 (linux/amd64) kubernetes/62f66ea/shared-informers [::1]:40694]
I1108 06:04:59.463989 1 httplog.go:90] GET /apis/coordination.k8s.io/v1/leases?continue=eyJ2IjoibWV0YS5rOHMuaW8vdjEiLCJydiI6MTU5MzQ3ODUsInN0YXJ0Ijoia3ViZS1ub2RlLWxlYXNlL2hvbGxvdy1ub2RlLTV2OW45XHUwMDAwIn0&limit=500&resourceVersion=15934785: (353.675µs) 400 [kube-controller-manager/v1.18.0 (linux/amd64) kubernetes/62f66ea/shared-informers [::1]:40694]
I1108 06:05:00.476768 1 httplog.go:90] GET /apis/coordination.k8s.io/v1/leases?limit=500&resourceVersion=15934785: (11.58891ms) 200 [kube-controller-manager/v1.18.0 (linux/amd64) kubernetes/62f66ea/shared-informers [::1]:40694]
I1108 06:05:00.479503 1 httplog.go:90] GET /apis/coordination.k8s.io/v1/leases?continue=eyJ2IjoibWV0YS5rOHMuaW8vdjEiLCJydiI6MTU5MzQ3ODUsInN0YXJ0Ijoia3ViZS1ub2RlLWxlYXNlL2hvbGxvdy1ub2RlLTV2OW45XHUwMDAwIn0&limit=500&resourceVersion=15934785: (459.587µs) 400 [kube-controller-manager/v1.18.0 (linux/amd64) kubernetes/62f66ea/shared-informers [::1]:40694]
I1108 06:05:01.501615 1 httplog.go:90] GET /apis/coordination.k8s.io/v1/leases?limit=500&resourceVersion=15934785: (16.150994ms) 200 [kube-controller-manager/v1.18.0 (linux/amd64) kubernetes/62f66ea/shared-informers [::1]:40694]
I1108 06:05:01.504512 1 httplog.go:90] GET /apis/coordination.k8s.io/v1/leases?continue=eyJ2IjoibWV0YS5rOHMuaW8vdjEiLCJydiI6MTU5MzQ3ODUsInN0YXJ0Ijoia3ViZS1ub2RlLWxlYXNlL2hvbGxvdy1ub2RlLTV2OW45XHUwMDAwIn0&limit=500&resourceVersion=15934785: (492.404µs) 400 [kube-controller-manager/v1.18.0 (linux/amd64) kubernetes/62f66ea/shared-informers [::1]:40694]
I1108 06:05:02.520808 1 httplog.go:90] GET /apis/coordination.k8s.io/v1/leases?limit=500&resourceVersion=15934785: (11.846343ms) 200 [kube-controller-manager/v1.18.0 (linux/amd64) kubernetes/62f66ea/shared-informers [::1]:40694]
I1108 06:05:02.523173 1 httplog.go:90] GET /apis/coordination.k8s.io/v1/leases?continue=eyJ2IjoibWV0YS5rOHMuaW8vdjEiLCJydiI6MTU5MzQ3ODUsInN0YXJ0Ijoia3ViZS1ub2RlLWxlYXNlL2hvbGxvdy1ub2RlLTV2OW45XHUwMDAwIn0&limit=500&resourceVersion=15934785: (306.726µs) 400 [kube-controller-manager/v1.18.0 (linux/amd64) kubernetes/62f66ea/shared-informers [::1]:40694]
I1108 06:05:03.541045 1 httplog.go:90] GET /apis/coordination.k8s.io/v1/leases?limit=500&resourceVersion=15934785: (13.59499ms) 200 [kube-controller-manager/v1.18.0 (linux/amd64) kubernetes/62f66ea/shared-informers [::1]:40694]
I1108 06:05:03.542988 1 httplog.go:90] GET /apis/coordination.k8s.io/v1/leases?continue=eyJ2IjoibWV0YS5rOHMuaW8vdjEiLCJydiI6MTU5MzQ3ODUsInN0YXJ0Ijoia3ViZS1ub2RlLWxlYXNlL2hvbGxvdy1ub2RlLTV2OW45XHUwMDAwIn0&limit=500&resourceVersion=15934785: (360.088µs) 400 [kube-controller-manager/v1.18.0 (linux/amd64) kubernetes/62f66ea/shared-informers [::1]:40694]
I1108 06:05:04.553308 1 httplog.go:90] GET /apis/coordination.k8s.io/v1/leases?limit=500&resourceVersion=15934785: (9.340846ms) 200 [kube-controller-manager/v1.18.0 (linux/amd64) kubernetes/62f66ea/shared-informers [::1]:40694]
I1108 06:05:04.555247 1 httplog.go:90] GET /apis/coordination.k8s.io/v1/leases?continue=eyJ2IjoibWV0YS5rOHMuaW8vdjEiLCJydiI6MTU5MzQ3ODUsInN0YXJ0Ijoia3ViZS1ub2RlLWxlYXNlL2hvbGxvdy1ub2RlLTV2OW45XHUwMDAwIn0&limit=500&resourceVersion=15934785: (303.794µs) 400 [kube-controller-manager/v1.18.0 (linux/amd64) kubernetes/62f66ea/shared-informers [::1]:40694]
I1108 06:05:05.567381 1 httplog.go:90] GET /apis/coordination.k8s.io/v1/leases?limit=500&resourceVersion=15934785: (11.061015ms) 200 [kube-controller-manager/v1.18.0 (linux/amd64) kubernetes/62f66ea/shared-informers [::1]:40694]
I1108 06:05:05.569470 1 httplog.go:90] GET /apis/coordination.k8s.io/v1/leases?continue=eyJ2IjoibWV0YS5rOHMuaW8vdjEiLCJydiI6MTU5MzQ3ODUsInN0YXJ0Ijoia3ViZS1ub2RlLWxlYXNlL2hvbGxvdy1ub2RlLTV2OW45XHUwMDAwIn0&limit=500&resourceVersion=15934785: (347.25µs) 400 [kube-controller-manager/v1.18.0 (linux/amd64) kubernetes/62f66ea/shared-informers [::1]:40694]
I1108 06:05:06.580844 1 httplog.go:90] GET /apis/coordination.k8s.io/v1/leases?limit=500&resourceVersion=15934785: (10.474488ms) 200 [kube-controller-manager/v1.18.0 (linux/amd64) kubernetes/62f66ea/shared-informers [::1]:40694]
I1108 06:05:06.582638 1 httplog.go:90] GET /apis/coordination.k8s.io/v1/leases?continue=eyJ2IjoibWV0YS5rOHMuaW8vdjEiLCJydiI6MTU5MzQ3ODUsInN0YXJ0Ijoia3ViZS1ub2RlLWxlYXNlL2hvbGxvdy1ub2RlLTV2OW45XHUwMDAwIn0&limit=500&resourceVersion=15934785: (404.003µs) 400 [kube-controller-manager/v1.18.0 (linux/amd64) kubernetes/62f66ea/shared-informers [::1]:40694]
I1108 06:05:07.595621 1 httplog.go:90] GET /apis/coordination.k8s.io/v1/leases?limit=500&resourceVersion=15934785: (11.964276ms) 200 [kube-controller-manager/v1.18.0 (linux/amd64) kubernetes/62f66ea/shared-informers [::1]:40694]
I1108 06:05:07.597837 1 httplog.go:90] GET /apis/coordination.k8s.io/v1/leases?continue=eyJ2IjoibWV0YS5rOHMuaW8vdjEiLCJydiI6MTU5MzQ3ODUsInN0YXJ0Ijoia3ViZS1ub2RlLWxlYXNlL2hvbGxvdy1ub2RlLTV2OW45XHUwMDAwIn0&limit=500&resourceVersion=15934785: (325.082µs) 400 [kube-controller-manager/v1.18.0 (linux/amd64) kubernetes/62f66ea/shared-informers [::1]:40694]
I1108 06:05:08.610540 1 httplog.go:90] GET /apis/coordination.k8s.io/v1/leases?limit=500&resourceVersion=15934785: (11.765717ms) 200 [kube-controller-manager/v1.18.0 (linux/amd64) kubernetes/62f66ea/shared-informers [::1]:40694]
I1108 06:05:08.612361 1 httplog.go:90] GET /apis/coordination.k8s.io/v1/leases?continue=eyJ2IjoibWV0YS5rOHMuaW8vdjEiLCJydiI6MTU5MzQ3ODUsInN0YXJ0Ijoia3ViZS1ub2RlLWxlYXNlL2hvbGxvdy1ub2RlLTV2OW45XHUwMDAwIn0&limit=500&resourceVersion=15934785: (289.112µs) 400 [kube-controller-manager/v1.18.0 (linux/amd64) kubernetes/62f66ea/shared-informers [::1]:40694]
I1108 06:05:09.622875 1 httplog.go:90] GET /apis/coordination.k8s.io/v1/leases?limit=500&resourceVersion=15934785: (9.529531ms) 200 [kube-controller-manager/v1.18.0 (linux/amd64) kubernetes/62f66ea/shared-informers [::1]:40694]
I1108 06:05:09.624799 1 httplog.go:90] GET /apis/coordination.k8s.io/v1/leases?continue=eyJ2IjoibWV0YS5rOHMuaW8vdjEiLCJydiI6MTU5MzQ3ODUsInN0YXJ0Ijoia3ViZS1ub2RlLWxlYXNlL2hvbGxvdy1ub2RlLTV2OW45XHUwMDAwIn0&limit=500&resourceVersion=15934785: (349.58µs) 400 [kube-controller-manager/v1.18.0 (linux/amd64) kubernetes/62f66ea/shared-informers [::1]:40694]
I1108 06:05:10.637293 1 httplog.go:90] GET /apis/coordination.k8s.io/v1/leases?limit=500&resourceVersion=15934785: (11.559244ms) 200 [kube-controller-manager/v1.18.0 (linux/amd64) kubernetes/62f66ea/shared-informers [::1]:40694]
I1108 06:05:10.639139 1 httplog.go:90] GET /apis/coordination.k8s.io/v1/leases?continue=eyJ2IjoibWV0YS5rOHMuaW8vdjEiLCJydiI6MTU5MzQ3ODUsInN0YXJ0Ijoia3ViZS1ub2RlLWxlYXNlL2hvbGxvdy1ub2RlLTV2OW45XHUwMDAwIn0&limit=500&resourceVersion=15934785: (324.78µs) 400 [kube-controller-manager/v1.18.0 (linux/amd64) kubernetes/62f66ea/shared-informers [::1]:40694]
I1108 06:05:11.650922 1 httplog.go:90] GET /apis/coordination.k8s.io/v1/leases?limit=500&resourceVersion=15934785: (10.646368ms) 200 [kube-controller-manager/v1.18.0 (linux/amd64) kubernetes/62f66ea/shared-informers [::1]:40694]
I1108 06:05:11.652891 1 httplog.go:90] GET /apis/coordination.k8s.io/v1/leases?continue=eyJ2IjoibWV0YS5rOHMuaW8vdjEiLCJydiI6MTU5MzQ3ODUsInN0YXJ0Ijoia3ViZS1ub2RlLWxlYXNlL2hvbGxvdy1ub2RlLTV2OW45XHUwMDAwIn0&limit=500&resourceVersion=15934785: (444.045µs) 400 [kube-controller-manager/v1.18.0 (linux/amd64) kubernetes/62f66ea/shared-informers [::1]:40694]
I1108 06:05:12.663710 1 httplog.go:90] GET /apis/coordination.k8s.io/v1/leases?limit=500&resourceVersion=15934785: (9.808894ms) 200 [kube-controller-manager/v1.18.0 (linux/amd64) kubernetes/62f66ea/shared-informers [::1]:40694]
I1108 06:05:12.665631 1 httplog.go:90] GET /apis/coordination.k8s.io/v1/leases?continue=eyJ2IjoibWV0YS5rOHMuaW8vdjEiLCJydiI6MTU5MzQ3ODUsInN0YXJ0Ijoia3ViZS1ub2RlLWxlYXNlL2hvbGxvdy1ub2RlLTV2OW45XHUwMDAwIn0&limit=500&resourceVersion=15934785: (306.705µs) 400 [kube-controller-manager/v1.18.0 (linux/amd64) kubernetes/62f66ea/shared-informers [::1]:40694]
I1108 06:05:13.675961 1 httplog.go:90] GET /apis/coordination.k8s.io/v1/leases?limit=500&resourceVersion=15934785: (9.436367ms) 200 [kube-controller-manager/v1.18.0 (linux/amd64) kubernetes/62f66ea/shared-informers [::1]:40694]
I1108 06:05:13.677859 1 httplog.go:90] GET /apis/coordination.k8s.io/v1/leases?continue=eyJ2IjoibWV0YS5rOHMuaW8vdjEiLCJydiI6MTU5MzQ3ODUsInN0YXJ0Ijoia3ViZS1ub2RlLWxlYXNlL2hvbGxvdy1ub2RlLTV2OW45XHUwMDAwIn0&limit=500&resourceVersion=15934785: (347.633µs) 400 [kube-controller-manager/v1.18.0 (linux/amd64) kubernetes/62f66ea/shared-informers [::1]:40694]
kube-controller-manager
E1108 06:04:52.300415 1 reflector.go:156] k8s.io/client-go/informers/factory.go:135: Failed to list *v1.Lease: specifying resource version is not allowed when using continue
E1108 06:04:53.332144 1 reflector.go:156] k8s.io/client-go/informers/factory.go:135: Failed to list *v1.Lease: specifying resource version is not allowed when using continue
E1108 06:04:54.347951 1 reflector.go:156] k8s.io/client-go/informers/factory.go:135: Failed to list *v1.Lease: specifying resource version is not allowed when using continue
E1108 06:04:55.360483 1 reflector.go:156] k8s.io/client-go/informers/factory.go:135: Failed to list *v1.Lease: specifying resource version is not allowed when using continue
E1108 06:04:56.419097 1 reflector.go:156] k8s.io/client-go/informers/factory.go:135: Failed to list *v1.Lease: specifying resource version is not allowed when using continue
E1108 06:04:57.433258 1 reflector.go:156] k8s.io/client-go/informers/factory.go:135: Failed to list *v1.Lease: specifying resource version is not allowed when using continue
E1108 06:04:58.447308 1 reflector.go:156] k8s.io/client-go/informers/factory.go:135: Failed to list *v1.Lease: specifying resource version is not allowed when using continue
E1108 06:04:59.464365 1 reflector.go:156] k8s.io/client-go/informers/factory.go:135: Failed to list *v1.Lease: specifying resource version is not allowed when using continue
E1108 06:05:00.480002 1 reflector.go:156] k8s.io/client-go/informers/factory.go:135: Failed to list *v1.Lease: specifying resource version is not allowed when using continue
E1108 06:05:01.508197 1 reflector.go:156] k8s.io/client-go/informers/factory.go:135: Failed to list *v1.Lease: specifying resource version is not allowed when using continue
E1108 06:05:02.523549 1 reflector.go:156] k8s.io/client-go/informers/factory.go:135: Failed to list *v1.Lease: specifying resource version is not allowed when using continue
E1108 06:05:03.543273 1 reflector.go:156] k8s.io/client-go/informers/factory.go:135: Failed to list *v1.Lease: specifying resource version is not allowed when using continue
E1108 06:05:04.555681 1 reflector.go:156] k8s.io/client-go/informers/factory.go:135: Failed to list *v1.Lease: specifying resource version is not allowed when using continue
E1108 06:05:05.569812 1 reflector.go:156] k8s.io/client-go/informers/factory.go:135: Failed to list *v1.Lease: specifying resource version is not allowed when using continue
E1108 06:05:06.582999 1 reflector.go:156] k8s.io/client-go/informers/factory.go:135: Failed to list *v1.Lease: specifying resource version is not allowed when using continue
E1108 06:05:07.598156 1 reflector.go:156] k8s.io/client-go/informers/factory.go:135: Failed to list *v1.Lease: specifying resource version is not allowed when using continue
E1108 06:05:08.612691 1 reflector.go:156] k8s.io/client-go/informers/factory.go:135: Failed to list *v1.Lease: specifying resource version is not allowed when using continue
E1108 06:05:09.625121 1 reflector.go:156] k8s.io/client-go/informers/factory.go:135: Failed to list *v1.Lease: specifying resource version is not allowed when using continue
E1108 06:05:10.639433 1 reflector.go:156] k8s.io/client-go/informers/factory.go:135: Failed to list *v1.Lease: specifying resource version is not allowed when using continue
E1108 06:05:11.653288 1 reflector.go:156] k8s.io/client-go/informers/factory.go:135: Failed to list *v1.Lease: specifying resource version is not allowed when using continue
E1108 06:05:12.665960 1 reflector.go:156] k8s.io/client-go/informers/factory.go:135: Failed to list *v1.Lease: specifying resource version is not allowed when using continue
E1108 06:05:13.678201 1 reflector.go:156] k8s.io/client-go/informers/factory.go:135: Failed to list *v1.Lease: specifying resource version is not allowed when using continue
E1108 06:05:14.691691 1 reflector.go:156] k8s.io/client-go/informers/factory.go:135: Failed to list *v1.Lease: specifying resource version is not allowed when using continue
E1108 06:05:15.705844 1 reflector.go:156] k8s.io/client-go/informers/factory.go:135: Failed to list *v1.Lease: specifying resource version is not allowed when using continue
E1108 06:05:16.719646 1 reflector.go:156] k8s.io/client-go/informers/factory.go:135: Failed to list *v1.Lease: specifying resource version is not allowed when using continue
E1108 06:05:17.731662 1 reflector.go:156] k8s.io/client-go/informers/factory.go:135: Failed to list *v1.Lease: specifying resource version is not allowed when using continue
E1108 06:05:18.744551 1 reflector.go:156] k8s.io/client-go/informers/factory.go:135: Failed to list *v1.Lease: specifying resource version is not allowed when using continue
E1108 06:05:19.757044 1 reflector.go:156] k8s.io/client-go/informers/factory.go:135: Failed to list *v1.Lease: specifying resource version is not allowed when using continue
E1108 06:05:20.768983 1 reflector.go:156] k8s.io/client-go/informers/factory.go:135: Failed to list *v1.Lease: specifying resource version is not allowed when using continue
E1108 06:05:21.783682 1 reflector.go:156] k8s.io/client-go/informers/factory.go:135: Failed to list *v1.Lease: specifying resource version is not allowed when using continue
E1108 06:05:22.796288 1 reflector.go:156] k8s.io/client-go/informers/factory.go:135: Failed to list *v1.Lease: specifying resource version is not allowed when using continue
E1108 06:05:23.809406 1 reflector.go:156] k8s.io/client-go/informers/factory.go:135: Failed to list *v1.Lease: specifying resource version is not allowed when using continue
E1108 06:05:24.827288 1 reflector.go:156] k8s.io/client-go/informers/factory.go:135: Failed to list *v1.Lease: specifying resource version is not allowed when using continue
E1108 06:05:25.840283 1 reflector.go:156] k8s.io/client-go/informers/factory.go:135: Failed to list *v1.Lease: specifying resource version is not allowed when using continue
E1108 06:05:27.204398 1 reflector.go:156] k8s.io/client-go/informers/factory.go:135: Failed to list *v1.Lease: specifying resource version is not allowed when using continue
E1108 06:05:28.521856 1 reflector.go:156] k8s.io/client-go/informers/factory.go:135: Failed to list *v1.Lease: specifying resource version is not allowed when using continue
So what happens here is:
1. Watch in the reflector ends with an error different than 410, so the lastSyncResourceVersion doesn't get reset to "".
2. The reflector starts a new paginated LIST with ResourceVersion set to lastSyncResourceVersion.
3. The first page succeeds, but the request for the next page carries both the Continue token (got from the first page) and the ResourceVersion (lastSyncResourceVersion), so the apiserver rejects it with 400 due to "specifying resource version is not allowed when using continue".
4. On this error the lastSyncResourceVersion doesn't get cleared anywhere, so we go back to 2 and the 400 error keeps repeating until we get 410 (resourceVersionTooOld, on the next etcd compaction) and the lastSyncResourceVersion is set to "".
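To make step 3 concrete, here is a simplified sketch of the pager loop (illustrative only; the real code lives in k8s.io/client-go/tools/pager and differs in details). The essence of the fix is the single line that clears ResourceVersion once a continue token is in play:

```go
package pagersketch

import (
	"context"

	"k8s.io/apimachinery/pkg/api/meta"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/runtime"
)

// ListPageFunc mirrors the page-fetching callback the real pager uses.
type ListPageFunc func(ctx context.Context, opts metav1.ListOptions) (runtime.Object, error)

func listAllPages(ctx context.Context, page ListPageFunc, options metav1.ListOptions) ([]runtime.Object, error) {
	var items []runtime.Object
	for {
		obj, err := page(ctx, options)
		if err != nil {
			return nil, err
		}
		listMeta, err := meta.ListAccessor(obj)
		if err != nil {
			return nil, err
		}
		pageItems, err := meta.ExtractList(obj)
		if err != nil {
			return nil, err
		}
		items = append(items, pageItems...)
		if listMeta.GetContinue() == "" {
			return items, nil
		}
		options.Continue = listMeta.GetContinue()
		// The bug: without the next line, the ResourceVersion from step 2
		// (lastSyncResourceVersion) stays set on every subsequent page and
		// the apiserver answers each one with 400.
		options.ResourceVersion = ""
	}
}
```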
This was indeed introduced by https://github.com/kubernetes/kubernetes/pull/83520, but the actual bug is in the pager, which sets both continue and resourceVersion on the subsequent pages while it should set only Continue.
I will send a PR to fix it today.
@jpbetz @liggitt @smarterclayton @wojtek-t FYI
Great finding.
> This was indeed introduced by #83520 but the actual bug is in the pager which sets both continue and resourceVersion on the subsequent pages while it should set only Continue.
Does this mean we have no tests that actually check if a reflector works with paging?
> Does this mean we have no tests that actually check if a reflector works with paging?
I see, it's the rare situation where "Watch in reflector ends with an error different than 410" that is untested. Can we set up a unit test that exercises this scenario by wrapping etcd3 storage to allow injecting an error in a watch attempt (similar to how we wrapped storage to test stale behavior in https://github.com/kubernetes/kubernetes/pull/77619/files#diff-39b0a4b10a2dce3e57aeba748879b097R1855)?
> I see, it's the rare situation where "Watch in reflector ends with an error different than 410" that is untested.
I don't think that's the case - see https://github.com/kubernetes/kubernetes/pull/85272. It clearly seems that the pager wasn't properly tested. The problem seems to be that we're testing it with artificial stuff, whereas in real life etcd would return "bad request" errors.
> The problem seems to be that we're testing it with artificial stuff, whereas in real life etcd would return "bad request" errors.
Right, I'd like to see if we can trigger the error condition in a way that the restart/paging requests go to a real storage backend.
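For example, the injection wrapper suggested above might look roughly like this (a sketch only; the type name is hypothetical, and the Watch signature shown matches recent apiserver releases rather than the 1.17-era one). Everything except the first Watch call is delegated to the real etcd3-backed storage, so the subsequent relist/paging requests hit a real backend:

```go
package storagetest

import (
	"context"
	"errors"

	"k8s.io/apimachinery/pkg/watch"
	"k8s.io/apiserver/pkg/storage"
)

// watchErrorInjectingStorage wraps a real storage.Interface (e.g. etcd3) and
// fails the first Watch call with a non-410 error, forcing the reflector to
// relist with paging while it still holds its lastSyncResourceVersion.
type watchErrorInjectingStorage struct {
	storage.Interface
	injected bool
}

func (s *watchErrorInjectingStorage) Watch(ctx context.Context, key string, opts storage.ListOptions) (watch.Interface, error) {
	if !s.injected {
		s.injected = true
		return nil, errors.New("injected watch failure")
	}
	// All other calls (Get, List, the retried Watch, ...) go to real storage.
	return s.Interface.Watch(ctx, key, opts)
}
```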
/milestone v1.18
(v1.17 code freeze)
@josiahbjorgaard - this is a regression introduced in 1.17; I actually believe we should fix it in 1.17.
/milestone v1.17
As noted above, this is a release-blocking bug.
/sig scalability
Debugging below done by @mborsz:
Failed test attempt: https://prow.k8s.io/view/gcs/kubernetes-jenkins/logs/ci-kubernetes-e2e-gce-scale-performance/1193936861135900677
It failed with a number of "WaitingFor*Pods: objects timed out" errors (46 failed objects in total).
Sample failing pods:
[measurement call WaitForControlledPodsRunning - WaitForRunningJobs error: 3 objects timed out: Jobs: test-aeoqzf-1/big-job-0, test-aeoqzf-6/big-job-0, test-aeoqzf-37/big-job-0
580 nodes were not ready at 18:37:
node_lifecycle_controller said that gce-scale-cluster-minion-group-2-66d8's status hasn't been updated for 41 seconds.
apiserver's logs show that kubelet was doing 'PUT /apis/coordination.k8s.io/v1/namespaces/kube-node-lease/leases/gce-scale-cluster-minion-group-2-66d8' every 10 seconds.
Something weird happened to the watch from 'shared-informers':
One watch started at 18:36:52 and finished 9ms later. The next one started at 18:40:34.906921
So it seems that there was a gap in the lease watch between 18:36:52 and 18:40:34.
Between 18:36:52 and 18:40:34 there are a number of lease LIST requests in the following pattern:
The kube-controller-manager kept retrying every second until 18:40:34.