kubernetes / kubernetes

Production-Grade Container Scheduling and Management
https://kubernetes.io
Apache License 2.0

[Flaking Test] [sig-node] ☂️ node-kubelet-serial-containerd job multiple flakes🌂 #120913

Closed. mmiranda96 closed this issue 2 months ago.

mmiranda96 commented 1 year ago

Which jobs are flaking?

node-kubelet-serial-containerd

Which tests are flaking?

There are multiple tests:

Since when has it been flaking?

Flakes have been present for a while.

Testgrid link

https://testgrid.k8s.io/sig-node-release-blocking#node-kubelet-serial-containerd

Reason for failure (if possible)

No response

Anything else we need to know?

We run each test multiple times (3). In most cases only one of the runs fails. This might not be a critical issue, but ideally we want a green Testgrid.

Relevant SIG(s)

/sig node

pacoxu commented 11 months ago

/kind failing-test
/remove-kind flake

It keeps failing. The only success I can see now, which is also the most recent one, was on 11-07.

This CI job appears on https://testgrid.k8s.io/sig-node-release-blocking#node-kubelet-serial-containerd, the sig-node release-blocking board. If it is release blocking, we should fix it ASAP. If not, we may move it to another board such as https://testgrid.k8s.io/sig-node-containerd. /cc @SergeyKanzhelev @mrunalp

pacoxu commented 11 months ago

Linking the Slack thread here: https://kubernetes.slack.com/archives/C0BP8PW9G/p1700553934108539

ffromani commented 11 months ago

/cc

SergeyKanzhelev commented 11 months ago

Device manager tests are failing because of the reconnection to socket error. Not a regression.

E2eNode Suite.[It] [sig-node] POD Resources [Serial] [Feature:PodResources][NodeFeature:PodResources] with the builtin rate limit values should hit throttling when calling podresources List in a tight loop

Another known issue, not a regression. It flakes because of how the test validates the throttling logic.

E2eNode Suite.[It] [sig-node] Density [Serial] [Slow] create a batch of pods latency/resource should be within limit when create 10 pods with 0s interval

Also not a regression; we need to take a look after the release.

ffromani commented 11 months ago

this PR wants to reduce/remove flakes: https://github.com/kubernetes/kubernetes/pull/122024

bart0sh commented 11 months ago

The latest test run failed only the density tests, and only on 2 of the 3 nodes:

What's interesting is that the node that succeeded is configured similarly to the failed ones, but its runtime metrics are much better:

The only difference I can see is that one configuration requests 2 nvidia-tesla-k80 accelerators. I'm not sure whether that is related to the density test failures, though.

k8s-triage-robot commented 8 months ago

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

You can:

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

pacoxu commented 7 months ago

/remove-lifecycle stale

This is still a good umbrella for https://testgrid.k8s.io/sig-node-release-blocking#node-kubelet-serial-containerd

/retitle [Flaking Test] [sig-node] ☂️ node-kubelet-serial-containerd job multiple flakes🌂

pacoxu commented 7 months ago

/triage accepted

AnishShah commented 2 months ago

sig-node CI meeting:

All child bugs for flaky tests are closed.

/close

k8s-ci-robot commented 2 months ago

@AnishShah: Closing this issue.

In response to [this](https://github.com/kubernetes/kubernetes/issues/120913#issuecomment-2329621821):

> sig-node CI meeting:
>
> All child bugs for flaky tests are closed.
>
> /close

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes-sigs/prow](https://github.com/kubernetes-sigs/prow/issues/new?title=Prow%20issue:) repository.