kubernetes / kubernetes

Production-Grade Container Scheduling and Management
https://kubernetes.io
Apache License 2.0

[Flaking Test] [sig-node] ☂️ node-kubelet-serial-containerd job multiple flakes🌂 #120913

Closed. mmiranda96 closed this issue 2 months ago.

mmiranda96 commented 1 year ago

Which jobs are flaking?

node-kubelet-serial-containerd

Which tests are flaking?

There are multiple tests:

Since when has it been flaking?

Flakes have been present for a while.

Testgrid link

https://testgrid.k8s.io/sig-node-release-blocking#node-kubelet-serial-containerd

Reason for failure (if possible)

No response

Anything else we need to know?

We run each test multiple times (3). In most cases only one of the runs fails. This might not be a critical issue, but ideally we want a green Testgrid.

Relevant SIG(s)

/sig node

pacoxu commented 11 months ago

/kind failing-test
/remove-kind flake

It keeps failing. The only success I can see now, which is also the most recent one, was on 11-07.

This CI job appears on https://testgrid.k8s.io/sig-node-release-blocking#node-kubelet-serial-containerd, the sig-node release-blocking board. If it is release blocking, we should fix it ASAP. If not, we may move it to another board such as https://testgrid.k8s.io/sig-node-containerd. /cc @SergeyKanzhelev @mrunalp

pacoxu commented 11 months ago

Linking the Slack thread here: https://kubernetes.slack.com/archives/C0BP8PW9G/p1700553934108539

ffromani commented 11 months ago

/cc

SergeyKanzhelev commented 11 months ago

Device manager tests are failing because of the reconnection to socket error. Not a regression.

E2eNode Suite.[It] [sig-node] POD Resources [Serial] [Feature:PodResources][NodeFeature:PodResources] with the builtin rate limit values should hit throttling when calling podresources List in a tight loop

Another known issue, not a regression. It flakes because of how the test validates the throttling logic.

E2eNode Suite.[It] [sig-node] Density [Serial] [Slow] create a batch of pods latency/resource should be within limit when create 10 pods with 0s interval

Also not a regression; we need to take a look after the release.

ffromani commented 11 months ago

this PR wants to reduce/remove flakes: https://github.com/kubernetes/kubernetes/pull/122024

bart0sh commented 11 months ago

The latest test run failed only the density tests, and only on 2 of the 3 nodes:

What's interesting is that the node that succeeded is configured similarly to the failed ones, but its runtime metrics are much better:

The only difference I can see is that one configuration requests 2 nvidia-tesla-k80 accelerators. I'm not sure whether that is related to the density test failures, though.

k8s-triage-robot commented 8 months ago

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

You can:

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

pacoxu commented 7 months ago

/remove-lifecycle stale

This is still a good umbrella for https://testgrid.k8s.io/sig-node-release-blocking#node-kubelet-serial-containerd

/retitle [Flaking Test] [sig-node] ☂️ node-kubelet-serial-containerd job multiple flakes🌂

pacoxu commented 7 months ago

/triage accepted

AnishShah commented 2 months ago

sig-node CI meeting:

All child bugs for flaky tests are closed.

/close

k8s-ci-robot commented 2 months ago

@AnishShah: Closing this issue.

In response to [this](https://github.com/kubernetes/kubernetes/issues/120913#issuecomment-2329621821):

> sig-node CI meeting:
>
> All child bugs for flaky tests are closed.
>
> /close

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes-sigs/prow](https://github.com/kubernetes-sigs/prow/issues/new?title=Prow%20issue:) repository.