kubernetes/kubernetes

[Flaking] [sig-node] MirrorPod when create a mirror pod without changes should successfully recreate when file is removed and recreated [NodeConformance] #122132

Open · pacoxu opened 7 months ago

pacoxu commented 7 months ago

Which jobs are flaking?

https://storage.googleapis.com/k8s-triage/index.html?test=MirrorPod%20when%20create%20a%20mirror%20pod%20without%20changes

[sig-node] MirrorPod when create a mirror pod without changes should successfully recreate when file is removed and recreated [NodeConformance]

Which tests are flaking?

[FAILED] Timed out after 120.001s.
Expected
    <*fmt.wrapError | 0xc0006eeda0>: 
    expected the mirror pod "graceful-pod-1fd4f565-732f-4cac-9504-69b5e487b375-tmp-node-e2e-b221d0d6-fedora-coreos-38-20231027-3-2-gcp-x86-64" to appear: pods "graceful-pod-1fd4f565-732f-4cac-9504-69b5e487b375-tmp-node-e2e-b221d0d6-fedora-coreos-38-20231027-3-2-gcp-x86-64" not found
    {
        msg: "expected the mirror pod \"graceful-pod-1fd4f565-732f-4cac-9504-69b5e487b375-tmp-node-e2e-b221d0d6-fedora-coreos-38-20231027-3-2-gcp-x86-64\" to appear: pods \"graceful-pod-1fd4f565-732f-4cac-9504-69b5e487b375-tmp-node-e2e-b221d0d6-fedora-coreos-38-20231027-3-2-gcp-x86-64\" not found",
        err: <*errors.StatusError | 0xc0008880a0>{
            ErrStatus: {
                TypeMeta: {Kind: "", APIVersion: ""},
                ListMeta: {
                    SelfLink: "",
                    ResourceVersion: "",
                    Continue: "",
                    RemainingItemCount: nil,
                },
                Status: "Failure",
                Message: "pods \"graceful-pod-1fd4f565-732f-4cac-9504-69b5e487b375-tmp-node-e2e-b221d0d6-fedora-coreos-38-20231027-3-2-gcp-x86-64\" not found",
                Reason: "NotFound",
                Details: {
                    Name: "graceful-pod-1fd4f565-732f-4cac-9504-69b5e487b375-tmp-node-e2e-b221d0d6-fedora-coreos-38-20231027-3-2-gcp-x86-64",
                    Group: "",
                    Kind: "pods",
                    UID: "",
                    Causes: nil,
                    RetryAfterSeconds: 0,
                },
                Code: 404,
            },
        },
    }
to be nil
In [BeforeEach] at: test/e2e_node/mirror_pod_grace_period_test.go:60 @ 11/18/23 19:19:43.832
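
For context, the failing step waits for the kubelet to recreate the mirror pod after the static-pod manifest file is removed and written back. A minimal sketch of that wait using client-go polling (the helper name and parameters are hypothetical, not the actual code in mirror_pod_grace_period_test.go):

```go
package e2esketch

import (
	"context"
	"time"

	v1 "k8s.io/api/core/v1"
	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/wait"
	"k8s.io/client-go/kubernetes"
)

// waitForMirrorPodRunning polls until the kubelet has recreated the mirror
// pod for a static pod, or the 2-minute budget (the timeout in the failure
// above) expires. NotFound is the expected state while the kubelet is still
// re-syncing the recreated manifest, so it is not treated as fatal.
func waitForMirrorPodRunning(ctx context.Context, cs kubernetes.Interface, ns, name string) error {
	return wait.PollUntilContextTimeout(ctx, 2*time.Second, 2*time.Minute, true,
		func(ctx context.Context) (bool, error) {
			pod, err := cs.CoreV1().Pods(ns).Get(ctx, name, metav1.GetOptions{})
			if apierrors.IsNotFound(err) {
				return false, nil // mirror pod not recreated yet; keep polling
			}
			if err != nil {
				return false, err
			}
			return pod.Status.Phase == v1.PodRunning, nil
		})
}
```

The flake reported above is the NotFound branch still being taken when the 120s budget runs out, i.e. the kubelet never re-registered the mirror pod in time.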

Since when has it been flaking?

N/A

Testgrid link

https://testgrid.k8s.io/sig-node-cri-o#ci-crio-cgroupv1-evented-pleg

https://testgrid.k8s.io/sig-release-master-informing#ci-crio-cgroupv2-node-e2e-conformance

Reason for failure (if possible)

https://prow.k8s.io/view/gs/kubernetes-jenkins/logs/ci-crio-cgroupv1-evented-pleg/1729257505558630400

Anything else we need to know?

It may be related to https://github.com/kubernetes/kubernetes/issues/121349; the related feature is https://github.com/kubernetes/enhancements/issues/3386 (Evented PLEG).

After we reverted EventedPLEG to alpha, the test still flakes in ci-crio-cgroupv2-node-e2e-conformance.
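
For context on the revert: gates demoted to alpha default to off, so the EventedPLEG code path should no longer run unless a job opts in explicitly, which makes the continued flaking in ci-crio-cgroupv2-node-e2e-conformance notable. A runnable sketch of that defaulting behavior using component-base's featuregate package (the spec below is illustrative, not the kubelet's actual registration):

```go
package main

import (
	"fmt"

	"k8s.io/component-base/featuregate"
)

func main() {
	// Register a gate the way Kubernetes components do; alpha gates
	// default to false, so demoting a gate to alpha disables it by
	// default everywhere the gate is checked.
	gates := featuregate.NewFeatureGate()
	if err := gates.Add(map[featuregate.Feature]featuregate.FeatureSpec{
		"EventedPLEG": {Default: false, PreRelease: featuregate.Alpha},
	}); err != nil {
		panic(err)
	}
	fmt.Println(gates.Enabled("EventedPLEG")) // false unless set explicitly
}
```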

Relevant SIG(s)

/sig node

k8s-ci-robot commented 7 months ago

This issue is currently awaiting triage.

If a SIG or subproject determines this is a relevant issue, they will accept it by applying the `triage/accepted` label and provide further guidance.

The `triage/accepted` label can be added by org members by writing `/triage accepted` in a comment.

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes/test-infra](https://github.com/kubernetes/test-infra/issues/new?title=Prow%20issue:) repository.
pacoxu commented 5 months ago

STEP: mirror pod should restart with count 1 - k8s.io/kubernetes/test/e2e_node/mirror_pod_test.go:180 @ 01/23/24 20:07:57.962
[FAILED] Timed out after 126.004s.
Expected
    <*fmt.wrapError | 0xc000a92ce0>: 
    expected the mirror pod "static-pod-f493ee91-48e0-4ead-a779-7984c9c9caaa-tmp-node-e2e-bc68fd6f-fedora-coreos-39-20240104-3-0-gcp-x86-64" to appear: client rate limiter Wait returned an error: rate: Wait(n=1) would exceed context deadline
    {
        msg: "expected the mirror pod \"static-pod-f493ee91-48e0-4ead-a779-7984c9c9caaa-tmp-node-e2e-bc68fd6f-fedora-coreos-39-20240104-3-0-gcp-x86-64\" to appear: client rate limiter Wait returned an error: rate: Wait(n=1) would exceed context deadline",
        err: <*fmt.wrapError | 0xc000a92cc0>{
            msg: "client rate limiter Wait returned an error: rate: Wait(n=1) would exceed context deadline",
            err: <*errors.errorString | 0xc000815a00>{
                s: "rate: Wait(n=1) would exceed context deadline",
            },
        },
    }
to be nil

BTW, the static pod becomes Running a few seconds later (2-5s in recent flaking CIs).
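
For reference, the rate-limiter error in this dump comes from client-go's token-bucket limiter, which is built on golang.org/x/time/rate: when the remaining context deadline is shorter than the wait for the next token, Wait fails immediately instead of blocking. A standalone sketch that reproduces the exact error string (the limiter values are illustrative, not the e2e client's actual QPS settings):

```go
package main

import (
	"context"
	"fmt"
	"time"

	"golang.org/x/time/rate"
)

func main() {
	// One token every 10s with burst 1: after the burst token is spent,
	// the next token is ~10s away.
	limiter := rate.NewLimiter(rate.Every(10*time.Second), 1)
	_ = limiter.Wait(context.Background()) // consume the burst token

	// The caller's deadline expires before the next token would arrive,
	// so Wait returns an error without sleeping at all.
	ctx, cancel := context.WithTimeout(context.Background(), time.Second)
	defer cancel()

	if err := limiter.Wait(ctx); err != nil {
		fmt.Println(err) // rate: Wait(n=1) would exceed context deadline
	}
}
```

client-go wraps this as "client rate limiter Wait returned an error: ...", the outer message in the dump above, which suggests the 2-minute wait budget was consumed by throttled API calls rather than by the pod itself.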

pacoxu commented 3 months ago

Flaked once in https://testgrid.k8s.io/sig-release-master-informing#ci-crio-cgroupv1-node-e2e-conformance

/cc @harche @SergeyKanzhelev

haircommander commented 1 month ago

/assign @harche @rphillips

harche commented 1 month ago

I do not see that test flaking at all in the recent runs: https://testgrid.k8s.io/sig-release-master-informing#ci-crio-cgroupv2-node-e2e-conformance

mimowo commented 1 week ago

It flaked again today on an unrelated branch: https://prow.k8s.io/view/gs/kubernetes-jenkins/pr-logs/pull/125510/pull-kubernetes-node-e2e-containerd/1803712091828260864

The triage graph shows it still happens from time to time: https://storage.googleapis.com/k8s-triage/index.html?test=should%20successfully%20recreate%20when%20file%20is%20removed%20and%20recreated
