kubernetes / kubernetes

Production-Grade Container Scheduling and Management
https://kubernetes.io
Apache License 2.0

[Flaky test] GCE Conformance Kubernetes e2e suite.[It] [sig-node] Pods should support retrieving logs from the container over websockets [NodeConformance] [Conformance] #127610

Closed: drewhagen closed this issue 1 month ago

drewhagen commented 1 month ago

Which jobs are flaking?

Which tests are flaking?

Kubernetes e2e suite.[It] [sig-node] Pods should support retrieving logs from the container over websockets [NodeConformance] [Conformance]

Since when has it been flaking? Failed runs:

Time: 09/16/2024 07:40 UTC -5 Prow link

Looks like only one failure on our board as seen in Triage: ci-kubernetes-gce-conformance-latest-kubetest2

But there are some of these same failures across other builds, see here.

Testgrid link

https://testgrid.k8s.io/sig-release-master-blocking#Conformance%20-%20GCE%20-%20master%20-%20kubetest2

Reason for failure (if possible)

Observed:

[FAILED] Failed to open websocket to wss://34.41.155.79/api/v1/namespaces/pods-7055/pods/pod-logs-websocket-7573cd2f-65c0-4463-9b77-b90753b2b727/log?container=main: websocket.Dial wss://34.41.155.79/api/v1/namespaces/pods-7055/pods/pod-logs-websocket-7573cd2f-65c0-4463-9b77-b90753b2b727/log?container=main: dial tcp 34.41.155.79:443: connect: connection timed out

The websocket connection failed to establish, likely due to transient network issues or connection timeouts in the CI environment. As a result, the logs from the container could not be retrieved, causing the test to fail.
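For context, the test fetches the pod's `log` subresource over a websocket rather than plain HTTP. Below is a rough sketch of that kind of dial, assuming `golang.org/x/net/websocket` as the `websocket.Dial` in the error text suggests; the address, namespace, and pod name are placeholders, and the real e2e test also wires in the cluster's TLS and bearer-token config rather than dialing bare like this.

```go
package main

import (
	"fmt"
	"io"
	"log"

	"golang.org/x/net/websocket"
)

func main() {
	// Placeholder values: the failing run targeted a GCE master IP and a
	// generated namespace/pod name.
	apiServer := "34.41.155.79"
	namespace := "pods-7055"
	pod := "pod-logs-websocket-example"

	url := fmt.Sprintf("wss://%s/api/v1/namespaces/%s/pods/%s/log?container=main",
		apiServer, namespace, pod)

	// websocket.Dial is where the reported "connect: connection timed out"
	// surfaced; the real test also supplies auth and TLS configuration.
	ws, err := websocket.Dial(url, "", "https://"+apiServer)
	if err != nil {
		log.Fatalf("failed to open websocket to %s: %v", url, err)
	}
	defer ws.Close()

	// Stream the log output until the server closes the connection.
	if _, err := io.Copy(log.Writer(), ws); err != nil {
		log.Printf("read error: %v", err)
	}
}
```

A `connection timed out` at this step means the TCP connection to the apiserver on :443 never completed at all, which points at the CI environment or network path rather than the log endpoint itself.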

Opening this issue to be safe, but we can perhaps close it if the failure doesn't recur.

Anything else we need to know?

The log retrieval failure could be caused by intermittent network conditions or resource misconfigurations in the CI environment. Additionally, SCP errors showed that logs were not present or inaccessible on the nodes:

/usr/bin/scp: /var/log/cluster-autoscaler.log*: No such file or directory
/usr/bin/scp: /var/log/fluentd.log*: No such file or directory

These issues might suggest node failures or problems with the services running on the test nodes, impacting the ability to retrieve logs.

Relevant SIG(s)

/sig node
/kind flake
cc @kubernetes/release-team-release-signal

k8s-ci-robot commented 1 month ago

This issue is currently awaiting triage.

If a SIG or subproject determines this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes-sigs/prow](https://github.com/kubernetes-sigs/prow/issues/new?title=Prow%20issue:) repository.
drewhagen commented 1 month ago

👋 Hello! I saw this flake as a one-off on our board about a week ago and waited to see if it happened again, but I just noticed in Triage that the error may be a pattern across other jobs (see link above).

@kubernetes/sig-node-bugs The first release cut (1.32.0-alpha.1) is due in less than a week, on Oct 1st 2024. To be safe, I wanted to check whether this issue should block or delay release cuts, particularly this first alpha version. Please advise - thank you!

liggitt commented 1 month ago

cc @seans3

seans3 commented 1 month ago

For the failures on 9/22/2024 and 9/19/2024, it looks like the Pod never became ready:

```
expected pod to be running and ready, got instead
...
status:
  conditions:
  - lastProbeTime: null
    lastTransitionTime: "2024-09-22T08:26:27Z"
    status: "False"
    type: PodReadyToStartContainers
```

and...

```
containerStatuses:
- image: registry.k8s.io/pause:3.10
  imageID: ""
  lastState: {}
  name: test
  ready: false
  restartCount: 0
  started: false
  state:
    waiting:
      reason: ContainerCreating
```
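For reference, the "expected pod to be running and ready" check boils down to polling the pod until its phase is `Running` and its `Ready` condition is `True`. A minimal client-go sketch of that idea follows; the namespace, pod name, and timeout are illustrative placeholders, not the e2e framework's actual helper.

```go
package main

import (
	"context"
	"fmt"
	"time"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/wait"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

// waitForPodRunningAndReady polls until the pod reports phase Running and a
// Ready condition of "True", or the timeout expires.
func waitForPodRunningAndReady(ctx context.Context, cs kubernetes.Interface, ns, name string, timeout time.Duration) error {
	// PollUntilContextTimeout is available in recent apimachinery releases.
	return wait.PollUntilContextTimeout(ctx, 2*time.Second, timeout, true,
		func(ctx context.Context) (bool, error) {
			pod, err := cs.CoreV1().Pods(ns).Get(ctx, name, metav1.GetOptions{})
			if err != nil {
				return false, err
			}
			if pod.Status.Phase != corev1.PodRunning {
				return false, nil // still Pending / ContainerCreating, keep polling
			}
			for _, cond := range pod.Status.Conditions {
				if cond.Type == corev1.PodReady {
					return cond.Status == corev1.ConditionTrue, nil
				}
			}
			return false, nil
		})
}

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	cs := kubernetes.NewForConfigOrDie(cfg)
	// Example pod coordinates; the failing test generates these names.
	if err := waitForPodRunningAndReady(context.Background(), cs, "pods-7055", "pod-logs-websocket-example", 5*time.Minute); err != nil {
		fmt.Println("pod never became running and ready:", err)
	}
}
```

With the status above, such a poll would keep returning false while the container sits in `ContainerCreating` and eventually time out, which matches the failure text.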
BenTheElder commented 1 month ago

On 9/22 where do you see the pod not becoming ready?

https://prow.k8s.io/view/gs/kubernetes-jenkins/logs/ci-kubernetes-gce-conformance-latest-kubetest2/1835659785358282752

Also I don't see the failure on 9/16, do you mean 9/16, or some other job?

9/16: https://prow.k8s.io/view/gs/kubernetes-jenkins/logs/ci-kubernetes-gce-conformance-latest-kubetest2/1835659785358282752

I don't see the "expected pod to be running and ready, got instead" in either of those.

haircommander commented 1 month ago

This looks like it has since been fixed; should we keep this open?

/triage needs-information

seans3 commented 1 month ago

This test has not failed in over 2 weeks, so I'm marking this closed. Please re-open if this becomes an issue again.

/close

k8s-ci-robot commented 1 month ago

@seans3: Closing this issue.

In response to [this](https://github.com/kubernetes/kubernetes/issues/127610#issuecomment-2398017171):

>This test has not failed in over 2 weeks, so I'm marking this closed. Please re-open if this becomes an issue again.
>
>/close

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes-sigs/prow](https://github.com/kubernetes-sigs/prow/issues/new?title=Prow%20issue:) repository.