Closed drewhagen closed 1 month ago
This issue is currently awaiting triage.
If a SIG or subproject determines this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.
The triage/accepted label can be added by org members by writing /triage accepted in a comment.
👋 Hello! I saw this flake as a one-off on our board about a week ago and waited to see if it happened again, but I just noticed in triage that the error may be a pattern across other jobs (see link above).
@kubernetes/sig-node-bugs The first release cut (1.32.0-alpha.1) is due in less than a week from today, on Oct 1st, 2024. To be safe, I wanted to check whether this issue would or should block and delay release cuts, particularly this first alpha version. Please advise - thank you!
cc @seans3
For the failures on 9/22/2024 and 9/19/2024, it looks like the Pod never became ready:
```
expected pod to be running and ready, got instead
...
status:
  conditions:
  - lastProbeTime: null
    lastTransitionTime: "2024-09-22T08:26:27Z"
    status: "False"
    type: PodReadyToStartContainers
```
and...
```
containerStatuses:
- image: registry.k8s.io/pause:3.10
  imageID: ""
  lastState: {}
  name: test
  ready: false
  restartCount: 0
  started: false
  state:
    waiting:
      reason: ContainerCreating
```
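For anyone poking at this locally, here is a minimal sketch (not the e2e framework's own helper; client-go is assumed, and the namespace, pod name, and timeout below are placeholders) of the kind of readiness poll the failing assertion boils down to:

```go
package main

import (
	"context"
	"fmt"
	"os"
	"time"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/wait"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

// waitForPodReady polls until the pod reports Ready=True, or returns an error
// once the timeout expires (e.g. because the pod is stuck in ContainerCreating).
func waitForPodReady(ctx context.Context, cs kubernetes.Interface, ns, name string, timeout time.Duration) error {
	return wait.PollUntilContextTimeout(ctx, 2*time.Second, timeout, true, func(ctx context.Context) (bool, error) {
		pod, err := cs.CoreV1().Pods(ns).Get(ctx, name, metav1.GetOptions{})
		if err != nil {
			return false, err
		}
		for _, cond := range pod.Status.Conditions {
			if cond.Type == corev1.PodReady && cond.Status == corev1.ConditionTrue {
				return true, nil
			}
		}
		fmt.Printf("pod %s/%s not ready yet, phase=%s\n", ns, name, pod.Status.Phase)
		return false, nil
	})
}

func main() {
	// Build a client from the local kubeconfig.
	cfg, err := clientcmd.BuildConfigFromFlags("", os.Getenv("KUBECONFIG"))
	if err != nil {
		panic(err)
	}
	cs, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		panic(err)
	}
	// "default" and "test" are placeholders, not the names the e2e suite uses.
	if err := waitForPodReady(context.Background(), cs, "default", "test", 5*time.Minute); err != nil {
		fmt.Fprintln(os.Stderr, "pod never became ready:", err)
		os.Exit(1)
	}
}
```

With a pod stuck in ContainerCreating like the one above, the loop keeps reporting not-ready until the timeout expires, which matches the "expected pod to be running and ready" failure.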
On 9/22, where do you see the pod not becoming ready?
Also, I don't see the failure on 9/19; do you mean 9/16, or some other job?
I don't see the "expected pod to be running and ready, got instead" error in either of those.
This looks like it has since been fixed; should we keep this open?
/triage needs-information
This test has not failed in over 2 weeks, so I'm marking this closed. Please re-open if this becomes an issue again.
/close
@seans3: Closing this issue.
Which jobs are flaking?
ci-kubernetes-gce-conformance-latest-kubetest2
Which tests are flaking?
Kubernetes e2e suite.[It] [sig-node] Pods should support retrieving logs from the container over websockets [NodeConformance] [Conformance]
Since when has it been flaking?
Failed runs:
Time: 09/16/2024 07:40 UTC-5 (Prow link)
It looks like only one failure on our board, as seen in Triage: ci-kubernetes-gce-conformance-latest-kubetest2.
But there are some of these same failures across other builds; see here.
Testgrid link
https://testgrid.k8s.io/sig-release-master-blocking#Conformance%20-%20GCE%20-%20master%20-%20kubetest2
Reason for failure (if possible)
Observed:
The websocket connection failed to establish, likely due to transient network issues or connection timeouts in the CI environment. As a result, the logs from the container could not be retrieved, causing the test to fail.
Opening this issue to be safe, but perhaps we close this if it's not continued.
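For context on what the test exercises: it asks the API server for the container's logs, over a websocket in the conformance test. The sketch below is a simplification that uses client-go's ordinary streaming log request rather than the websocket upgrade, and the kubeconfig source, namespace, and pod/container names are placeholders, but it targets the same pod log subresource and fails in roughly the same place when the connection cannot be established:

```go
package main

import (
	"context"
	"fmt"
	"io"
	"os"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Build a client from the local kubeconfig (falls back to in-cluster
	// config when KUBECONFIG is unset and the process runs inside a cluster).
	cfg, err := clientcmd.BuildConfigFromFlags("", os.Getenv("KUBECONFIG"))
	if err != nil {
		panic(err)
	}
	cs, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		panic(err)
	}

	// Placeholder namespace/pod/container names, not the ones the e2e suite creates.
	req := cs.CoreV1().Pods("default").GetLogs("test", &corev1.PodLogOptions{Container: "test"})

	// This is roughly where the flake surfaces: if the connection cannot be
	// established, Stream returns an error before any log data arrives.
	stream, err := req.Stream(context.Background())
	if err != nil {
		panic(fmt.Errorf("retrieving logs failed: %w", err))
	}
	defer stream.Close()

	// Copy the streamed log output to stdout.
	if _, err := io.Copy(os.Stdout, stream); err != nil {
		panic(err)
	}
}
```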
Anything else we need to know?
The log retrieval failure could be caused by intermittent network conditions or resource misconfiguration in the CI environment. Additionally, SCP errors showed that logs were either not present on the nodes or inaccessible.
These issues might suggest node failures or problems with the services running on the test nodes, impacting the ability to retrieve logs.
Relevant SIG(s)
/sig node
/kind flake
cc @kubernetes/release-team-release-signal