kubernetes-sigs / scheduler-plugins

Repository for out-of-tree scheduler plugins based on scheduler framework.
Apache License 2.0

[Flaky test] TestTopologyMatchPlugin #573

Closed Huang-Wei closed 1 year ago

Huang-Wei commented 1 year ago

It looks like `TestTopologyMatchPlugin` is flaky: https://prow.k8s.io/view/gs/kubernetes-jenkins/pr-logs/pull/kubernetes-sigs_scheduler-plugins/572/pull-scheduler-plugins-integration-test-master/1644622084178972672

A log snippet:

--- FAIL: TestTopologyMatchPlugin (125.22s)
    noderesourcetopology_test.go:141: The CRD is ready to serve
    noderesourcetopology_test.go:196: Init scheduler success
    noderesourcetopology_test.go:214:  Node fake-node-1 created: &Node{ObjectMeta:{fake-node-1    59dccf35-37a9-439b-94f3-8d09905de003 1375 0 2023-04-08 08:49:03 +0000 UTC <nil> <nil> map[node:fake-node-1] map[] [] [] [{unused Update v1 2023-04-08 08:49:03 +0000 UTC FieldsV1 {"f:metadata":{"f:labels":{".":{},"f:node":{}}}} }]},Spec:NodeSpec{PodCIDR:,DoNotUseExternalID:,ProviderID:,Unschedulable:false,Taints:[]Taint{},ConfigSource:nil,PodCIDRs:[],},Status:NodeStatus{Capacity:ResourceList{cpu: {{64 0} {<nil>} 64 DecimalSI},hugepages-2Mi: {{939524096 0} {<nil>}  BinarySI},memory: {{137438953472 0} {<nil>}  BinarySI},pods: {{32 0} {<nil>} 32 DecimalSI},vendor/nic1: {{48 0} {<nil>} 48 DecimalSI},},Allocatable:ResourceList{cpu: {{64 0} {<nil>} 64 DecimalSI},hugepages-2Mi: {{939524096 0} {<nil>}  BinarySI},memory: {{137438953472 0} {<nil>}  BinarySI},pods: {{32 0} {<nil>} 32 DecimalSI},vendor/nic1: {{48 0} {<nil>} 48 DecimalSI},},Phase:,Conditions:[]NodeCondition{},Addresses:[]NodeAddress{},DaemonEndpoints:NodeDaemonEndpoints{KubeletEndpoint:DaemonEndpoint{Port:0,},},NodeInfo:NodeSystemInfo{MachineID:,SystemUUID:,BootID:,KernelVersion:,OSImage:,ContainerRuntimeVersion:,KubeletVersion:,KubeProxyVersion:,OperatingSystem:,Architecture:,},Images:[]ContainerImage{},VolumesInUse:[],VolumesAttached:[]AttachedVolume{},Config:nil,},}
    noderesourcetopology_test.go:214:  Node fake-node-2 created: &Node{ObjectMeta:{fake-node-2    6207a9fa-67d9-4677-b070-0d388c1bd3b8 1376 0 2023-04-08 08:49:03 +0000 UTC <nil> <nil> map[node:fake-node-2] map[] [] [] [{unused Update v1 2023-04-08 08:49:03 +0000 UTC FieldsV1 {"f:metadata":{"f:labels":{".":{},"f:node":{}}}} }]},Spec:NodeSpec{PodCIDR:,DoNotUseExternalID:,ProviderID:,Unschedulable:false,Taints:[]Taint{},ConfigSource:nil,PodCIDRs:[],},Status:NodeStatus{Capacity:ResourceList{cpu: {{64 0} {<nil>} 64 DecimalSI},hugepages-2Mi: {{939524096 0} {<nil>}  BinarySI},memory: {{137438953472 0} {<nil>}  BinarySI},pods: {{32 0} {<nil>} 32 DecimalSI},vendor/nic1: {{48 0} {<nil>} 48 DecimalSI},},Allocatable:ResourceList{cpu: {{64 0} {<nil>} 64 DecimalSI},hugepages-2Mi: {{939524096 0} {<nil>}  BinarySI},memory: {{137438953472 0} {<nil>}  BinarySI},pods: {{32 0} {<nil>} 32 DecimalSI},vendor/nic1: {{48 0} {<nil>} 48 DecimalSI},},Phase:,Conditions:[]NodeCondition{},Addresses:[]NodeAddress{},DaemonEndpoints:NodeDaemonEndpoints{KubeletEndpoint:DaemonEndpoint{Port:0,},},NodeInfo:NodeSystemInfo{MachineID:,SystemUUID:,BootID:,KernelVersion:,OSImage:,ContainerRuntimeVersion:,KubeletVersion:,KubeProxyVersion:,OperatingSystem:,Architecture:,},Images:[]ContainerImage{},VolumesInUse:[],VolumesAttached:[]AttachedVolume{},Config:nil,},}
    noderesourcetopology_test.go:218: NodeList: &NodeList{ListMeta:{ 1376  <nil>},Items:[]Node{Node{ObjectMeta:{fake-node-1    59dccf35-37a9-439b-94f3-8d09905de003 1375 0 2023-04-08 08:49:03 +0000 UTC <nil> <nil> map[node:fake-node-1] map[] [] [] [{unused Update v1 2023-04-08 08:49:03 +0000 UTC FieldsV1 {"f:metadata":{"f:labels":{".":{},"f:node":{}}}} }]},Spec:NodeSpec{PodCIDR:,DoNotUseExternalID:,ProviderID:,Unschedulable:false,Taints:[]Taint{},ConfigSource:nil,PodCIDRs:[],},Status:NodeStatus{Capacity:ResourceList{cpu: {{64 0} {<nil>} 64 DecimalSI},hugepages-2Mi: {{939524096 0} {<nil>}  BinarySI},memory: {{137438953472 0} {<nil>}  BinarySI},pods: {{32 0} {<nil>} 32 DecimalSI},vendor/nic1: {{48 0} {<nil>} 48 DecimalSI},},Allocatable:ResourceList{cpu: {{64 0} {<nil>} 64 DecimalSI},hugepages-2Mi: {{939524096 0} {<nil>}  BinarySI},memory: {{137438953472 0} {<nil>}  BinarySI},pods: {{32 0} {<nil>} 32 DecimalSI},vendor/nic1: {{48 0} {<nil>} 48 DecimalSI},},Phase:,Conditions:[]NodeCondition{},Addresses:[]NodeAddress{},DaemonEndpoints:NodeDaemonEndpoints{KubeletEndpoint:DaemonEndpoint{Port:0,},},NodeInfo:NodeSystemInfo{MachineID:,SystemUUID:,BootID:,KernelVersion:,OSImage:,ContainerRuntimeVersion:,KubeletVersion:,KubeProxyVersion:,OperatingSystem:,Architecture:,},Images:[]ContainerImage{},VolumesInUse:[],VolumesAttached:[]AttachedVolume{},Config:nil,},},Node{ObjectMeta:{fake-node-2    6207a9fa-67d9-4677-b070-0d388c1bd3b8 1376 0 2023-04-08 08:49:03 +0000 UTC <nil> <nil> map[node:fake-node-2] map[] [] [] [{unused Update v1 2023-04-08 08:49:03 +0000 UTC FieldsV1 {"f:metadata":{"f:labels":{".":{},"f:node":{}}}} }]},Spec:NodeSpec{PodCIDR:,DoNotUseExternalID:,ProviderID:,Unschedulable:false,Taints:[]Taint{},ConfigSource:nil,PodCIDRs:[],},Status:NodeStatus{Capacity:ResourceList{cpu: {{64 0} {<nil>} 64 DecimalSI},hugepages-2Mi: {{939524096 0} {<nil>}  BinarySI},memory: {{137438953472 0} {<nil>}  BinarySI},pods: {{32 0} {<nil>} 32 DecimalSI},vendor/nic1: {{48 0} {<nil>} 48 
DecimalSI},},Allocatable:ResourceList{cpu: {{64 0} {<nil>} 64 DecimalSI},hugepages-2Mi: {{939524096 0} {<nil>}  BinarySI},memory: {{137438953472 0} {<nil>}  BinarySI},pods: {{32 0} {<nil>} 32 DecimalSI},vendor/nic1: {{48 0} {<nil>} 48 DecimalSI},},Phase:,Conditions:[]NodeCondition{},Addresses:[]NodeAddress{},DaemonEndpoints:NodeDaemonEndpoints{KubeletEndpoint:DaemonEndpoint{Port:0,},},NodeInfo:NodeSystemInfo{MachineID:,SystemUUID:,BootID:,KernelVersion:,OSImage:,ContainerRuntimeVersion:,KubeletVersion:,KubeProxyVersion:,OperatingSystem:,Architecture:,},Images:[]ContainerImage{},VolumesInUse:[],VolumesAttached:[]AttachedVolume{},Config:nil,},},},}
    --- FAIL: TestTopologyMatchPlugin/[18][tier1]_multi_init_containers_with_good_allocation,_multi-containers_with_cpu_over_allocation_-_not_fit (20.04s)
        noderesourcetopology_test.go:1219: Start-topology-match-test [18][tier1] multi init containers with good allocation, multi-containers with cpu over allocation - not fit
        noderesourcetopology_test.go:1229: Creating Pod "topology-aware-scheduler-pod-10"
        noderesourcetopology_test.go:1276: pod "topology-aware-scheduler-pod-10" scheduling should failed, error: timed out waiting for the condition
        noderesourcetopology_test.go:1280: Case [18][tier1] multi init containers with good allocation, multi-containers with cpu over allocation - not fit finished

@ffromani could you help find an owner to deflake it? 🙇‍♂️

Huang-Wei commented 1 year ago

Another flake: https://prow.k8s.io/view/gs/kubernetes-jenkins/pr-logs/pull/kubernetes-sigs_scheduler-plugins/570/pull-scheduler-plugins-integration-test-master/1645838489662525440

ffromani commented 1 year ago

Got it. I'm in a bit of a rush to wrap up before KubeCon, but I'll give some bandwidth to this task during the rest of the week.

Huang-Wei commented 1 year ago

Thanks @ffromani!

ffromani commented 1 year ago

I'm looking at the failures and also setting up continuous runs in a local env of mine. The goal is of course to deflake the tests.

ffromani commented 1 year ago

/reopen

we're improving, but not there yet

k8s-ci-robot commented 1 year ago

@ffromani: Reopened this issue.

In response to [this](https://github.com/kubernetes-sigs/scheduler-plugins/issues/573#issuecomment-1558905916):

> /reopen
>
> we're improving, but not there yet

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes/test-infra](https://github.com/kubernetes/test-infra/issues/new?title=Prow%20issue:) repository.

ffromani commented 1 year ago

https://github.com/kubernetes-sigs/scheduler-plugins/pull/591 should fix this issue for good

ffromani commented 1 year ago

/close

#591 seems to have fixed all known issues. Kudos to @PiotrProkop !

k8s-ci-robot commented 1 year ago

@ffromani: Closing this issue.

In response to [this](https://github.com/kubernetes-sigs/scheduler-plugins/issues/573#issuecomment-1568553763):

> /close
>
> #591 seems to have fixed all known issues. Kudos to @PiotrProkop !

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes/test-infra](https://github.com/kubernetes/test-infra/issues/new?title=Prow%20issue:) repository.