karmada-io / karmada

Open, Multi-Cloud, Multi-Cluster Kubernetes Orchestration
https://karmada.io
Apache License 2.0
4.37k stars 865 forks source link

flake test: daemonSet status collection testing failed #5257

Closed chaosi-zju closed 3 weeks ago

chaosi-zju commented 1 month ago

Which jobs are flaking:

https://github.com/karmada-io/karmada/actions/runs/10105441369/job/27946023672?pr=5255

Which test(s) are flaking:

• [FAILED] [300.028 seconds]
[resource-status collection] resource status collection testing DaemonSetStatus collection testing [It] daemonSet status collection testing
/home/runner/work/karmada/karmada/test/e2e/resource_test.go:441

  Captured StdOut/StdErr Output >>
  I0726 04:10:56.075240   47860 resource_test.go:445] Waiting for daemonSet(karmadatest-x298m/daemonset-9strv) collecting correctly status
  I0726 04:10:56.082780   47860 resource_test.go:450] daemonSet(karmadatest-x298m/daemonset-9strv) replicas: 0, wanted replicas: 3
  I0726 04:11:01.087288   47860 resource_test.go:450] daemonSet(karmadatest-x298m/daemonset-9strv) replicas: 1, wanted replicas: 3
  I0726 04:11:06.091751   47860 resource_test.go:450] daemonSet(karmadatest-x298m/daemonset-9strv) replicas: 1, wanted replicas: 3
  I0726 04:11:11.096436   47860 resource_test.go:450] daemonSet(karmadatest-x298m/daemonset-9strv) replicas: 1, wanted replicas: 3
 ......
  << Captured StdOut/StdErr Output

  Timeline >>
  STEP: Creating DaemonSet(karmadatest-x298m/daemonset-9strv) @ 07/26/24 04:10:56.058
  STEP: Creating PropagationPolicy(karmadatest-x298m/daemonset-9strv) @ 07/26/24 04:10:56.065
  STEP: check whether the daemonSet status can be correctly collected @ 07/26/24 04:10:56.075
  [FAILED] in [It] - /home/runner/work/karmada/karmada/test/e2e/resource_test.go:460 @ 07/26/24 04:15:56.076
  STEP: Removing PropagationPolicy(karmadatest-x298m/daemonset-9strv) @ 07/26/24 04:15:56.077
  STEP: Removing DaemonSet(karmadatest-x298m/daemonset-9strv) @ 07/26/24 04:15:56.082
  << Timeline

  [FAILED] Timed out after 300.001s.
  Expected
      <bool>: false
  to equal
      <bool>: true
  In [It] at: /home/runner/work/karmada/karmada/test/e2e/resource_test.go:460 @ 07/26/24 04:15:56.076

  Full Stack Trace
    github.com/karmada-io/karmada/test/e2e.init.func40.7.3.1()
        /home/runner/work/karmada/karmada/test/e2e/resource_test.go:460 +0x46b
    github.com/karmada-io/karmada/test/e2e.init.func40.7.3()
        /home/runner/work/karmada/karmada/test/e2e/resource_test.go:442 +0x10a

Reason for failure:

the desired replicas of this daemonset is 3, actually the currentNumberScheduled is already 3, but 2 of them is unavailable, we mistakenly thought that the status collection failed.

2024-07-26T04:10:57.315030082Z stderr F I0726 04:10:57.314951       1 aggregatestatus.go:291] Grab daemonSet(karmadatest-x298m/daemonset-9strv) status from cluster(member1), currentNumberScheduled: 1, desiredNumberScheduled: 1, numberAvailable: 1, numberMisscheduled: 0, numberReady: 1, updatedNumberScheduled: 1, numberUnavailable: 0
2024-07-26T04:10:57.315038518Z stderr F I0726 04:10:57.314972       1 aggregatestatus.go:291] Grab daemonSet(karmadatest-x298m/daemonset-9strv) status from cluster(member2), currentNumberScheduled: 1, desiredNumberScheduled: 1, numberAvailable: 0, numberMisscheduled: 0, numberReady: 0, updatedNumberScheduled: 1, numberUnavailable: 1
2024-07-26T04:10:57.315043327Z stderr F I0726 04:10:57.314981       1 aggregatestatus.go:291] Grab daemonSet(karmadatest-x298m/daemonset-9strv) status from cluster(member3), currentNumberScheduled: 1, desiredNumberScheduled: 1, numberAvailable: 0, numberMisscheduled: 0, numberReady: 0, updatedNumberScheduled: 1, numberUnavailable: 1

Anything else we need to know:

chaosi-zju commented 1 month ago

As for the purpose of verifying collected daemonSet status result, we just check status.currentNumberScheduled maybe is enough? No need to cost time to wait every pod available.

liangyuanpeng commented 1 month ago

As for the purpose of verifying collected daemonSet status result, we just check status.currentNumberScheduled maybe is enough? No need to cost time to wait every pod available.

It depends on what we want to test. If we are testing PP, I think we may need to wait for the workload to be ready. Otherwise, how can prove that this is not a problem with karmada?

chaosi-zju commented 1 month ago

Thank you for your opinion, and let me share my views:

I think we may need to wait for the workload to be ready. Otherwise, how can prove that this is not a problem with karmada?

Unlike CurrentNumberScheduled, if the value of CurrentNumberScheduled is not expected, it means that there must be a problem with the internal code of Karmada, and this test is valuable.

However, if the value of NumberReady is not expected, it does not mean that there must be a problem with Karmada. It is very likely that the member cluster itself is the problem, so this test judgment is a bit speculative.

It depends on what we want to test.

I agree at this point. If you want to test a complete process for replicas propagation, replicas ready is a necessary condition.

But, as the test described, check whether the daemonSet status can be correctly collected, yet NumberReady not equal wantedReplicas doesn't mean status collected failed. In the case of correct collection, inequality of NumberReady and wantedReplicas is allowed.

Besides, to some extent, fields such as currentNumberScheduled is enough to verify that the whether status collection process success.

XiShanYongYe-Chang commented 1 month ago

It depends on what we want to test. If we are testing PP, I think we may need to wait for the workload to be ready. Otherwise, how can prove that this is not a problem with karmada?

Hi @liangyuanpeng, in the current test, we are testing the aggregation of the workload resource status. We really should test that each field in the status is as expected. However, in the CI test environment, some workloads take a long time to pull images from member clusters and run successfully. Even more than the waiting time we set (a full 7 minutes), then we can not sacrifice a little bit of field detection, such as status replicas in ready (The logic for processing other fields is the same as that for processing this field. Therefore, this field can be considered as an equivalence test.), which will speed up our testing. How do you think?

chaosi-zju commented 1 month ago

hi @liangyuanpeng, any more opinions? Do you still insist on your previous views?

liangyuanpeng commented 1 month ago

Thanks for your clear! it's make sense for me.

XiShanYongYe-Chang commented 1 month ago

In favor of #5258 /assign @chaosi-zju