kubernetes-sigs / kueue

Kubernetes-native Job Queueing
https://kueue.sigs.k8s.io
Apache License 2.0

[Flaky] when Creating a multikueue admission check Should run a kubeflow XGBoostJob #2838

Open alculquicondor opened 1 month ago

alculquicondor commented 1 month ago

What happened:

End To End MultiKueue Suite: kindest/node:v1.30.0: [It] MultiKueue when Creating a multikueue admission check Should run a kubeflow XGBoostJob on worker if admitted
{Timed out after 5.000s.
The function passed to Eventually failed at /home/prow/go/src/kubernetes-sigs/kueue/test/e2e/multikueue/e2e_test.go:688 with:
Expected object to be comparable, diff:   &v1.ReplicaStatus{
-   Active:        1,
+   Active:        0,
-   Succeeded:     0,
+   Succeeded:     1,
    Failed:        0,
    LabelSelector: nil,
    Selector:      "",
  }
 failed [FAILED] Timed out after 5.000s.
The function passed to Eventually failed at /home/prow/go/src/kubernetes-sigs/kueue/test/e2e/multikueue/e2e_test.go:688 with:
Expected object to be comparable, diff:   &v1.ReplicaStatus{
-   Active:        1,
+   Active:        0,
-   Succeeded:     0,
+   Succeeded:     1,
    Failed:        0,
    LabelSelector: nil,
    Selector:      "",
  }
In [It] at: /home/prow/go/src/kubernetes-sigs/kueue/test/e2e/multikueue/e2e_test.go:703 @ 08/15/24 06:05:55.326
}
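
To make the diff above easier to read: the assertion wanted `Active: 1` but kept observing `Succeeded: 1` until the 5s timeout. Below is a minimal, clock-free Go sketch of one way a polling assertion on a transient status can time out. It is only an illustration under stated assumptions, not a diagnosis: the thread attributes the flake to state transition bugs in the training-operator. `ReplicaStatus` here is a local stand-in with only the counters from the diff, and `Eventually` and `JobTimeline` are hypothetical helpers, not the Gomega or Kueue APIs.

```go
package main

import "fmt"

// ReplicaStatus is a local stand-in holding only the counters shown in the
// failure diff; it is not the training-operator type.
type ReplicaStatus struct {
	Active    int32
	Succeeded int32
	Failed    int32
}

// Eventually polls observe once per tick, up to maxTicks times, and reports
// whether any observation equals want. It is a hypothetical, clock-free
// stand-in for a polling assertion with a timeout.
func Eventually(observe func(tick int) ReplicaStatus, want ReplicaStatus, maxTicks int) bool {
	for tick := 0; tick < maxTicks; tick++ {
		if observe(tick) == want {
			return true
		}
	}
	return false
}

// JobTimeline returns an observer for a simulated job that is Active before
// doneAt and Succeeded from doneAt onward. startTick shifts the first poll
// to model a test that begins observing late.
func JobTimeline(doneAt, startTick int) func(tick int) ReplicaStatus {
	return func(tick int) ReplicaStatus {
		if tick+startTick < doneAt {
			return ReplicaStatus{Active: 1}
		}
		return ReplicaStatus{Succeeded: 1}
	}
}

func main() {
	want := ReplicaStatus{Active: 1}
	// Polling starts while the job is still active: the assertion passes.
	fmt.Println(Eventually(JobTimeline(3, 0), want, 5))
	// Polling starts after the job has finished: Active: 1 is never seen.
	fmt.Println(Eventually(JobTimeline(3, 3), want, 5))
}
```

In the second call every observation is `Succeeded: 1`, which matches the shape of the diff reported in the log: the expected active state is never observed within the polling window.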

What you expected to happen:

Test to pass

How to reproduce it (as minimally and precisely as possible):

Anything else we need to know?:

Environment:

alculquicondor commented 1 month ago

/assign @mszadkow

tenzen-y commented 1 month ago

/kind flake

The XGBoostJob has some state transition bugs, so we may need to either remove the test case from Kueue or fix the root bug in the training-operator.

alculquicondor commented 1 month ago

I see, thanks for the context.

@mszadkow any chance you can take a look at the training-operator code? In the meantime, let's disable this test by calling ginkgo.Skip() with an accompanying comment.

mszadkow commented 4 weeks ago

@tenzen-y Can you explain more about the transition bug? Is it a known one?

mszadkow commented 4 weeks ago

Yes, sure, I can have a look there, but like you said, I will skip the test for now.

tenzen-y commented 4 weeks ago

> @tenzen-y Can you explain more about the transition bug? Is it a known one?

For historical reasons, we used to just rerun the failed flaky tests in the TrainingOperator, so we do not have a dedicated issue for specific transitions.

However, we explained the transition issue briefly here: https://github.com/kubeflow/training-operator/issues/1711