Open mimowo opened 1 week ago
/cc @tenzen-y @alculquicondor @gabesaba
FYI the previous effort (for x-ref): https://github.com/kubernetes-sigs/kueue/issues/1737
/assign
Actually, I proposed refactoring the MultiKueue Kubeflow E2E and integration testing. @mszadkow How about progressing?
@tenzen-y what is the aim of the refactoring, can you point to the discussion? How will this impact the integration test performance?
Also, before we start optimizing the status quo, I would really appreciate investigation why the time increased from 9min to 16min since https://github.com/kubernetes-sigs/kueue/issues/1737#issuecomment-1978255495, which was just 6 months ago. It could be just due to new tests, but maybe there is something else responsible (like slower machines or performance regression).
EDIT: what I mean by that is we can check first what the new integration tests added since then, and if they can really account for the additional 7min - I'm surprised by the increase just within 6 months. Maybe some of the new tests are not optimized.
For starters I propose #3035 which adds a way to report the time taken by individual tests.
When it comes to overall time consumption the trend was fairly study, the biggest bump (of around 2min) I see is around 05 Jul when race detection was added.
$ git log 35586d7539bff45e071d39f7e85ebc87e4245c97..cd89852f2c4d921e2ec51917152f8fdea80eb87d
commit cd89852f2c4d921e2ec51917152f8fdea80eb87d
Author: Irving Mondragón <IrvingMg@users.noreply.github.com>
Date: Thu Jul 4 22:51:06 2024 +0200
Remove deprecated Hugo template (#2506)
commit 6c619c6becea43415bb189067c5e94e8dcda355f
Author: Mykhailo Bobrovskyi <mikhail.bobrovsky@gmail.com>
Date: Thu Jul 4 20:48:06 2024 +0300
Runs the race detector on integration tests. (#2468)
I think the race detection proved to be useful already, so the trade off is not clear to me. I would keep it for now, but just keep in mind that we could gain 2min on changing it.
@tenzen-y what is the aim of the refactoring, can you point to the discussion? How will this impact the integration test performance?
Also, before we start optimizing the status quo, I would really appreciate investigation why the time increased from 9min to 16min since https://github.com/kubernetes-sigs/kueue/issues/1737#issuecomment-1978255495, which was just 6 months ago. It could be just due to new tests, but maybe there is something else responsible (like slower machines or performance regression).
EDIT: what I mean by that is we can check first what the new integration tests added since then, and if they can really account for the additional 7min - I'm surprised by the increase just within 6 months. Maybe some of the new tests are not optimized.
@mimowo Actually, I meant the following discussion: https://github.com/kubernetes-sigs/kueue/pull/2869#discussion_r1727318416
@mimowo Actually, I meant the following discussion: #2869 (comment)
The decision taken in this discussion sgtm, but I'm not seeing how it is related to this issue, which is focused on integration tests rather than e2e.
@mimowo Actually, I meant the following discussion: #2869 (comment)
The decision taken in this discussion sgtm, but I'm not seeing how it is related to this issue, which is focused on integration tests rather than e2e.
Oh, I was supposed to mention integration testing as well. Because we have duplicated Kubeflow MultiKueue integration testing even though the core Kubeflow MultiKueue reconcilers are commonized and the same.
I see, so your suggestion is to only keep integration tests for PyTorchJob
, and drop for other kubeflow Jobs? I'm open to this possibility but would expect some review of the time impact, and review of the test coverage provided by unit tests for the integrations.
I see, so your suggestion is to only keep integration tests for
PyTorchJob
, and drop for other kubeflow Jobs? I'm open to this possibility but would expect some review of the time impact, and review of the test coverage provided by unit tests for the integrations.
That can be easily done but I don't expect a huge gain out of that , each of the tests are taking under 3 sec to execute.
With #3039 I tried to parallelize the cluster creation for multikueue in theory that can reduce the tome with around 40s, but making the setup thread safe is more challenging then expected.
Another thing we can try is to reuse the envtest clusters and setup-manager / stop-manager operations, but this may not make too much difference with parallel running and surface new issues due to incomplete cleanups.
We have had an effort to improve the integration test performance in the past, and we've taken down the time below 10min.
However, recently the integration tests suite takes over 16min based on https://testgrid.k8s.io/sig-scheduling#periodic-kueue-test-integration-main. Specific example: https://prow.k8s.io/view/gs/kubernetes-jenkins/logs/periodic-kueue-test-integration-main/1831572082656284672
Part of the effort would be to figure out if the slow down can be attributed to more tests, or there is another reason.
The build time is important particularly during release process which takes a couple of builds.