kubernetes-sigs / kueue

Kubernetes-native Job Queueing
https://kueue.sigs.k8s.io
Apache License 2.0

Revisit optimizing the integration test runtime #2993

Open mimowo opened 1 week ago

mimowo commented 1 week ago

We had an effort to improve the integration test performance in the past, and we brought the time down to below 10 min.

However, recently the integration tests suite takes over 16min based on https://testgrid.k8s.io/sig-scheduling#periodic-kueue-test-integration-main. Specific example: https://prow.k8s.io/view/gs/kubernetes-jenkins/logs/periodic-kueue-test-integration-main/1831572082656284672

Part of the effort would be to figure out whether the slowdown can be attributed to more tests, or whether there is another reason.

The build time is particularly important during the release process, which requires a couple of builds.

mimowo commented 1 week ago

/cc @tenzen-y @alculquicondor @gabesaba

mimowo commented 1 week ago

FYI the previous effort (for x-ref): https://github.com/kubernetes-sigs/kueue/issues/1737

trasc commented 1 week ago

/assign

tenzen-y commented 6 days ago

Actually, I proposed refactoring the MultiKueue Kubeflow E2E and integration tests. @mszadkow, how about progressing with that?

mimowo commented 6 days ago

@tenzen-y what is the aim of the refactoring, and can you point to the discussion? How will it impact the integration test performance?

Also, before we start optimizing the status quo, I would really appreciate an investigation into why the time increased from 9 min to 16 min since https://github.com/kubernetes-sigs/kueue/issues/1737#issuecomment-1978255495, which was just 6 months ago. It could be just due to new tests, but maybe something else is responsible (like slower machines or a performance regression).

EDIT: what I mean by that is that we can first check which integration tests were added since then, and whether they can really account for the additional 7 min - I'm surprised by the increase within just 6 months. Maybe some of the new tests are not optimized.

trasc commented 4 days ago

For starters, I propose #3035, which adds a way to report the time taken by individual tests.
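
For context, here is a minimal sketch (assuming the suites use Ginkgo v2, and not necessarily how #3035 implements it) of surfacing per-spec durations with a ReportAfterSuite hook:

```go
package integration_test

import (
	"fmt"
	"sort"

	"github.com/onsi/ginkgo/v2"
)

// Print the ten slowest specs once the whole suite has finished.
var _ = ginkgo.ReportAfterSuite("slowest specs", func(report ginkgo.Report) {
	specs := report.SpecReports
	sort.Slice(specs, func(i, j int) bool { return specs[i].RunTime > specs[j].RunTime })
	for i, s := range specs {
		if i == 10 {
			break
		}
		fmt.Printf("%-12v %s\n", s.RunTime, s.FullText())
	}
})
```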

trasc commented 4 days ago

When it comes to overall time consumption, the trend has been fairly steady; the biggest bump (of around 2 min) I see is around 05 Jul, when race detection was added.

$ git log 35586d7539bff45e071d39f7e85ebc87e4245c97..cd89852f2c4d921e2ec51917152f8fdea80eb87d
commit cd89852f2c4d921e2ec51917152f8fdea80eb87d
Author: Irving Mondragón <IrvingMg@users.noreply.github.com>
Date:   Thu Jul 4 22:51:06 2024 +0200

    Remove deprecated Hugo template (#2506)

commit 6c619c6becea43415bb189067c5e94e8dcda355f
Author: Mykhailo Bobrovskyi <mikhail.bobrovsky@gmail.com>
Date:   Thu Jul 4 20:48:06 2024 +0300

    Runs the race detector on integration tests. (#2468)

mimowo commented 4 days ago

I think the race detection has already proved useful, so the trade-off is not clear to me. I would keep it for now, but keep in mind that we could gain 2 min by dropping it.

tenzen-y commented 4 days ago

@tenzen-y what is the aim of the refactoring, and can you point to the discussion? How will it impact the integration test performance?

Also, before we start optimizing the status quo, I would really appreciate an investigation into why the time increased from 9 min to 16 min since https://github.com/kubernetes-sigs/kueue/issues/1737#issuecomment-1978255495, which was just 6 months ago. It could be just due to new tests, but maybe something else is responsible (like slower machines or a performance regression).

EDIT: what I mean by that is that we can first check which integration tests were added since then, and whether they can really account for the additional 7 min - I'm surprised by the increase within just 6 months. Maybe some of the new tests are not optimized.

@mimowo Actually, I meant the following discussion: https://github.com/kubernetes-sigs/kueue/pull/2869#discussion_r1727318416

mimowo commented 4 days ago

@mimowo Actually, I meant the following discussion: #2869 (comment)

The decision taken in this discussion sgtm, but I'm not seeing how it is related to this issue, which is focused on integration tests rather than e2e.

tenzen-y commented 4 days ago

@mimowo Actually, I meant the following discussion: #2869 (comment)

The decision taken in this discussion sgtm, but I'm not seeing how it is related to this issue, which is focused on integration tests rather than e2e.

Oh, I meant to mention integration testing as well, because we have duplicated Kubeflow MultiKueue integration tests even though the core Kubeflow MultiKueue reconcilers are commonized and effectively the same.

mimowo commented 4 days ago

I see, so your suggestion is to keep integration tests only for PyTorchJob and drop them for the other Kubeflow Jobs? I'm open to this possibility, but I would expect some review of the time impact and of the test coverage provided by unit tests for these integrations.

trasc commented 4 days ago

I see, so your suggestion is to keep integration tests only for PyTorchJob and drop them for the other Kubeflow Jobs? I'm open to this possibility, but I would expect some review of the time impact and of the test coverage provided by unit tests for these integrations.

That can be done easily, but I don't expect a huge gain from it; each of these tests takes under 3 sec to execute.

With #3039 I tried to parallelize the cluster creation for MultiKueue; in theory that can reduce the time by around 40 s, but making the setup thread-safe is more challenging than expected.
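
For illustration, a minimal sketch (assuming controller-runtime's envtest; the helper name and the cluster count are hypothetical, not what #3039 actually does) of starting the MultiKueue control planes concurrently with an errgroup:

```go
package multikueue_test

import (
	"golang.org/x/sync/errgroup"
	"k8s.io/client-go/rest"
	"sigs.k8s.io/controller-runtime/pkg/envtest"
)

// startClustersInParallel boots several envtest control planes at once,
// e.g. one management cluster plus two workers (hypothetical count).
func startClustersInParallel() ([]*rest.Config, []*envtest.Environment, error) {
	envs := []*envtest.Environment{{}, {}, {}}
	cfgs := make([]*rest.Config, len(envs))

	var g errgroup.Group
	for i, env := range envs {
		i, env := i, env // capture loop variables for the goroutine
		g.Go(func() error {
			// Start boots an apiserver/etcd pair for this cluster; the calls
			// run concurrently here, which is where the thread-safety work lies.
			cfg, err := env.Start()
			if err != nil {
				return err
			}
			cfgs[i] = cfg
			return nil
		})
	}
	if err := g.Wait(); err != nil {
		return nil, nil, err
	}
	return cfgs, envs, nil
}
```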

Another thing we can try is to reuse the envtest clusters and the setup-manager / stop-manager operations, but this may not make much difference with parallel running, and it could surface new issues due to incomplete cleanups.
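
A minimal sketch of that idea (assuming controller-runtime; setupManager / stopManager are hypothetical stand-ins for the framework's real helpers): reuse one long-lived envtest cluster and only restart the manager between suites.

```go
package framework_test

import (
	"context"

	"k8s.io/client-go/rest"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/envtest"
)

var (
	// One long-lived cluster shared by several suites instead of a fresh
	// apiserver/etcd pair per suite.
	testEnv       = &envtest.Environment{}
	cancelManager context.CancelFunc
)

// startCluster boots the shared cluster once, before the first suite runs.
func startCluster() (*rest.Config, error) {
	return testEnv.Start()
}

// setupManager builds a fresh manager against the shared cluster; each suite
// would register its own controllers on mgr before returning.
func setupManager(cfg *rest.Config) error {
	mgr, err := ctrl.NewManager(cfg, ctrl.Options{})
	if err != nil {
		return err
	}
	var ctx context.Context
	ctx, cancelManager = context.WithCancel(context.Background())
	go func() { _ = mgr.Start(ctx) }()
	return nil
}

// stopManager stops the controllers but keeps the apiserver/etcd pair running,
// so the next suite must delete any leftover objects itself - the
// "incomplete cleanups" risk mentioned above.
func stopManager() {
	cancelManager()
}
```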