Closed qianheng-aws closed 6 hours ago
@qianheng-aws this is a great idea! Are there any downsides to this parallelism (option 1 vs. option 2)?
Here are the pros and cons of the two options; I have also added them to the description.
Option 1: Enable SBT's parallel execution on a single node.
Pros: Easy to implement.
Cons: Increases pressure on the build node and may make the integration tests unstable if parallelism is too high. It launches at most 4 (the number of CPU cores) Docker containers and JVMs. The speedup is bounded by the performance of the build node.
Option 2: Add more nodes to CI and distribute the tests evenly across them.
Pros: Can scale out to as many build nodes as we want.
Cons: Increases the complexity of the CI workflow, since the tests must be distributed to different build nodes and their reports merged once all nodes have finished. It also increases CI spending, since more build nodes are used.
Description
Enable parallel integration tests.
Based on the metrics collected:
The time cost is fairly evenly distributed across suites. Most test suites take less than 1 minute, and the longest takes no more than 7 minutes.
To reduce test execution time, we should increase parallelism, especially since we don't have any long-running test suites and all tests currently run sequentially.
TODO: Another way to reduce the average time per suite is to reuse the Docker container across suites. Bootstrapping an OpenSearch container costs around 10 seconds, so reuse would save about 10 minutes when the integration tests (currently 65 suites) run sequentially.
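The estimate in the TODO can be checked with a few lines of arithmetic. This is only a sketch: the suite count and bootstrap time are the figures quoted above, while the function name `startup_savings` and the `shared_containers` parameter are hypothetical and not part of this PR.

```python
SUITE_COUNT = 65         # integration suites, as quoted in the TODO
BOOTSTRAP_SECONDS = 10   # approximate OpenSearch container start-up time


def startup_savings(suite_count: int, bootstrap_seconds: int,
                    shared_containers: int = 1) -> int:
    """Seconds of container start-up avoided when all suites share
    `shared_containers` containers instead of starting one each."""
    if not 1 <= shared_containers <= suite_count:
        raise ValueError("shared_containers must be between 1 and suite_count")
    return (suite_count - shared_containers) * bootstrap_seconds


saved = startup_savings(SUITE_COUNT, BOOTSTRAP_SECONDS)
print(f"{saved // 60}m {saved % 60}s")  # 10m 40s
```

This counts total start-up time for a sequential run. When suites run in parallel, the wall-clock saving would be smaller, since container start-ups already overlap.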
There are 2 ways to increase parallelism:
Option 1: Enable SBT's parallel execution on a single node.
Pros: Easy to implement.
Cons: Increases pressure on the build node and may make the integration tests unstable if parallelism is too high. It launches at most 4 (the number of CPU cores on the build node) Docker containers and JVMs. The speedup is bounded by the performance of the build node.
Option 2: Add more nodes to CI and distribute the tests evenly across them.
Pros: Can scale out to as many build nodes as we want.
Cons: Increases the complexity of the CI workflow, since the tests must be distributed to different build nodes and their reports merged once all nodes have finished. It also increases CI spending, since more build nodes are used.
These two options are compatible, and we can apply both if we want. We take Option 1 as the first step, since it saves resources and does not increase the workflow's complexity.
Option 1 test, recorded integ-test time:
baseline -> 1h 3m 35s
4 groups -> 32m 17s
3 groups -> 37m 58s
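To put the recorded times in perspective, the following sketch converts them to seconds and computes the speedup over the baseline. The durations are the ones recorded above; the helper `to_seconds` is hypothetical and written only for this illustration.

```python
import re


def to_seconds(duration: str) -> int:
    """Parse a duration such as '1h 3m 35s' into seconds."""
    units = {"h": 3600, "m": 60, "s": 1}
    return sum(int(n) * units[u]
               for n, u in re.findall(r"(\d+)\s*([hms])", duration))


baseline = to_seconds("1h 3m 35s")
runs = {"4 groups": "32m 17s", "3 groups": "37m 58s"}

for label, duration in runs.items():
    seconds = to_seconds(duration)
    print(f"{label}: {baseline / seconds:.2f}x speedup, "
          f"{1 - seconds / baseline:.0%} less time")
```

With these numbers the speedup comes out at roughly 1.97x for 4 groups and 1.67x for 3 groups, which is below the ideal 4x and 3x and consistent with the single build node being the limiting factor.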
Results when shuffling the tests before splitting them into groups:
4 groups with shuffle -> 32m 42s
3 groups with shuffle -> 38m 37s
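The grouping idea itself can be sketched independently of the build tool. The description does not state how suites are assigned to groups, so the round-robin assignment below is an assumption, and `split_into_groups`, its `seed` parameter, and the suite names are all hypothetical.

```python
import random
from typing import List, Optional, Sequence


def split_into_groups(suites: Sequence[str], group_count: int,
                      seed: Optional[int] = None) -> List[List[str]]:
    """Split suites into `group_count` groups of near-equal size.

    If `seed` is given, the suites are shuffled first (reproducibly);
    otherwise the original order is kept.
    """
    if group_count < 1:
        raise ValueError("group_count must be at least 1")
    ordered = list(suites)
    if seed is not None:
        random.Random(seed).shuffle(ordered)
    # Round-robin assignment keeps group sizes within one suite of each other.
    return [ordered[i::group_count] for i in range(group_count)]


suites = [f"Suite{i:02d}" for i in range(65)]  # placeholder suite names
groups = split_into_groups(suites, 4, seed=42)
print([len(g) for g in groups])  # [17, 16, 16, 16]
```

Because the suites have similar durations, balancing by count is a reasonable proxy for balancing by time, which may explain why shuffling made little difference in the measurements above.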
Related Issues
Resolves https://github.com/opensearch-project/opensearch-spark/issues/853
Check List
--signoff
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license. For more information on following Developer Certificate of Origin and signing off your commits, please check here.