YosuaMichael commented 1 year ago

Recently I experiment with torchvision test on #6992 and I found the result is quite promising. I create this issue to give summary of the experiment, propose to separate the model test (test_models.py and test_backbone_utils.py) from other test, and ask for feedback.

Highlights

By separating the model test (test_models.py and test_backbone_utils.py) we can have:

Faster test waiting time for developer (Reduce waiting time from 56 minutes to 38 minutes, ~32% improvement)
Sum of all test time (the machine time) is reduced from 617 minutes to 415 minutes (~33% reduction) and more over the machine type for non-model test can be reduced to use machine that are 2 times cheaper.
Some test has too much input variations and this can be reduced to speedup (for example: test_augmix)

Action plan:

After talking with @osalpekar , I will separate the model test for the workflow that already using GHA. And this will break down the progress by the OS.

[ ] Separate model test for linux-cpu in GHA [In progress]
[ ] Separate model test for macos-cpu in GHA [Waiting for the workflow to move to GHA]
[ ] Separate model test for windows-cpu in GHA [Waiting for the workflow to move to GHA]
[x] Discuss on test that we may be able to speedup by reducing number of input variation
[ ] Create a PR to speedup augmix by reducing input variation

Experiments

Analyze the test running time

First thing that I did is to use --durations 5000 for the linux test with python 3.8 (both cpu and gpu) so it will print each test durations (here is the data on google sheet).

For the raw data, each row correspond to a test with specific arguments or inputs, for instance a test_classification_model for resnet50 model. From this, I aggregate by the test function and the test file to see a more aggregated result.

Here are the interesting findings on this data:

We could see that the two slowest test files on cpu are test_models.py and test_backbone_utils.py which take 34.36% and 28.46% of the total test duration respectively. On GPU, we could see that test_models.py took 61.08% of the total test duration.
test_transforms.py::test_augmix took 145.6 seconds, whereas all test_transforms.py took 172.7 seconds (~ 84.3% of total test_transforms.py duration is from test_augmix alone)

Experiment on separating model and backbone_utils test

Both test_models.py and test_backbone_utils.py are pretty similar in a sense that they are testing the models. For big models, they require high memory usage and can be pretty slow. Since they are testing models, the API that we are testing are pretty high level, hence we might not need to run the test on all python version (since the low level operator should be tested on all python level already).

From this reasoning, I think it would be beneficial to separate the model test (I will refer both test_models.py and test_backbone_utils.py as model test) and only run it on one python version only (Consideration on running on all python version for main and nightly branch only like how we do gpu test).

I did experiment on separating them, and I use pytest marker to do it. Here are some main things that I did to separate the test:

Add global variable pytestmark = pytest.mark.slow on the test_models.py and test_backbone_utils.py to mark all the test inside as pytest.mark.slow
Modify pytest.ini addopts to have -m "not slow" so by default it will skip the test marked as slow
Modify .circleci/regenerate.py and .circleci/config.yml.in so that it accept pytest_additional_args as parameters, and we will pass this to run_test.sh scripts to add -m "slow" to run only the model test or -m "slow or not slow" to run all test

(for more detail see #6992)

After separating the model test, I notice that in linux gpu we have a relatively high overhead for installing environment. Because of this overhead, we decide that it is not worth it to separate the test on GPU, hence we modify .circleci/regenerate.py so that we only separate the model test on CPU.

Experiment on reducing machine type for non model test

After separating the model test, we have idea to reduce the machine type for the non-model test. The intuition is that we previously need higher machine type because the model test require high memory, hence if we separate the model test we can run the non-model test with lower spec machine.

So initially we run the test on 2xlarge+ machine for linux and large machine for macos. We found that for non-model test we could use xlarge machine for linux and medium machine for macos instead, and since the model test is the slowest part it does not really affect the overall waiting time for all test to finished! This would potentially a good cost saving.

For more detail on experiment, see this docs.

Experiment on optimizing test_augmix

Once we separate the model test, test_augmix become very significant among the non-model test, to be precise it occupy around 26% of the non-model test durations. Hence it would be good if we can speedup this test.

Looking at our data, seems like test_augmix is called 97 times (it is called with 96 different input, and 1 time for setup). The first thing that we want to try is to reduce the amount of input variation for this test.

Looking at the code, currently they have input parameters as follow:

@pytest.mark.parametrize("fill", [None, 85, (128, 128, 128)])
@pytest.mark.parametrize("severity", [1, 10])
@pytest.mark.parametrize("mixture_width", [1, 2])
@pytest.mark.parametrize("chain_depth", [-1, 2])
@pytest.mark.parametrize("all_ops", [True, False])
@pytest.mark.parametrize("grayscale", [True, False])

Although each parameters only contain 2 or 3 different values, but because the exponential nature, the amount of total different inputs quickly grow to: 3 * 2 * 2 * 2 * 2 * 2 = 96. The idea to reduce the total input variation is to sample from these 96 combinations that make sure at least each value on each parameters is chosen at least once. This should be okay since I think we dont really need to test all the 96 combinations. As what we did on #6992 , here are the code replacements for the input parameters:

@pytest.mark.parametrize(
    "fill,severity,mixture_width,chain_depth,all_ops,grayscale",
    [
        (None, 1, 1, -1, True, True),
        (85, 10, 2, 2, False, False),
        ((128, 128, 128), 1, 2, -1, False, True),
        (None, 10, 1, 2, True, False),
        (85, 1, 1, -1, False, False),
        ((128, 128, 128), 10, 2, -1, True, False),
    ],
)

Now we only have 6 variations, which should speedup the test_augmix 16x.

Note that we can increase the number of sample that we want to use, this will depend on the test owner to decide. Also, I think this method is also applicable for all the test that use a large number of varying input parameters.

cc @pmeier @datumbox @osalpekar

datumbox commented 1 year ago

Great job @YosuaMichael.

I think your proposal makes a lot of sense. Separating all the model tests (both models and backbone_utils) from the rest is something worth doing for the reasons you listed above. Additionally this will allow us on the future to consider running the non model tests in parallel, something we can't do at the moment due to memory issues caused by running the models in parallel.

Concerning Augmix, I agree with your proposed strategy. We don't need the combination of all parameters and the strategy of covering all programmatic paths should be OK. Note that the V2 of the API optimized Augmix by 20% so hopefully this will further reduce the execution time.

The only part of the proposal that requires caution is running the models only with 1 python / CUDA configuration. We've recently observed that models can produce different results depending on the version of CUDA. That happened during the deprecation of CUDA 10.2 and the migration to 11.6/11.7. So right now it is possible for the model tests to pass on one version of CUDA and not on another. This doesn't invalidate your proposal, we just need to make sure we cover "usual suspects" with tests. In other words, running different python versions is not necessary but running the tests with different CUDA versions is probably worth it. Still this will allow you to reduce significantly the amount of resources you use.

I think this proposal is worth implementing.

YosuaMichael commented 1 year ago

Hi @datumbox, thanks for the feedback.

Regarding the cuda version, what we have currently is to run only on python 3.8 for every commit while other python version will run only on main and nightly branch, also it only run on CUDA 11.6 as of now (see here and observe that only python3.8 dont have the filters and only cuda 11.6 is used).

And I am thinking of following this on the separation, basically only to run the model test on all python vertsion in main and nightly branch for the CPU version. For the GPU version for the model test, actually the plan is not to separate the model test since currently the setup time for the GPU machine is comparable / longer than the running time for the test itself. Also, since as of now we only run test on cuda 11.6, there is not much gain we will get by separating the model test on GPU.

pytorch / vision

[CI] Separating model and backbone_utils test from the others #7000