kedro-org / kedro

Kedro is a toolbox for production-ready data science. It uses software engineering best practices to help you create data engineering and data science pipelines that are reproducible, maintainable, and modular.
https://kedro.org
Apache License 2.0
9.89k stars 897 forks source link

Refactor tests for the new project creation flow in `test_starters.py` #3594

Open merelcht opened 8 months ago

merelcht commented 8 months ago

Description

The test_starters.py file has grown too much. It contains a huge amount of tests and when adding a new test it's not very clear where it should go. There's also inconsistency in how the test classes are divided. Some test valid and invalid only scenarios others contain a mix.

Possible Implementation

This a suggestion of how the test_starters.py could be refactored:

This PR shows this break down of the test_starters.py tests: https://github.com/kedro-org/kedro/pull/3550

Problems

❗ Currently this setup is not working because depending on how the tests are run (individually vs as part of the full test suite) there's an issue with the .cookiecutter cache.

``` E Traceback (most recent call last): E File "/opt/hostedtoolcache/Python/3.11.7/x64/lib/python3.11/shutil.py", line 664, in _rmtree_safe_fd E os.rmdir(entry.name, dir_fd=topfd) E OSError: [Errno 39] Directory not empty: 'pack' E E During handling of the above exception, another exception occurred: E E Traceback (most recent call last): E File "/home/runner/work/kedro/kedro/kedro/framework/cli/starters.py", line 895, in _create_project E result_path = cookiecutter(template=template_path, **cookiecutter_args) E ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ E File "/opt/hostedtoolcache/Python/3.11.7/x64/lib/python3.11/site-packages/cookiecutter/main.py", line 80, in cookiecutter E base_repo_dir, cleanup_base_repo_dir = determine_repo_dir( E ^^^^^^^^^^^^^^^^^^^ E File "/opt/hostedtoolcache/Python/3.11.7/x64/lib/python3.11/site-packages/cookiecutter/repository.py", line 107, in determine_repo_dir E cloned_repo = clone( E ^^^^^^ E File "/opt/hostedtoolcache/Python/3.11.7/x64/lib/python3.11/site-packages/cookiecutter/vcs.py", line 97, in clone E clone = prompt_and_delete(repo_dir, no_input=no_input) E ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ E File "/opt/hostedtoolcache/Python/3.11.7/x64/lib/python3.11/site-packages/cookiecutter/utils.py", line 95, in prompt_and_delete E rmtree(path) E File "/opt/hostedtoolcache/Python/3.11.7/x64/lib/python3.11/site-packages/cookiecutter/utils.py", line 33, in rmtree E shutil.rmtree(path, onerror=force_delete) E File "/opt/hostedtoolcache/Python/3.11.7/x64/lib/python3.11/shutil.py", line 732, in rmtree E _rmtree_safe_fd(fd, path, onerror) E File "/opt/hostedtoolcache/Python/3.11.7/x64/lib/python3.11/shutil.py", line 660, in _rmtree_safe_fd E _rmtree_safe_fd(dirfd, fullname, onerror) E File "/opt/hostedtoolcache/Python/3.11.7/x64/lib/python3.11/shutil.py", line 660, in _rmtree_safe_fd E _rmtree_safe_fd(dirfd, fullname, onerror) E File "/opt/hostedtoolcache/Python/3.11.7/x64/lib/python3.11/shutil.py", line 666, in _rmtree_safe_fd E onerror(os.rmdir, fullname, sys.exc_info()) E File "/opt/hostedtoolcache/Python/3.11.7/x64/lib/python3.11/site-packages/cookiecutter/utils.py", line 25, in force_delete E func(path) E OSError: [Errno 39] Directory not empty: '/home/runner/.cookiecutters/kedro-starters/.git/objects/pack' E ```

I've seen this locally several times as well.

Idea on how to fix this

The original starter tests added before any of the tools work was done mocked all the git interactions. Maybe the tools tests should be setup in a different way. We already have e2e tests to test the full flow. I'm thinking maybe these shouldn't be interacting with the full on starter and git interaction but just unit test the inputs.

Possible Alternatives

Move all tests that need starters to the kedro-starters repo.

noklam commented 8 months ago

Copying some comments to here:

The original starter tests added before any of the tools work was done mocked all the git interactions. Maybe the tools tests should be setup in a different way. We already have e2e tests to test the full flow. I'm thinking maybe these shouldn't be interacting with the full on starter and git interaction but just unit test the inputs.

I agree most of it in principles, the tests that we have currently is slightly duplicate and arguably not a real "unit test". In addition, the tests runs way too slow, because it requires network travel to fetch the actual starters, which is why the original tests mock git and fetch the local one instead. This comes with additional cons because you cannot run the tests without internet (seems violating what unit tests should be).

I expected there will still be some coupling because the post_gen_hook.py logic is defined in starters side. I think this is the main problem and there is something that we can do about it.

See this post_gen_hook in spaceflights-pyspark and post_gen_hook in spaceflights-pysparks-viz. They are both executing the exact same main function, I am pretty sure this is redundant because at this stage the correct starter has already been selected, so for sure viz and pyspark are selected if spaceflights-pyspark-viz is cloned.

If we can unified this logic then maybe we can just keep one single mock_repo in kedro instead of kedro-starters, this will likely make development much easier too. Bonus point: I wonder if the main logic can moved into kedro