kedro-org / kedro

Kedro is a toolbox for production-ready data science. It uses software engineering best practices to help you create data engineering and data science pipelines that are reproducible, maintainable, and modular.
https://kedro.org
Apache License 2.0
10.03k stars 906 forks source link

Rewrite starter tests + update docs #3954

Open yury-fedotov opened 5 months ago

yury-fedotov commented 5 months ago

Description

This:

@pytest.fixture
def project_context(config_loader):
    return KedroContext(
        package_name="my_package_name",
        project_path=Path.cwd(),
        config_loader=config_loader,
        hook_manager=_create_hook_manager(),
    )

Fails the only unit test because a KedroContext requires env argument to be supplied.

Context

I resolved it for my project by specifying env=None there, but it should work out of the box from the starter.

Steps to Reproduce

  1. Initialize kedro new with all defaults except pyspark
  2. pip install -e .
  3. pytest

Your Environment

ankatiyar commented 5 months ago

Thanks @yury-fedotov for reporting this, I'll add it to the backlog since the solution in #3958 is not ideal.

Context

Changes in 0.19 PR #3300 changed the ordering of the arguments to KedroContext and removed the default value for env. The example test in the starters <project_root>/tests/test_run.py is failing as a result.

I tried adding a default value to env which requires reordering the arguments since arguments with default values can't precede arguments without default values in #3958 but this is a breaking change. Adding default values to all the subsequent arguments also doesn't make sense. (@noklam's comment https://github.com/kedro-org/kedro/pull/3958#issuecomment-2182313144)

The fix probably needs to be in the starters for now.

My suggestion is not to fix anything in kedro, but remove the test. In fact, I don't think the KedroContext has any use in tests. In a standard kedro run, we should create the session and let the session to create context, runner, config_loader etc. However, in our test template, we created the config loader and runner manually. In addition, the context actually use the real hooks, but the tests that created run with no hook, so they are not consistent at all.

In our own testing docs, KedroContext is not mentioned at all.

yury-fedotov commented 5 months ago

Thanks @ankatiyar !

I agree that the change should be in the starter, not in the framework.

However I'm curious why not keep the test and just specify env=None (as it wants) while initializing a context within this test?

In other words, change the code of the test fixture rather than the constructor of the context.

That would allow to keep the test and close this specific issue with minimal changes.

noklam commented 5 months ago

@yury-fedotov I think the change should go to starter, and I don't like the current test example. I just found another problem https://github.com/kedro-org/kedro/issues/3966 with the current test.

My point is I prefer fixing the test example altogether rather than just the specific KedroContext one.

@pytest.fixture
def config_loader():
    return OmegaConfigLoader(conf_source=str(Path.cwd()))

@pytest.fixture
def project_context(config_loader):
    return KedroContext(
        package_name="{{ cookiecutter.python_package }}",
        project_path=Path.cwd(),
        config_loader=config_loader,
        hook_manag

I find this weird, because usually I expect two common way to do this:

  1. Start from the highest level KedroSession
with KedroSession.create(...) as session:
   context = session.load_context()
   config_loader = context.config_loader
   ...

This way you have the KedroContext, ConfigLoader created for you, and it would respect to your settings.py in case you customised the project settings. This would also respect hooks (which can be useful, but you may also not want hook depending on the test you are writing, i.e. you don't want to log Mlflow for unit test)

  1. Create invidivual Component directly, i.e.
    @pytest.fixture
    def config_loader():
    return OmegaConfigLoader(conf_source="conf",base_env="base", default_run_env="local")

In this case, settings.py is not respected and this is more lower level. It could be useful for unit testing if you need fine grain control. In any case, I found the KedroContext fixture not useful at all. The current example create the KedroContext but then the ConfigLoader is not created from KedroContext, what's the point of having it in a user facing codebase? Do you have test relying on the KedroContext fixture?

ElenaKhaustova commented 4 months ago

@noklam shall we create an issue on the starters repo to address these two issues together?

merelcht commented 3 months ago

Use these docs as inspiration for the new tests: https://docs.kedro.org/en/stable/tutorial/test_a_project.html

Update these docs to correspond to the test changes: https://docs.kedro.org/en/stable/development/automated_testing.html#create-an-example-test

merelcht commented 3 months ago

Some extra context: https://linen-slack.kedro.org/t/22301026/question-is-there-an-extra-step-required-to-run-pytest-with-#661646b5-a2e7-4b77-a3db-ab585f408793. The tests should ideally make use of the project settings and not hard-code too much, so they also show users how the configuration loading works. Long story short: not too much mocking and no hard-coding.

yury-fedotov commented 3 weeks ago

@merelcht I just started a new project with Kedro 0.19.9, seems like the issue is still there. I.e. pytest . on a newly initialized project fails because of this.

Another user reporting this: https://github.com/kedro-org/kedro-starters/issues/221

merelcht commented 3 weeks ago

Thanks @yury-fedotov, I noticed the other issue on starter as well. I have now marked this to be done in our next sprint.