kedro-org / kedro

Kedro is a toolbox for production-ready data science. It uses software engineering best practices to help you create data engineering and data science pipelines that are reproducible, maintainable, and modular.
https://kedro.org
Apache License 2.0
10.03k stars 906 forks source link

Add method to remove runtime patterns after run #4236

Closed ElenaKhaustova closed 1 month ago

ElenaKhaustova commented 1 month ago

Description

After the pattern resolution logic refactoring we process all the patterns (dataset, default, runtime) together.

As a result of run() we return:

  datasets that aren't in the catalog and don't match a pattern in the catalog and include MemoryDataset

https://github.com/kedro-org/kedro/blob/ba981350ad57dbcaabf5fd758a9a3d4399a91f20/kedro/runner/runner.py#L108

Before the run() we add a runtime pattern to the catalog, so we could process all intermediate outputs as MemoryDataset.

So currently when we do two consecutive runs https://github.com/kedro-org/kedro/issues/4235 runtime pattern {default} added after the first run affects the next runs, so that all datasets match it and we do not return anything as a result of run() cause we think these datasets are in the catalog.

Development notes

We think that the resolution logic is correct and all patterns should be processed together as we do now. To avoid this behaviour we added method to remove runtime patters after the run, so they only live within the run and do not affect other runs.

Developer Certificate of Origin

We need all contributions to comply with the Developer Certificate of Origin (DCO). All commits must be signed off by including a Signed-off-by line in the commit message. See our wiki for guidance.

If your PR is blocked due to unsigned commits, then you must follow the instructions under "Rebase the branch" on the GitHub Checks page for your PR. This will retroactively add the sign-off to all unsigned commits and allow the DCO check to pass.

Checklist

noklam commented 1 month ago

datasets that aren't in the catalog and don't match a pattern in the catalog and include MemoryDataset

If a dataset is defined in catalog or as a pattern, the catalog would not return as, with the exception of MemoryDataset. In this case it should be still return because the runtime pattern add a memory dataset. Can you clarify why changing the pattern resolution somehow breaks this?

https://github.com/kedro-org/kedro/pull/3475

ElenaKhaustova commented 1 month ago

datasets that aren't in the catalog and don't match a pattern in the catalog and include MemoryDataset

If a dataset is defined in catalog or as a pattern, the catalog would not return as, with the exception of MemoryDataset. In this case it should be still return because the runtime pattern add a memory dataset. Can you clarify why changing the pattern resolution somehow breaks this?

3475

As I tried to explain above after we add runtime pattern - which matches all datasets it remains in the catalog when the second run. So this part of the condition is not valid - don't match a pattern in the catalog. Previously there was a different logic when resolution and runtime patterns were processed separately.

ElenaKhaustova commented 1 month ago

datasets that aren't in the catalog and don't match a pattern in the catalog and include MemoryDataset

If a dataset is defined in catalog or as a pattern, the catalog would not return as, with the exception of MemoryDataset. In this case it should be still return because the runtime pattern add a memory dataset. Can you clarify why changing the pattern resolution somehow breaks this?

3475

if ds in catalog is True for all datasets as they match runtime pattern on the second run https://github.com/kedro-org/kedro/blob/ba981350ad57dbcaabf5fd758a9a3d4399a91f20/kedro/runner/runner.py#L90

noklam commented 1 month ago

I get this in the second run it will be in the registered_ds. But the logic here is: free_output = output - register_ds (excluding if is a memory dataset)

So my expectation here is that, even if it's registered, it will still be return.

https://github.com/kedro-org/kedro/blob/ba981350ad57dbcaabf5fd758a9a3d4399a91f20/kedro/runner/runner.py#L110

Trace: I run this twice in the test and add some printing statement in runner.py

pipeline.outputs()={'y_test', 'X_test', 'regressor'}
registered_ds=['params:model_options', 'model_input_table']
memory_datasets={'model_input_table', 'params:model_options'}
free_outputs={'y_test', 'X_test', 'regressor'}

pipeline.outputs()={'y_test', 'X_test', 'regressor'}
registered_ds=['X_test', 'params:model_options', 'model_input_table', 'X_train', 'regressor', 'y_test', 'y_train']
memory_datasets={'model_input_table', 'params:model_options'}
free_outputs=set()

This is the result I get with other test, I think the problem is where we define Memory dataset, in 2nd run I expected, y_test, X_test, regressor in the memory_datasets

Identify MemoryDataset in the catalog

    memory_datasets = {
        ds_name
        for ds_name, ds in catalog._datasets.items()
        if isinstance(ds, MemoryDataset)
    }
ElenaKhaustova commented 1 month ago

This is the result I get with other test, I think the problem is where we define Memory dataset, in 2nd run I expected, y_test, X_test, regressor in the memory_datasets

They are not in the catalog because we make

        catalog = catalog.shallow_copy(
            extra_dataset_patterns=self._extra_dataset_patterns
        )

before calling _run(), so we modify different catalog object. But runtime pattern is stored in CatalogConfigResolver, so it remains the same between the runs.

In the new catalog, it will be as you expect.

ElenaKhaustova commented 1 month ago

Please note that shallow_copy will be removed from the new catalog, but the mechanism to remove runtime patterns after the run will be helpful anyway to not affect consequent runs.

ElenaKhaustova commented 1 month ago

Moved to draft now as want to double-check some cases that @noklam shared.

noklam commented 1 month ago

For reference I shared my edge cases here: https://github.com/noklam/kedro-runner-bug-investigation/commits/main/

ElenaKhaustova commented 1 month ago

For reference I shared my edge cases here: https://github.com/noklam/kedro-runner-bug-investigation/commits/main/

The case shared above was also not handled by the previous kendo versions. It relates to the issue raised by the user, so an alternative solution was added to fix both.

The issue is that we were processing runtime patterns separately from others and never returning them in the run() output, even if they were persistent, like in the example above.

In the solution we move the logic of accessing datasets in the catalog after runtime pattern is added, and the logic of accessing MemoryDataset's after the run, so they appear in the catalog in case they were added with a pattern. The same is done for SharedMemoryDataset cause that's the default dataset pattern (runtime pattern) for ParallelRunner.

ElenaKhaustova commented 1 month ago

@noklam

Added test to run node twice and double-checked that it failed without changes made.