Closed ElenaKhaustova closed 1 month ago
datasets that aren't in the catalog and don't match a pattern in the catalog and include MemoryDataset
If a dataset is defined in catalog or as a pattern, the catalog would not return as, with the exception of MemoryDataset. In this case it should be still return because the runtime pattern add a memory dataset. Can you clarify why changing the pattern resolution somehow breaks this?
datasets that aren't in the catalog and don't match a pattern in the catalog and include MemoryDataset
If a dataset is defined in catalog or as a pattern, the catalog would not return as, with the exception of MemoryDataset. In this case it should be still return because the runtime pattern add a memory dataset. Can you clarify why changing the pattern resolution somehow breaks this?
3475
As I tried to explain above after we add runtime
pattern - which matches all datasets it remains in the catalog when the second run. So this part of the condition is not valid - don't match a pattern in the catalog
. Previously there was a different logic when resolution and runtime
patterns were processed separately.
datasets that aren't in the catalog and don't match a pattern in the catalog and include MemoryDataset
If a dataset is defined in catalog or as a pattern, the catalog would not return as, with the exception of MemoryDataset. In this case it should be still return because the runtime pattern add a memory dataset. Can you clarify why changing the pattern resolution somehow breaks this?
3475
if ds in catalog
is True
for all datasets as they match runtime
pattern on the second run
https://github.com/kedro-org/kedro/blob/ba981350ad57dbcaabf5fd758a9a3d4399a91f20/kedro/runner/runner.py#L90
I get this in the second run it will be in the registered_ds
. But the logic here is:
free_output = output - register_ds (excluding if is a memory dataset)
So my expectation here is that, even if it's registered, it will still be return.
Trace:
I run this twice in the test and add some printing statement in runner.py
pipeline.outputs()={'y_test', 'X_test', 'regressor'}
registered_ds=['params:model_options', 'model_input_table']
memory_datasets={'model_input_table', 'params:model_options'}
free_outputs={'y_test', 'X_test', 'regressor'}
pipeline.outputs()={'y_test', 'X_test', 'regressor'}
registered_ds=['X_test', 'params:model_options', 'model_input_table', 'X_train', 'regressor', 'y_test', 'y_train']
memory_datasets={'model_input_table', 'params:model_options'}
free_outputs=set()
This is the result I get with other test, I think the problem is where we define Memory dataset, in 2nd run I expected, y_test, X_test, regressor in the memory_datasets
memory_datasets = {
ds_name
for ds_name, ds in catalog._datasets.items()
if isinstance(ds, MemoryDataset)
}
This is the result I get with other test, I think the problem is where we define Memory dataset, in 2nd run I expected, y_test, X_test, regressor in the memory_datasets
They are not in the catalog because we make
catalog = catalog.shallow_copy(
extra_dataset_patterns=self._extra_dataset_patterns
)
before calling _run()
, so we modify different catalog
object. But runtime
pattern is stored in CatalogConfigResolver
, so it remains the same between the runs.
In the new catalog, it will be as you expect.
Please note that shallow_copy
will be removed from the new catalog, but the mechanism to remove runtime patterns after the run will be helpful anyway to not affect consequent runs.
Moved to draft now as want to double-check some cases that @noklam shared.
For reference I shared my edge cases here: https://github.com/noklam/kedro-runner-bug-investigation/commits/main/
For reference I shared my edge cases here: https://github.com/noklam/kedro-runner-bug-investigation/commits/main/
The case shared above was also not handled by the previous kendo versions. It relates to the issue raised by the user, so an alternative solution was added to fix both.
The issue is that we were processing runtime patterns separately from others and never returning them in the run()
output, even if they were persistent, like in the example above.
In the solution we move the logic of accessing datasets in the catalog after runtime pattern is added, and the logic of accessing MemoryDataset
's after the run, so they appear in the catalog in case they were added with a pattern. The same is done for SharedMemoryDataset
cause that's the default dataset pattern (runtime pattern) for ParallelRunner
.
@noklam
Added test to run node twice and double-checked that it failed without changes made.
Description
After the pattern resolution logic refactoring we process all the patterns (
dataset
,default
,runtime
) together.As a result of
run()
we return:Before the
run()
we add a runtime pattern to the catalog, so we could process all intermediate outputs asMemoryDataset
.So currently when we do two consecutive runs https://github.com/kedro-org/kedro/issues/4235 runtime pattern
{default}
added after the first run affects the next runs, so that all datasets match it and we do not return anything as a result ofrun()
cause we think these datasets are in the catalog.Development notes
We think that the resolution logic is correct and all patterns should be processed together as we do now. To avoid this behaviour we added method to remove runtime patters after the run, so they only live within the run and do not affect other runs.
Developer Certificate of Origin
We need all contributions to comply with the Developer Certificate of Origin (DCO). All commits must be signed off by including a
Signed-off-by
line in the commit message. See our wiki for guidance.If your PR is blocked due to unsigned commits, then you must follow the instructions under "Rebase the branch" on the GitHub Checks page for your PR. This will retroactively add the sign-off to all unsigned commits and allow the DCO check to pass.
Checklist
RELEASE.md
file