gtauzin opened 1 week ago
From @deepyaman on the kedro slack:
This seems quite possible, though, as _list_partitions is cached, so anything that attempts to access it may have hit this code: https://github.com/kedro-org/kedro-plugins/blob/main/kedro-datasets/kedro_datasets/partitions/partitioned_dataset.py#L257-L264 There is an exists check that was introduced in https://github.com/kedro-org/kedro/pull/3332 that is triggered before pipeline run, that can populate this cache. @Merel may have some more familiarity, since she worked on #3332
He also suggested creating a custom PartitionedDataset and removing the caching decorator on _list_partitions as a workaround. I can confirm that this workaround worked for me.
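For context, a minimal sketch of that workaround might look as follows. It assumes a hypothetical subclass name (NonCachingPartitionedDataset) and relies on internal attributes of the current kedro-datasets implementation (_filesystem, _normalized_path, _load_args, _filename_suffix); it also skips the special handling of versioned underlying datasets that the upstream _list_partitions performs, so treat it as a starting point rather than a drop-in replacement:

```python
from kedro_datasets.partitions import PartitionedDataset


class NonCachingPartitionedDataset(PartitionedDataset):
    """PartitionedDataset variant that re-lists partitions on every access.

    The upstream ``_list_partitions`` is wrapped in ``@cachedmethod``, so a
    listing triggered before the run (e.g. by an ``exists()`` check) gets
    reused later. Overriding the method without the decorator forces a fresh
    filesystem scan each time.
    """

    def _list_partitions(self) -> list[str]:
        # Re-scan the filesystem so that partitions written by upstream nodes
        # during the same run are visible to downstream nodes.
        return [
            path
            for path in self._filesystem.find(self._normalized_path, **self._load_args)
            if path.endswith(self._filename_suffix)
        ]
```

The custom class can then be referenced in the catalog by its full import path in place of partitions.PartitionedDataset.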
Description
I have a pipeline in which the first node generates files that are then picked up by a follow-up node using a partitioned dataset. If I run this pipeline with ParallelRunner, the partition file list is created before the whole pipeline is run, making it impossible to find the files created by the prior node.

Context
I have a pipeline with two nodes that is applied per source bucket using namespaces. I would like to have it run in parallel (one process per source).

The way I achieve this with Kedro is:

- Node 1 loads the newly added raw files through an IncrementalDataset, and the concatenated dataframe is saved using a versioned ParquetDataset.
- Node 2 reads a PartitionedDataset that is able to find all preprocessed recorded events computed so far (with the load_args withdirs and max_depth set accordingly).

Node 1 will also return a boolean that node 2 takes as an input so that the resulting DAG has a dependency link from node 1 to node 2.
Steps to Reproduce
Here is what the pipeline code looks like:
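Roughly, a minimal sketch of a pipeline with this shape, assuming hypothetical dataset names (raw_increment, preprocessed_events, increment_done, preprocessed_partitions, all_events) and the concatenate_increment / concatenate_partition node functions mentioned below, could look like:

```python
from kedro.pipeline import Pipeline, node, pipeline

# concatenate_increment and concatenate_partition are the node functions
# mentioned below; they essentially wrap pd.concat.
from .nodes import concatenate_increment, concatenate_partition


def create_pipeline(**kwargs) -> Pipeline:
    template = pipeline(
        [
            # Node 1: concatenate the newly arrived raw files (loaded through
            # the IncrementalDataset) and emit a completion flag.
            node(
                concatenate_increment,
                inputs="raw_increment",
                outputs=["preprocessed_events", "increment_done"],
                name="concatenate_increment",
            ),
            # Node 2: concatenate every preprocessed file found so far (loaded
            # through the PartitionedDataset); the flag input creates the
            # node 1 -> node 2 dependency in the DAG.
            node(
                concatenate_partition,
                inputs=["preprocessed_partitions", "increment_done"],
                outputs="all_events",
                name="concatenate_partition",
            ),
        ]
    )
    # One namespaced instance per source; only "source_1" is shown here.
    return pipeline(template, namespace="source_1")
```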
And the catalog:
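A corresponding catalog sketch, again with hypothetical entry names, paths, and illustrative load_args values:

```yaml
source_1.raw_increment:
  type: partitions.IncrementalDataset
  path: data/01_raw/source_1
  dataset: pandas.ParquetDataset

source_1.preprocessed_events:
  type: pandas.ParquetDataset
  filepath: data/02_intermediate/source_1/events.parquet
  versioned: true

# The boolean flag (increment_done) is left to the default in-memory dataset.
source_1.preprocessed_partitions:
  type: partitions.PartitionedDataset
  path: data/02_intermediate/source_1
  dataset: pandas.ParquetDataset
  load_args:
    withdirs: false
    maxdepth: 2
```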
Putting even a single parquet file for a single source in data/01_raw/source_1 and creating a pipeline from the template pipeline method with the namespace set to source_1 is enough to reproduce the bug. For the sake of clarity, I did not provide the concatenate_increment and concatenate_partition node functions, but I can provide them if needed. They are basically just calling pd.concat.

Expected Result
The pipeline runs successfully, and the results of running it with SequentialRunner or ParallelRunner are identical.

Actual Result
The pipeline runs fine with SequentialRunner, but when run with ParallelRunner, it complains that there are no files in the partition.

Your Environment
- Kedro version used (pip show kedro or kedro -V): 0.19.6
- Python version used (python -V): 3.12