Closed yury-fedotov closed 1 month ago
Haven't read the full thing. Was this working prior to 0.19? In general we recommend ThreadRunner because multiprocessing doesn't work with Spark. The computation doesn't happen locally anyway, so it does not make sense to use multiprocessing.
Would you be able to provide a demo repository that we can run on our side? Something modified from the existing starter would be good enough.
@noklam Hey! Sorry for the late reply.
ThreadRunner has limitations too, e.g. matplotlib does not work with it, since this package is thread-unsafe. That's a limitation in my use case, since the whole point of moving away from SequentialRunner is to parallelize nodes that generate big partitioned datasets of plt.Figures.
ParallelRunner does not work with spark: that I get. So the fact that it's not able to run pipelines involving SparkDataset or SparkHiveDataset is clear. But the problem I described is a bit different: if your catalog has any spark datasets, ParallelRunner cannot be used even in pipelines that have nothing to do with those catalog datasets.
On providing the repo: I'm not sure, unfortunately, that I'll have time for that in the near future, but I will post here if I manage to.
Got it, would this be resolved if datasets were somehow lazily initialised?
Yeah lazy initialization would resolve this. That's my understanding since if I comment out those datasets, it works fine.
Does Kedro support lazy initialization somehow?
Kedro-datasets is lazily imported, but I think that during initialisation the Data Catalog creates an instance of every dataset in the catalog.
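To make the lazy initialisation idea concrete, here is a minimal sketch in plain Python. It is not Kedro's DataCatalog or any existing Kedro API: LazyCatalog, make_spark_like and the dataset names are hypothetical, and the "datasets" are simple stand-in objects. The only point is that a dataset which is never requested is never constructed.

```python
from typing import Any, Callable, Dict, List


class LazyCatalog:
    """Hypothetical catalog that builds each dataset only on first access."""

    def __init__(self, factories: Dict[str, Callable[[], Any]]) -> None:
        self._factories = factories            # name -> zero-argument constructor
        self._instances: Dict[str, Any] = {}   # datasets built so far

    def get(self, name: str) -> Any:
        # Instantiate on demand and cache the result for later calls.
        if name not in self._instances:
            self._instances[name] = self._factories[name]()
        return self._instances[name]

    def materialised(self) -> List[str]:
        # Names of the datasets that have actually been constructed.
        return sorted(self._instances)


def make_spark_like() -> Any:
    # Stand-in for a dataset that cannot be built or used in this run.
    raise RuntimeError("not usable in this run")


catalog = LazyCatalog(
    {
        "spark_table": make_spark_like,
        "profile_input": lambda: {"rows": 3},
    }
)

# Only the requested dataset is constructed; "spark_table" is never touched.
dataset = catalog.get("profile_input")
```

With this shape, a run that only asks for profile_input never triggers the constructor of spark_table, which matches the observation that commenting out the Spark datasets makes the run work.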
This seems to be related to https://github.com/kedro-org/kedro/issues/2829
Indeed, closing this as duplicate of #2829, they are the same problem.
Description
Using ParallelRunner puts some restrictions on datasets involved in the run, as the logs mention.
Having this constraint on datasets that are involved in the pipeline that's executed with ParallelRunner makes total sense.
However, I found out that if any dataset in the catalog doesn't adhere to this, usage of ParallelRunner becomes impossible even for pipelines that have nothing to do with those datasets.
In other words, such a run raises AttributeError: The following data sets cannot be used by multiprocessing...
Context
This error prevents leveraging the amazing advantages that ParallelRunner can bring to large projects in cases where any dataset doesn't adhere to the runner's requirements.
Steps to Reproduce
1. Create a pipeline that uses datasets which do not meet ParallelRunner requirements, but runs fine with SequentialRunner. Let this pipeline have 2 outputs, e.g. pandas dataframes.
2. Create a second pipeline that does profiling of those tables, like df.describe(). There should be a modular pipeline and 2 namespaced pipelines created for the 2 tables respectively.
3. Run the first pipeline with SequentialRunner and produce those 2 outputs.
4. Run the second pipeline with ParallelRunner, since it should be able to process those 2 namespaces in parallel, and see the error raised.
Expected Result
The second pipeline involves no datasets that fail ParallelRunner requirements, and should be executed without errors. The runner should not check requirements for datasets not involved in it.
Actual Result
ParallelRunner raises AttributeError: The following data sets cannot be used by multiprocessing... on datasets not involved in the --pipeline being run.
Your Environment
Kedro version used (pip show kedro or kedro -V): 0.19.3
Python version used (python -V): 3.10
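The difference between the actual and the expected result in this report can be sketched without Kedro. The snippet below is a simplified model, not the runner's real validation code: plain dicts stand in for dataset objects, a threading.Lock stands in for a process-bound resource such as a Spark session, and unserialisable, the catalog entries and the file paths are hypothetical names. It compares checking the whole catalog against checking only the datasets the selected pipeline uses.

```python
import pickle
import threading
from typing import Any, Dict, Iterable, List

# Hypothetical stand-ins: plain dicts play the role of dataset objects.
catalog: Dict[str, Any] = {
    "spark_table": {"session": threading.Lock()},  # holds an unpicklable resource
    "table_a": {"filepath": "data/a.csv"},
    "table_b": {"filepath": "data/b.csv"},
}


def unserialisable(catalog: Dict[str, Any], names: Iterable[str]) -> List[str]:
    """Return the names, among those given, whose dataset cannot be pickled."""
    bad = []
    for name in names:
        try:
            pickle.dumps(catalog[name])
        except (TypeError, AttributeError, pickle.PicklingError):
            bad.append(name)
    return sorted(bad)


# Datasets that the profiling pipeline actually reads or writes (assumed).
pipeline_datasets = {"table_a", "table_b"}

# Checking every catalog entry flags the unrelated Spark-like dataset.
whole_catalog_bad = unserialisable(catalog, catalog)

# Checking only the pipeline's own datasets finds nothing to complain about.
scoped_bad = unserialisable(catalog, pipeline_datasets)
```

In this model the whole-catalog check reports spark_table even though the pipeline never touches it, while the scoped check passes, which is the behaviour asked for under Expected Result.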