kedro-org / kedro

Kedro is a toolbox for production-ready data science. It uses software engineering best practices to help you create data engineering and data science pipelines that are reproducible, maintainable, and modular.
https://kedro.org
Apache License 2.0

`ParallelRunner` raises `AttributeError: The following data sets cannot be used by multiprocessing...` on datasets not involved in `--pipeline` being run #3804

Closed · yury-fedotov closed this 1 month ago

yury-fedotov commented 3 months ago

Description

Using ParallelRunner puts some restrictions on datasets involved in the run, as the logs mention:

In order to utilize multiprocessing you need to make sure all data sets are serialisable, i.e. data sets should not make use of lambda functions, nested functions, closures etc.
If you are using custom decorators ensure they are correctly decorated using functools.wraps().

Having this constraint on datasets that are involved in the pipeline that's executed with ParallelRunner makes total sense.
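As a minimal illustration of why this constraint exists (hypothetical stand-in classes, not Kedro code): `ParallelRunner` uses multiprocessing, which must pickle datasets to send them to worker processes, and objects holding lambdas or closures cannot be pickled.

```python
import pickle

# Hypothetical stand-ins, not real Kedro dataset classes.
class LambdaBackedDataset:
    """Holds a lambda, so it cannot be pickled for multiprocessing."""
    def __init__(self):
        self._loader = lambda: [1, 2, 3]

class PlainDataset:
    """No lambdas, closures, or nested functions, so it pickles fine."""
    def load(self):
        return [1, 2, 3]

def is_serialisable(obj) -> bool:
    """Mimic the kind of check a multiprocessing runner needs to make."""
    try:
        pickle.dumps(obj)
        return True
    except (pickle.PicklingError, AttributeError, TypeError):
        return False

print(is_serialisable(PlainDataset()))         # True
print(is_serialisable(LambdaBackedDataset()))  # False
```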

However, I found out that if any dataset in the catalog doesn't adhere to this, usage of ParallelRunner becomes impossible even for pipelines that have nothing to do with those datasets.

In other words, the following raises AttributeError: The following data sets cannot be used by multiprocessing...:

kedro run --pipeline pipeline_that_doesnt_involve_problematic_datasets --runner=ParallelRunner

Context

This error prevents large projects from leveraging the advantages that ParallelRunner can bring whenever any dataset in the catalog does not adhere to the runner's requirements.

Steps to Reproduce

  1. Create a pipeline that uses datasets not adhering to ParallelRunner requirements but that runs fine with SequentialRunner. Let this pipeline have 2 outputs, e.g. pandas dataframes.
  2. Create a second pipeline that does some profiling of those tables, e.g. df.describe(). There should be one modular pipeline and 2 namespaced pipelines created for the 2 tables respectively.
  3. Run the first pipeline with SequentialRunner and produce those 2 outputs.
  4. Try running the second pipeline with ParallelRunner, since it should be able to process those 2 namespaces in parallel, and see error raised.

Expected Result

The second pipeline involves only datasets that adhere to ParallelRunner requirements, so it should execute without errors. The runner should not check requirements for datasets that are not involved in the pipeline being run.
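The expected scoping can be sketched as follows (hypothetical helper, not Kedro's actual validation code): check serialisability only for the datasets that the selected pipeline actually uses, rather than for the whole catalog.

```python
import pickle

# Hypothetical sketch of pipeline-scoped validation, not Kedro's real code.
def unserialisable_datasets(catalog: dict, pipeline_datasets: set) -> list:
    """Return names of datasets used by the pipeline that cannot be pickled."""
    bad = []
    for name, dataset in catalog.items():
        if name not in pipeline_datasets:
            continue  # skip datasets the selected pipeline never touches
        try:
            pickle.dumps(dataset)
        except Exception:
            bad.append(name)
    return bad

catalog = {
    "spark_table": lambda: None,   # stand-in for a non-picklable dataset
    "pandas_table": {"rows": 3},   # stand-in for a picklable dataset
}
print(unserialisable_datasets(catalog, {"pandas_table"}))  # []
print(unserialisable_datasets(catalog, {"spark_table"}))   # ['spark_table']
```

With this scoping, a run that only touches `pandas_table` would pass validation even though `spark_table` sits in the same catalog.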

Actual Result

ParallelRunner raises AttributeError: The following data sets cannot be used by multiprocessing... on datasets not involved in --pipeline being run

Your Environment

noklam commented 3 months ago

Haven't read the full thing. Was this working prior to 0.19? In general we recommend ThreadRunner because multiprocessing doesn't work with Spark. The computation doesn't happen locally anyway, so it does not make sense to use multiprocessing.

Would you be able to provide a demo repository that we can run on our side? Something modified from the existing starter would be good enough.

yury-fedotov commented 2 months ago

@noklam Hey! Sorry for late reply.

  1. I haven't tested in < 0.19 tbh.
  2. ThreadRunner has limitations too - e.g. matplotlib does not work with it, since this package is thread-unsafe. That's a limitation in my use case since the whole point of moving away from SequentialRunner is to parallelize nodes that generate big partitioned datasets of plt.Figures.
  3. ParallelRunner does not work with Spark, that I get. So the fact that it's not able to run pipelines involving SparkDataset or SparkHiveDataset is clear. But the problem I described is a bit different: if your catalog has any Spark datasets, ParallelRunner cannot be used even for pipelines that have nothing to do with those catalog datasets.

On providing the repo - I'm not sure unfortunately I'll have time for that in the near future, but will post here if I manage to.

noklam commented 2 months ago

Got it, would this be resolved if the dataset were somehow lazily initialised?

yury-fedotov commented 2 months ago

> Got it, would this be resolved if the dataset were somehow lazily initialised?

Yeah, lazy initialization would resolve this. That's my understanding, since if I comment out those datasets, it works fine.

Does Kedro support lazy initialization somehow?

noklam commented 2 months ago

Kedro-datasets is lazily imported, but I think during initialisation the DataCatalog creates instances for the entire catalog.
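The lazy-initialisation idea discussed above can be sketched as follows (hypothetical `LazyCatalog`, not Kedro's DataCatalog): store dataset factories and materialise a dataset only on first access, so datasets a run never touches are never instantiated, and therefore never need to pass the serialisability check.

```python
# Hypothetical sketch of lazy dataset initialisation, not Kedro's DataCatalog.
class LazyCatalog:
    def __init__(self, factories):
        self._factories = factories  # name -> zero-arg factory callable
        self._instances = {}         # materialised datasets, by name

    def get(self, name):
        """Instantiate the dataset on first access, then cache it."""
        if name not in self._instances:
            self._instances[name] = self._factories[name]()
        return self._instances[name]

created = []

def make(name):
    def factory():
        created.append(name)  # record which datasets were instantiated
        return {"name": name}
    return factory

catalog = LazyCatalog({"a": make("a"), "b": make("b")})
catalog.get("a")
print(created)  # ['a'] -- "b" was never instantiated
```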

merelcht commented 1 month ago

This seems to be related to https://github.com/kedro-org/kedro/issues/2829

astrojuanlu commented 1 month ago

Indeed, closing this as a duplicate of #2829; they are the same problem.