kedro-org / kedro

Kedro is a toolbox for production-ready data science. It uses software engineering best practices to help you create data engineering and data science pipelines that are reproducible, maintainable, and modular.
https://kedro.org
Apache License 2.0
9.53k stars 879 forks

Rethink how Kedro can play a role in multiprocessing / performance boost #3713

Open noklam opened 4 months ago

noklam commented 4 months ago

Description

ParallelRunner lets users run their pipeline with multiprocessing by passing one extra argument, with no extra code. In reality, it is rarely used. Should we continue developing it, or lower its priority? We have had many discussions about runners, so this issue is mainly for facilitating and documenting them.
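For context, here is a rough sketch of the idea behind a parallel runner: nodes whose inputs are already available are handed to a process pool at the same time. This is not Kedro's implementation. The toy pipeline, `layers` and `run` are all hypothetical names made up for illustration, and only the standard library is used.

```python
from concurrent.futures import Executor, ProcessPoolExecutor


# Hypothetical toy nodes; defined at module level so they can be pickled
# and sent to worker processes.
def load_a():
    return 2


def load_b():
    return 3


def add(a, b):
    return a + b


# Hypothetical pipeline: node name -> (function, names of upstream nodes).
PIPELINE = {
    "a": (load_a, []),
    "b": (load_b, []),
    "sum": (add, ["a", "b"]),
}


def layers(pipeline):
    """Group nodes into waves; nodes within one wave do not depend on each other."""
    done, waves = set(), []
    while len(done) < len(pipeline):
        ready = sorted(
            name
            for name, (_, deps) in pipeline.items()
            if name not in done and all(dep in done for dep in deps)
        )
        if not ready:
            raise ValueError("pipeline has a cycle or a missing dependency")
        waves.append(ready)
        done.update(ready)
    return waves


def run(pipeline, executor: Executor):
    """Run each wave concurrently on the executor, feeding in upstream results."""
    results = {}
    for wave in layers(pipeline):
        futures = {}
        for name in wave:
            func, deps = pipeline[name]
            futures[name] = executor.submit(func, *(results[dep] for dep in deps))
        # Wait for the whole wave before scheduling the next one.
        for name, future in futures.items():
            results[name] = future.result()
    return results


if __name__ == "__main__":
    with ProcessPoolExecutor(max_workers=2) as pool:
        print(run(PIPELINE, pool))
```

Here `a` and `b` run in separate processes and `sum` runs once both have finished. The sketch also hints at the usual costs of this approach: every node function and every intermediate result has to be picklable to cross the process boundary.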

I recently posted a question in Slack to see what the community thinks about it: https://linen-slack.kedro.org/t/16663577/do-you-use-kedro-run-runner-parallerunner-to-speed-up-your-p#99abccb0-7970-4a65-8fad-85fd22681beb

The ecosystem has evolved, and libraries are now solving multiprocessing in their own way (I think pandas is still lagging behind, but polars has largely solved it).

On the other hand:

Development: