Kedro is a toolbox for production-ready data science. It uses software engineering best practices to help you create data engineering and data science pipelines that are reproducible, maintainable, and modular.
`ParallelRunner` lets users run their program with `multiprocessing` by passing one extra argument and writing no extra code. In reality, it is rarely used. Should we continue developing it or lower its priority? We have had many discussions about `Runner`, so this issue is mainly for facilitating and documenting them.
The ecosystem has evolved and libraries are solving multiprocessing in their own way (I think pandas is still lagging behind, but polars has more or less solved it).
`ParallelRunner` solves only a subset of the multiprocessing problem. To solve it realistically, users need finer-grained control, which the current `ParallelRunner` does not offer. For example, a GPU-specific workflow cannot be multi-process: you want the GPU training to happen in one specific process while other processes handle the non-GPU nodes (this sounds a bit like the "group node" deployment problem, but it affects local development too).
3094
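To make the "GPU nodes in one process, everything else in a pool" idea concrete, here is a minimal sketch using only the standard library. It is not how `ParallelRunner` works today; `Node`, `submit_routed`, and the `"gpu"` tag are hypothetical names made up for illustration, and the sketch ignores data dependencies between nodes.

```python
from concurrent.futures import Executor, Future
from dataclasses import dataclass
from typing import Callable, Dict, FrozenSet, Iterable


@dataclass(frozen=True)
class Node:
    """Hypothetical stand-in for a pipeline node (not Kedro's Node class)."""

    name: str
    func: Callable[[], object]
    tags: FrozenSet[str] = frozenset()


def submit_routed(
    nodes: Iterable[Node], cpu_pool: Executor, gpu_pool: Executor
) -> Dict[str, Future]:
    """Submit nodes tagged "gpu" to gpu_pool and all other nodes to cpu_pool."""
    futures: Dict[str, Future] = {}
    for node in nodes:
        # Routing rule (an assumption of this sketch): the tag decides the pool.
        pool = gpu_pool if "gpu" in node.tags else cpu_pool
        futures[node.name] = pool.submit(node.func)
    return futures
```

Passing `ProcessPoolExecutor(max_workers=1)` as `gpu_pool` and a larger `ProcessPoolExecutor` as `cpu_pool` would keep every GPU-tagged node in one dedicated process while the rest fan out. The node functions would then need to be picklable, for example module-level functions.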
On the other hand:
`async` / `CachedDataset` or kedro-accelerator seem to be a more practical way to speed up Kedro. I am not very up to date on async myself, but it may be worth putting more effort into these instead of fixing `ParallelRunner`.
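For context, the caching idea boils down to something like the sketch below: load once, then serve later loads from memory. This is only an illustration of the concept, not Kedro's actual implementation; `CachingLoader` and its methods are hypothetical names.

```python
from typing import Any, Callable


class CachingLoader:
    """Wrap a slow load function and keep its result in memory after the first call."""

    def __init__(self, load_func: Callable[[], Any]) -> None:
        self._load_func = load_func
        self._cached: Any = None
        self._has_value = False  # separate flag so a cached None is still a hit

    def load(self) -> Any:
        if not self._has_value:
            # Only the first load (or the first after release) pays the I/O cost.
            self._cached = self._load_func()
            self._has_value = True
        return self._cached

    def release(self) -> None:
        """Drop the cached value so the next load reads from the source again."""
        self._cached = None
        self._has_value = False
```

The speed-up here comes from skipping repeated I/O rather than from extra processes, which is why it can help in cases where multiprocessing does not.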
Development:
We previously had some discussion about using `async` to rewrite the Runners and simplify the codebase. It is still unclear what extra benefit we would get, since we haven't discussed it in detail.
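As a starting point for that discussion, here is one possible shape of an async-based runner, sketched with plain `asyncio`. This is an assumption about what such a rewrite could look like, not an existing Kedro design: `run_pipeline` and the `nodes` mapping format are hypothetical, and the sketch assumes the graph is acyclic and every input is produced by some node.

```python
import asyncio
from typing import Callable, Dict, List, Tuple

# Hypothetical format: output name -> (names of inputs, function computing the output)
NodeSpec = Tuple[List[str], Callable[..., object]]


async def run_pipeline(nodes: Dict[str, NodeSpec]) -> Dict[str, object]:
    """Run each node as soon as all of its inputs are available."""
    tasks: Dict[str, asyncio.Task] = {}

    async def run_node(name: str) -> object:
        deps, func = nodes[name]
        # Wait for upstream nodes; independent nodes proceed concurrently.
        inputs = await asyncio.gather(*(tasks[dep] for dep in deps))
        # Run the (blocking) node function in a worker thread.
        return await asyncio.to_thread(func, *inputs)

    # All tasks are created before any of them starts running, so every
    # lookup in `tasks` inside run_node succeeds.
    for name in nodes:
        tasks[name] = asyncio.create_task(run_node(name))

    return {name: await task for name, task in tasks.items()}
```

The scheduling logic is just "await your inputs", so there is no explicit ready queue to maintain. Whether that is enough of a simplification to justify a rewrite is exactly the open question.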
I posted a question in Slack recently to see what the community thinks about it: https://linen-slack.kedro.org/t/16663577/do-you-use-kedro-run-runner-parallerunner-to-speed-up-your-p#99abccb0-7970-4a65-8fad-85fd22681beb
kedro-softfail-runner can be installed from PyPI: https://pypi.org/project/kedro-softfail-runner/