kedro-org / kedro

Kedro is a toolbox for production-ready data science. It uses software engineering best practices to help you create data engineering and data science pipelines that are reproducible, maintainable, and modular.
https://kedro.org
Apache License 2.0
9.85k stars 893 forks

ParallelRunner hangs on Linux Server #4176

Open Dekermanjian opened 3 days ago

Dekermanjian commented 3 days ago

Description

I have a pipeline that I would like to run using the ParallelRunner. When I run it on my local Windows machine, it works fine. However, when I run the exact same pipeline on a Linux server (Rocky Linux), it hangs at the dataset-loading stage.

noklam commented 3 days ago

Can you provide some more context? If possible, please share a simplified version of the repository so that we can try to reproduce the issue locally.

Dekermanjian commented 3 days ago

@noklam Yeah, of course. Let me try to put together something simple that will hang on the server and then I'll share the repo with you.

Dekermanjian commented 2 days ago

@noklam Okay, I figured out why it is not working. I just don't understand why it doesn't work on Linux but it does on Windows. Here is a simple example: https://github.com/Dekermanjian/test-parallel-runner

The reason it is not working on the Linux server is that I am loading a Parquet file in my settings.py file. When that file is loaded in the simple example, the ParallelRunner hangs at the dataset-loading stage. If you comment that line out (line 6), it works. You can generate the data by running the notebook I created.

Sorry, let me add the command to run: kedro run --runner=ParallelRunner -p data_processing
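
One platform difference that may be worth checking, offered as an assumption rather than a confirmed cause of this hang: ParallelRunner runs nodes in separate processes, and Python's multiprocessing starts child processes differently per platform. Windows uses "spawn", while Linux has historically defaulted to "fork", so anything loaded at import time (such as a file read in settings.py) can behave differently in the child processes. The sketch below only reports what the current interpreter uses; the function name describe_start_method is hypothetical and not part of Kedro.

```python
import multiprocessing as mp
import platform


def describe_start_method() -> dict:
    """Report the OS and how this Python interpreter starts child processes.

    Hypothetical diagnostic helper, not part of Kedro.
    """
    return {
        "system": platform.system(),
        # Start method multiprocessing uses when none is set explicitly.
        "default_start_method": mp.get_start_method(),
        # Every start method supported on this platform.
        "available_start_methods": mp.get_all_start_methods(),
    }


if __name__ == "__main__":
    print(describe_start_method())
```

Running this on both the Windows machine and the Linux server would show whether the two environments start worker processes differently, which could help narrow down where to look next.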