kedro-org / kedro

Kedro is a toolbox for production-ready data science. It uses software engineering best practices to help you create data engineering and data science pipelines that are reproducible, maintainable, and modular.
https://kedro.org
Apache License 2.0

Ability to set a custom runner globally #3255

Open jptissot opened 10 months ago

jptissot commented 10 months ago

Description

I have built a custom runner and configuration loader. It would be useful to be able to specify the default runner to avoid having to set it on every run.

Context

We have multiple kedro projects that share code and rely on a custom runner that customizes the dataset loading behavior for nodes. Our pipelines simply won't work with any other runner.
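
A hypothetical sketch of what the requested feature could look like: a default runner declared once in settings.py and picked up by every run. Note that `DEFAULT_RUNNER_CLASS` and `DEFAULT_RUNNER_ARGS` are not existing Kedro settings; the names are illustrative only.

```python
# settings.py (hypothetical sketch of the requested feature).
# DEFAULT_RUNNER_CLASS / DEFAULT_RUNNER_ARGS do NOT exist in Kedro today.

class CustomRunner:
    """Stand-in for the project's custom runner class."""
    def __init__(self, eager_load=False):
        self.eager_load = eager_load

DEFAULT_RUNNER_CLASS = CustomRunner
DEFAULT_RUNNER_ARGS = {"eager_load": False}

# A session could then build the default runner with one line:
runner = DEFAULT_RUNNER_CLASS(**DEFAULT_RUNNER_ARGS)
```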

noklam commented 10 months ago

Can you explain what this custom runner does? Do you mean setting the default runner in settings.py?

It would be great if you could share how you would like it to work.

jptissot commented 10 months ago

We are using Spark, DeltaTables and Azure Blob storage. I have been using Kedro for a while and am now experimenting with passing the Dataset itself instead of the loaded data, to be able to customize the behavior inside of nodes. Basically, instead of calling dataset.load() and passing the result to the nodes, I pass the dataset itself and perform actions inside the nodes. I also have custom versions of the SparkDataset and DeltaDataset.
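
The pattern described above (handing nodes the dataset object rather than its loaded contents) can be sketched in plain Python, independent of Kedro's actual dataset classes. `LazyDataset` here is a made-up stand-in, not a Kedro API:

```python
# Minimal sketch: nodes receive the dataset object itself rather than
# pre-loaded data, so each node decides when and how to load.
class LazyDataset:
    """Illustrative stand-in for a Kedro dataset with cached loading."""
    def __init__(self, loader):
        self._loader = loader
        self._cache = None

    def load(self):
        if self._cache is None:          # load at most once
            self._cache = self._loader()
        return self._cache

def summarise(dataset):
    """A node that controls loading itself instead of receiving raw data."""
    data = dataset.load()
    return sum(data) / len(data)

ds = LazyDataset(lambda: [1, 2, 3, 4])
result = summarise(ds)  # 2.5
```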

The fact that most "save" actions are done on the Delta table and that "read/load" actions are done on the SparkDataset actually works in our favor in this case. The pipeline that manages data for Dataset A gets the DeltaDataset A as an input and outputs the SparkDataset A. The pipelines/nodes that consume the data just use the SparkDataset A as an input. Note that sometimes SparkDataset A is not processed, but nodes that depend on it can still be run; in that case, the data should be loaded from blob storage. SparkDataset is a read-only dataset (with caching) in our case.

We also have custom behavior that will instantiate an empty dataset if the table does not exist and a Spark schema is specified. This way, it is possible to run some nodes that depend on data that was never processed, but still produce meaningful outputs.
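
A minimal sketch of that fallback behaviour, with plain Python standing in for the Spark/Delta types. `load_or_empty` and the dict-of-columns "frame" are illustrative only, not Kedro or Spark API:

```python
# Sketch: if the underlying table does not exist but a schema was supplied,
# return an empty, correctly-shaped frame instead of failing.
def load_or_empty(table_exists, load_fn, schema=None):
    """Load the table, or build an empty dict-of-columns from the schema."""
    if table_exists:
        return load_fn()
    if schema is None:
        raise FileNotFoundError("table missing and no schema provided")
    return {column: [] for column in schema}

# Table missing, schema known: downstream nodes get an empty typed frame.
empty = load_or_empty(False, load_fn=None, schema=["id", "amount"])
```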

Note that our usage of kedro is advanced. We basically have multiple "integrations" that get data from APIs or files and land that data in the blob store. This data is then processed in a kedro project (one project per integration) using Spark to produce cleaned outputs. These outputs are then leveraged in other kedro projects.

takikadiri commented 10 months ago

We have the same issue, which we currently tackle by instantiating our custom runner in a custom CLI that we inject through the plugin framework.

In settings.py, we set the runner class and potentially the runner args. In cli.py, we import the project settings and instantiate the runner object using the runner class and runner args. The custom runner object will then be used by the KedroSession.

I don't know if there is another way of doing that, but it is indeed a real need for some advanced usage.
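
The cli.py side of this workaround can be sketched as follows. The dotted-path convention mirrors how Kedro's `--runner` option accepts a class path, but the function and settings names here are assumptions; only the importlib-based class resolution is standard Python:

```python
# cli.py (sketch of the workaround described above): resolve a dotted class
# path from project settings and instantiate it with the configured kwargs.
import importlib

def instantiate_runner(runner_class_path, runner_args):
    """Resolve 'package.module.ClassName' and call it with runner_args."""
    module_path, _, class_name = runner_class_path.rpartition(".")
    runner_cls = getattr(importlib.import_module(module_path), class_name)
    return runner_cls(**runner_args)

# With a stdlib class standing in for a custom runner:
runner = instantiate_runner("collections.OrderedDict", {})
```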

There is another issue that prevents using some custom runners natively in Kedro: the Kedro CLI doesn't support runner class arguments. The --runner option takes only the class, and the CLI then instantiates it without any way to inject init arguments.

jptissot commented 10 months ago

In settings.py, we set the runner class and potentially the runner args. In cli.py, we import the project settings and instantiate the runner object using the runner class and runner args. The custom runner object will then be used by the KedroSession.

Thanks for this. I haven't played with cli.py yet. I will investigate whether this could work as a stopgap.

Having to set things in both settings.py and cli.py becomes repetitive when multiple projects share the same configuration. It would be much simpler if the settings used some sort of builder pattern that allows us to configure a project programmatically instead of relying solely on the "magic" in a settings.py file.
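
A sketch of what such a builder could look like, so shared configuration can live in one reusable package instead of being duplicated per project. None of these names exist in Kedro today; this is purely illustrative:

```python
# Hypothetical settings builder: a shared package could expose one
# pre-configured instance that every project imports and optionally extends.
class ProjectSettingsBuilder:
    def __init__(self):
        self._settings = {}

    def with_runner(self, runner_class, **runner_args):
        self._settings["runner_class"] = runner_class
        self._settings["runner_args"] = runner_args
        return self  # chainable

    def with_config_loader(self, loader_class):
        self._settings["config_loader_class"] = loader_class
        return self

    def build(self):
        return dict(self._settings)

# `object` stands in for a real runner class in this sketch.
shared = ProjectSettingsBuilder().with_runner(object, is_async=False).build()
```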

takikadiri commented 10 months ago

For the CLI part, you can set it globally in a plugin (a Python package) that you add to your projects' dependencies.
In the setup.py of your plugin, you can add this entry point, so all Kedro projects will automatically use your custom CLI when executing kedro run:

    "entry_points": {
        "kedro.project_commands": ["cli = <your_plugin_package>.<optional_module>:<cli_name>"],
    }

noklam commented 10 months ago

Having to set things in both settings.py and cli.py becomes repetitive when multiple projects share the same configuration. It would be much simpler if the settings used some sort of builder pattern that allows us to configure a project programmatically instead of relying solely on the "magic" in a settings.py file.

The original request of having a global setting for the runner is reasonable to me, but I am a bit confused by the discussion of a programmatic way to configure a project.

Assuming you could set the runner globally, isn't that just a single line of configuration in settings.py? Can you explain how a builder pattern would make this easier for multiple projects?

The default way to share code among projects is building a plugin. Kedro exposes certain entry points and hooks that you can use; the above is a good example.

noklam commented 10 months ago

To continue the original request, I think we need to expose the runner class in settings.py.

There is another issue that prevents using some custom runners natively in Kedro: the Kedro CLI doesn't support runner class arguments. The --runner option takes only the class, and the CLI then instantiates it without any way to inject init arguments.

This is not really a blocker for your runner case. At the moment you can choose to have a plugin that exposes its own run function. There is an open ticket to explore the idea of adding additional arguments; the idea is to use a system similar to pytest's, i.e. pytest-cov.
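
One way the missing capability could look on the CLI side: collect repeated --runner-arg key=value options and forward them to the runner's constructor. The flag names are hypothetical and Kedro's real CLI does not support this today; argparse is used here just to keep the sketch self-contained (Kedro's CLI is actually built on click):

```python
# Sketch of a hypothetical CLI extension, e.g.:
#   kedro run --runner MyRunner --runner-arg retries=3
import argparse

def parse_runner_args(argv):
    """Parse a runner class name plus repeatable key=value init arguments."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--runner", default="SequentialRunner")
    parser.add_argument("--runner-arg", action="append", default=[],
                        metavar="KEY=VALUE")
    ns = parser.parse_args(argv)
    kwargs = dict(item.split("=", 1) for item in ns.runner_arg)
    return ns.runner, kwargs

runner_name, runner_kwargs = parse_runner_args(
    ["--runner", "MyRunner", "--runner-arg", "retries=3"]
)
# The session could then instantiate: runner_cls(**runner_kwargs)
```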