kedro-org / kedro

Kedro is a toolbox for production-ready data science. It uses software engineering best practices to help you create data engineering and data science pipelines that are reproducible, maintainable, and modular.
https://kedro.org
Apache License 2.0
9.48k stars 874 forks

Allow for specifying extra node dependencies #3988

Open lvijnck opened 3 days ago

lvijnck commented 3 days ago

Description

I've always felt that Kedro lacks a way to specify additional dependencies between nodes that are not dataset-related.

Context

For instance, consider the problem of filling a knowledge graph through Kedro. Obviously, there are two main nodes:

  1. Write nodes
  2. Write edges

However, the edges cannot be written before the nodes have been pushed. There is hence no "dataset" dependency between the nodes, but rather an execution dependency.

Possible Implementation

Adding this to Kedro would involve 1) an addition to the node system and 2) an update to the topological execution mechanism. With respect to the nodes, dependencies could be specified as follows:

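from kedro.pipeline import Pipeline, node, pipeline

# Assumed location of the node functions (standard Kedro project layout).
from .nodes import write_edges, write_nodes


def create_pipeline(**kwargs) -> Pipeline:
    """Create embeddings pipeline."""
    return pipeline(
        [
            node(
                func=write_nodes,
                inputs=[
                    "int.nodes"
                ],
                outputs="prm.nodes",
                name="write_nodes",
            ),
            node(
                func=write_edges,
                inputs=[
                    "int.edges"
                ],
                outputs="prm.edges",
                name="write_edges",
                # Proposed argument; node() does not accept it today.
                # It lists node names that must finish before this node runs.
                dependencies=["write_nodes"],
            ),
        ]
    )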
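
To make the second part of the proposal more concrete, here is a rough sketch of how an execution order could combine dataset edges with explicit dependencies. It uses only the standard library's graphlib and does not touch Kedro's internals. NodeSpec and execution_order are hypothetical names made up for this illustration, and the dependencies field mirrors the proposed argument rather than anything Kedro accepts today.

```python
from dataclasses import dataclass
from graphlib import TopologicalSorter


@dataclass(frozen=True)
class NodeSpec:
    """Hypothetical stand-in for a node; this is not Kedro's Node class."""

    name: str
    inputs: tuple[str, ...]
    outputs: tuple[str, ...]
    dependencies: tuple[str, ...] = ()  # names of nodes that must run first


def execution_order(nodes: list[NodeSpec]) -> list[str]:
    """Order nodes by dataset edges plus explicit node-name dependencies."""
    names = {n.name for n in nodes}
    # Map every produced dataset to the node that produces it.
    producer = {out: n.name for n in nodes for out in n.outputs}

    graph: dict[str, set[str]] = {}
    for n in nodes:
        unknown = set(n.dependencies) - names
        if unknown:
            raise ValueError(f"{n.name} depends on unknown nodes: {sorted(unknown)}")
        # Dataset edges: whoever produces one of my inputs runs before me.
        predecessors = {producer[i] for i in n.inputs if i in producer}
        # Execution edges: the extra dependencies from the proposal.
        predecessors.update(n.dependencies)
        graph[n.name] = predecessors

    # static_order() raises graphlib.CycleError if the graph has a cycle.
    return list(TopologicalSorter(graph).static_order())


# write_edges is listed first on purpose; only the explicit dependency
# forces it to run after write_nodes, since they share no dataset.
specs = [
    NodeSpec("write_edges", ("int.edges",), ("prm.edges",), dependencies=("write_nodes",)),
    NodeSpec("write_nodes", ("int.nodes",), ("prm.nodes",)),
]
order = execution_order(specs)
print(order)
```

The point of the sketch is that explicit dependencies are just extra edges in the same graph, so cycle detection and ordering would not need a separate mechanism.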

Possible Alternatives

The current workaround is to add "artificial" dataset dependencies between the nodes. This has the drawback of polluting the function signatures of those nodes.
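
For illustration, this is roughly what the workaround does to the function signatures. The sketch is plain Python with a toy runner (the wiring dict and run function are hypothetical, not Kedro's runner), and nodes_written is an assumed name for the artificial dataset. In a real project the same wiring would be declared through the inputs and outputs of the two nodes.

```python
def write_nodes(nodes: list[dict]) -> tuple[list[dict], bool]:
    """Push nodes, then return an extra marker that exists only for ordering."""
    written = list(nodes)  # placeholder for the real write to the graph
    return written, True


def write_edges(edges: list[dict], _nodes_written: bool) -> list[dict]:
    """Push edges; `_nodes_written` is never read, it only forces ordering."""
    return list(edges)  # placeholder for the real write to the graph


# "nodes_written" is the artificial dataset linking the two nodes.
wiring = {
    "write_edges": {
        "func": write_edges,
        "inputs": ["int.edges", "nodes_written"],
        "outputs": ["prm.edges"],
    },
    "write_nodes": {
        "func": write_nodes,
        "inputs": ["int.nodes"],
        "outputs": ["prm.nodes", "nodes_written"],
    },
}


def run(wiring: dict, initial: dict) -> tuple[list[str], dict]:
    """Toy runner: execute a node once all of its input datasets exist."""
    data = dict(initial)
    pending = dict(wiring)
    ran = []
    while pending:
        ready = [
            name
            for name, spec in pending.items()
            if all(i in data for i in spec["inputs"])
        ]
        if not ready:
            raise RuntimeError(f"unsatisfied inputs for: {sorted(pending)}")
        for name in ready:
            spec = pending.pop(name)
            result = spec["func"](*(data[i] for i in spec["inputs"]))
            if len(spec["outputs"]) == 1:
                result = (result,)
            data.update(zip(spec["outputs"], result))
            ran.append(name)
    return ran, data


ran, data_out = run(
    wiring,
    {"int.nodes": [{"id": "a"}, {"id": "b"}], "int.edges": [{"src": "a", "dst": "b"}]},
)
print(ran)
```

The ordering comes out right, but both functions now carry a value that has nothing to do with their actual job, which is the pollution described above.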

datajoely commented 3 days ago

Hey @lvijnck good to see you pop up here 👀 congrats on the new role!

The current way to do this is to pass a dummy dataset between the nodes to coerce the DAG into the right shape.

There are some open proposals for a more explicit mechanism for defining the DAG order, e.g. #1156. I'm 99% sure @noklam has a concrete design somewhere, but I can't find it.

datajoely commented 3 days ago

This was the issue (now a discussion) I was looking for. @lvijnck, if you have any further thoughts, please add them there, as it really helps prioritise things.

https://github.com/kedro-org/kedro/discussions/3758