pydiverse / pydiverse.pipedag

A data pipeline orchestration library for rapid iterative development with automatic cache invalidation, allowing users to focus on writing their tasks in pandas, polars, sqlalchemy, ibis, and the like.
https://pydiversepipedag.readthedocs.io/
BSD 3-Clause "New" or "Revised" License

Add group nodes for structuring flows, as task ordering barriers, and for clustering tasks in visualization #182

Closed windiana42 closed 3 months ago

windiana42 commented 3 months ago

Checklist

windiana42 commented 3 months ago

This PR introduces baseline testing for visualizations: https://github.com/pydiverse/pydiverse.pipedag/blob/group_nodes/tests/util/baseline.py It is used here: https://github.com/pydiverse/pydiverse.pipedag/blob/group_nodes/tests/test_run_group_node.py#L190

This means the test stores a copy of the flows' visualization URLs in the repo and compares the actual output against that baseline. If the output changes, a test_run_group_node.updated.json file is created, which must be copied manually to test_run_group_node.json to accept the changes.
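To illustrate the workflow described above, here is a minimal sketch of a baseline comparison. It is not the code in tests/util/baseline.py; the function name `check_baseline` and the JSON layout are assumptions made for illustration only.

```python
import json
from pathlib import Path


def check_baseline(baseline_path: Path, actual: dict) -> bool:
    """Compare `actual` against the JSON baseline stored in the repo.

    Hypothetical helper, not pipedag's implementation. On a mismatch,
    the actual output is written next to the baseline as
    `<name>.updated.json` so it can be reviewed and copied over manually.
    """
    expected = None
    if baseline_path.exists():
        expected = json.loads(baseline_path.read_text())

    if expected == actual:
        return True

    # e.g. test_run_group_node.json -> test_run_group_node.updated.json
    updated_path = baseline_path.with_name(baseline_path.stem + ".updated.json")
    updated_path.write_text(json.dumps(actual, indent=2, sort_keys=True))
    return False
```

A test would then assert `check_baseline(...)` and fail whenever the visualization output drifts, leaving the `.updated.json` file behind for manual acceptance.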

NMAC427 commented 3 months ago

What exactly is the purpose of the barrier nodes?

windiana42 commented 3 months ago

> What exactly is the purpose of the barrier nodes?

There are cases where it is easier to explicitly ask for tasks to be executed in the order of declaration, irrespective of explicit argument-based dependencies. The idea was actually born while working on #181. A common use case there is to declare a stage validator task which is executed at the end of preparing a stage. Thus, the associated decorator @input_stage_versions could be made to execute after all @materialize tasks within the stage unless explicit dependencies demand a different ordering. However, the @input_stage_versions decorator can also be used to copy filtered input into a stage, so in rare cases one might want to execute it first. This can be done with such a barrier. Among current pipedag users, the idea that task order can be controlled more explicitly was very well received, and it was also requested in some unrelated discussions.
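The ordering idea can be sketched independently of pipedag. The following toy scheduler is not pipedag's API or implementation; `BARRIER`, `execution_order`, and the task names are hypothetical. It assumes a barrier means that every task declared after it runs after every task declared before it, in addition to the usual argument-based dependencies.

```python
from graphlib import TopologicalSorter

BARRIER = object()  # hypothetical marker, not a pipedag object


def execution_order(declared, deps):
    """Return one valid task order honoring dependencies and barriers.

    declared: task names in declaration order, with BARRIER markers.
    deps: mapping task -> set of tasks it takes as arguments.
    """
    graph = {}
    closed = []   # tasks declared before the most recent barrier
    segment = []  # tasks declared since the most recent barrier
    for item in declared:
        if item is BARRIER:
            closed.extend(segment)
            segment = []
            continue
        # Argument-based dependencies plus everything behind the barrier.
        graph[item] = set(deps.get(item, ())) | set(closed)
        segment.append(item)
    return list(TopologicalSorter(graph).static_order())


order = execution_order(
    ["copy_filtered_input", BARRIER, "load", "transform"],
    {"transform": {"load"}},
)
```

Here `copy_filtered_input` is forced to run before `load` and `transform` even though neither takes it as an argument. In this sketch, an argument-based dependency that contradicts a barrier surfaces as a `graphlib.CycleError`; how pipedag itself resolves such conflicts is not stated in this thread.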

windiana42 commented 3 months ago

@NMAC427 we discussed a lot of different ways for users to control the ordering of task execution. The barrier concept seems to be the nicest of them. We also considered ways of specifying levels which would influence ordering. I am actually considering abolishing all other ways of manually specifying ordering except for argument-based dependencies and barrier groups.

windiana42 commented 3 months ago

@NMAC427 @NicolasMuellerQC @DominikZuercherQC I plan to merge this PR early tomorrow in order to release 0.9.0.