mrpaulandrew / procfwk

A cross tenant metadata driven processing framework for Azure Data Factory and Azure Synapse Analytics achieved by coupling orchestration pipelines with a SQL database and a set of Azure Functions.
https://mrpaulandrew.com/category/azure/data-factory/adf-procfwk/

Dependency Within a Stage #112

Open gauravjhunjhunwala opened 2 years ago

gauravjhunjhunwala commented 2 years ago

There are use cases where a pipeline may depend on another pipeline within the same stage. For such cases, it would be easier to set a dependent pipeline within the same stage. Another use case I can think of is when pipelines should be executed sequentially within the same stage. If my understanding of the framework is correct, we cannot have dependent jobs in the same stage.

jamclaug commented 2 years ago

Doesn't this fly in the face of the definition of a stage? Stages are executed sequentially, with each pipeline in a stage executing in parallel. If you are using batch execution, then the batches control which stages execute, but they still execute sequentially.

If you have a dependency chain of pipeline executions, each pipeline on that chain should execute in its own stage. Multiple unrelated dependency chain steps can of course execute in parallel within the same stage. You can then use the dependencies (with the proper Properties set) to enforce the dependency.

For example, with pipelines A, B, C and D, with dependencies A->B and C->D:

Without batch execution:
Stage 1 - Pipelines A and C
Stage 2 - Pipelines B and D
Dependencies are set that B is dependent on A and D is dependent on C.

With batch execution, two batches:
Batch X - Stage 1 and Stage 2
Batch Y - Stage 3 and Stage 4
Stage 1 - Pipeline A
Stage 2 - Pipeline B
Stage 3 - Pipeline C
Stage 4 - Pipeline D
Dependencies are set that B is dependent on A and D is dependent on C.
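To make the rule in the example concrete, here is a minimal Python sketch that checks a layout against it: every upstream pipeline must sit in an earlier stage than the pipeline that depends on it. The dictionaries `stage_of` and `depends_on` and the function name are hypothetical stand-ins for illustration; they are not the framework's actual tables or API.

```python
# Hypothetical in-memory stand-ins for the framework's metadata (illustrative only).
stage_of = {"A": 1, "C": 1, "B": 2, "D": 2}   # pipeline -> stage number
depends_on = {"B": ["A"], "D": ["C"]}          # pipeline -> upstream pipelines


def find_same_or_later_stage_dependencies(stage_of, depends_on):
    """Return (pipeline, upstream) pairs where the upstream is NOT in an earlier stage."""
    problems = []
    for pipeline, upstreams in depends_on.items():
        for upstream in upstreams:
            # Stages run sequentially, so an upstream must have a smaller stage number.
            if stage_of[upstream] >= stage_of[pipeline]:
                problems.append((pipeline, upstream))
    return problems


violations = find_same_or_later_stage_dependencies(stage_of, depends_on)
print(violations)  # an empty list means the layout respects the rule
```

Under these assumptions, the "without batch execution" layout passes the check, while putting A and B in the same stage would be reported as a violation.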

garethadvancing commented 2 years ago

This stands if the presumption is that a stage can only ever be defined by the dependency chain, whereas, in practice, businesses would likely rather collapse these into a more descriptive collection, e.g. if following the lakehouse architecture, bronze - silver - gold, where interlinked dependencies in gold would be common. You also wouldn't want to get to the point where you have tens of meaningless stages that are simply organised that way because the dependencies force them to be. The simple fix would be to introduce sub-stages, calculating where in a chain a pipeline should fit: 1.1, 1.2, 1.3, 2.1, 2.2, etc. This achieves the same thing functionally but gives far more flexibility in stage naming and in collecting pipelines into meaningful groups.
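As an illustration of the sub-stage idea, the following Python sketch labels each pipeline with `stage.depth`, where depth is its position in the longest chain of dependencies inside its own stage. The pipeline names, the `stage_of` and `depends_on` dictionaries, and the function name are all hypothetical; this is not how the framework itself stores or computes anything, and the sketch assumes the dependencies contain no cycles.

```python
from functools import lru_cache

# Hypothetical metadata (illustrative only).
stage_of = {"P1": 1, "P2": 1, "P3": 1, "Q1": 2, "Q2": 2}   # pipeline -> stage number
depends_on = {"P2": ["P1"], "P3": ["P2"], "Q2": ["Q1"]}     # pipeline -> upstream pipelines


def sub_stage_labels(stage_of, depends_on):
    """Label each pipeline 'stage.depth' using only dependencies within its own stage."""

    @lru_cache(maxsize=None)
    def depth(pipeline):
        # Only upstreams in the same stage push a pipeline into a later sub-stage.
        same_stage = [
            upstream
            for upstream in depends_on.get(pipeline, [])
            if stage_of[upstream] == stage_of[pipeline]
        ]
        return 1 + max((depth(upstream) for upstream in same_stage), default=0)

    return {pipeline: f"{stage_of[pipeline]}.{depth(pipeline)}" for pipeline in stage_of}


labels = sub_stage_labels(stage_of, depends_on)
for pipeline, label in sorted(labels.items()):
    print(pipeline, label)
```

With this input the chain P1 -> P2 -> P3 in stage 1 becomes sub-stages 1.1, 1.2 and 1.3, and Q1 -> Q2 in stage 2 becomes 2.1 and 2.2.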

jamclaug commented 2 years ago

I think there is a divide here between logical organization and implementation details. Sub-stages or whatever, the stage IDs are an implementation detail. An example from our implementation:

[Image: table from our implementation showing stage IDs and StageName values]

As you can see, the StageName provides the necessary Meta-Metadata (?!) that describes the LOGICAL relationships between the stages, while the stage ID is the value that is associated with the Batches. Doing it this way allows me to see at a glance which batches are executing and the status of those executions. The name could be anything, e.g. 'COPA - Bronze Stage 1', according to the LOGICAL organization of the stages.

We are transitioning to a Data Mesh inspired architecture, with logical separation into source and consumer domains. Part of that transition will be changing how batches are executed so that each batch consists of the pipelines needed to process a single Source Domain or Consumer Domain. The stages will be renamed accordingly, e.g. 'Customer Management - Bronze Stage 1', assuming we use the medallion architecture (which is another topic altogether).

The actual implementation of those stages in the orchestration system should be done in the way that makes the most sense for the implementation. SQL notebooks and Power BI reports can be created that use the logical organization to make operational maintenance easier.

This is just my experience, so please feel free to correct me if I am off base here. I hope this is helpful.

garethadvancing commented 2 years ago

Honestly, I haven't used this framework in a while, and it's been even longer since I used the base version, so I won't speak in specifics about it. Practically, though, the dependencies should be completely contained in the dependency table that's set up, and stages should be descriptive.

In your example you clearly have 3 stages, Copa, Capella and Teknisk.

The table you've then shown, with the stage "number", should be derived at runtime. You shouldn't need to manually specify a sub-stage number such as COPA - 3. Based on your dependencies, the platform should work out at runtime that you have 3 nested dependencies in the COPA stage and organise these into 3 loops on its own. That's how our platform is set up, so if more dependencies are added and this creates a COPA 4, 5, 6 and 7, this is done at runtime and doesn't need to be specified manually.

Which loops back round to the original comment, suggesting this form of functionality, so a stage can exist regardless of how many nested dependencies it may have.
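The runtime derivation described above could be sketched roughly as follows, using Python's standard-library `graphlib.TopologicalSorter` (Python 3.9+) to group the pipelines of one descriptive stage into sequential loops, with every pipeline in a loop free to run in parallel. The pipeline names, the `copa_dependencies` mapping and the `derive_loops` function are assumptions made up for this example; nothing here reflects the framework's or any commenter's actual implementation.

```python
from graphlib import TopologicalSorter

# Hypothetical dependency table for one descriptive stage: pipeline -> upstream pipelines.
copa_dependencies = {
    "load_orders": [],
    "clean_orders": ["load_orders"],
    "load_customers": [],
    "report_orders": ["clean_orders", "load_customers"],
}


def derive_loops(dependencies):
    """Group pipelines into loops; each loop only depends on pipelines in earlier loops."""
    sorter = TopologicalSorter(dependencies)
    sorter.prepare()  # raises graphlib.CycleError if the dependencies are circular
    loops = []
    while sorter.is_active():
        # Everything returned here has all of its upstreams already marked done.
        ready = sorted(sorter.get_ready())
        loops.append(ready)
        sorter.done(*ready)
    return loops


loops = derive_loops(copa_dependencies)
for number, pipelines in enumerate(loops, start=1):
    print(f"COPA {number}: {pipelines}")
```

Here the four pipelines fall into three loops without any sub-stage number being stored: adding a further dependency to the table would change the number of loops on the next run.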