mrpaulandrew / procfwk

A cross tenant metadata driven processing framework for Azure Data Factory and Azure Synapse Analytics achieved by coupling orchestration pipelines with a SQL database and a set of Azure Functions.
https://mrpaulandrew.com/category/azure/data-factory/adf-procfwk/

Extend Execution Stages with a new Level - Execution Batches #72

Closed: mrpaulandrew closed this 4 years ago

mrpaulandrew commented 4 years ago

Building on community feedback in the following issues:

  1. https://github.com/mrpaulandrew/procfwk/issues/61
  2. https://github.com/mrpaulandrew/procfwk/issues/62

Allow the framework to support multiple executions of the parent pipeline using a new, higher-level concept called Batches.

Examples of Batch names:

  1. Hourly
  2. Daily
  3. Weekly
  4. Monthly

Finally, it is expected that a Trigger hitting the parent pipeline will include the batch name.
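As a rough sketch of the idea (table and column names below are illustrative only, not a confirmed design), the metadata database could gain a Batches table plus a link from batches to the existing execution stages:

```sql
-- Sketch only: a higher-level Batches table and a many-to-many link to Stages.
-- Names and data types are assumptions, not the framework's actual schema.
CREATE TABLE [procfwk].[Batches]
(
    [BatchId]          UNIQUEIDENTIFIER NOT NULL DEFAULT NEWID(),
    [BatchName]        VARCHAR(255)     NOT NULL, -- e.g. 'Hourly', 'Daily', 'Weekly', 'Monthly'
    [BatchDescription] VARCHAR(4000)    NULL,
    [Enabled]          BIT              NOT NULL DEFAULT 1,
    CONSTRAINT [PK_Batches] PRIMARY KEY ([BatchId]),
    CONSTRAINT [UQ_Batches_BatchName] UNIQUE ([BatchName])
);

-- A stage can belong to more than one batch, hence a link table.
CREATE TABLE [procfwk].[BatchStageLink]
(
    [BatchId] UNIQUEIDENTIFIER NOT NULL,
    [StageId] INT              NOT NULL,
    CONSTRAINT [PK_BatchStageLink] PRIMARY KEY ([BatchId], [StageId]),
    CONSTRAINT [FK_BatchStageLink_Batches] FOREIGN KEY ([BatchId]) REFERENCES [procfwk].[Batches] ([BatchId]),
    CONSTRAINT [FK_BatchStageLink_Stages]  FOREIGN KEY ([StageId]) REFERENCES [procfwk].[Stages] ([StageId])
);
```

The batch names listed above would then simply be rows in this table, and the trigger would pass the batch name as a parameter to the parent pipeline, which resolves its stages via the link table.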

Mallik-G commented 4 years ago

Hi @mrpaulandrew,

I too had a similar requirement to support multiple parallel executions. I approached it similarly, except that I called the higher-level concept a "Job".

A Job is made up of one or more stages, and multiple jobs can run concurrently. The JobId is passed as a parameter from the parent pipeline's trigger and is made available to all lower levels. The pipelines linked to all stages in a job are inserted into LocalExecution, and once the job succeeds, the log records for that job are moved to ExecutionLog.
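As a rough sketch of that log movement (assuming a JobId column exists on both execution tables; the column list here is illustrative, not the actual schema):

```sql
-- Sketch only: @JobId is assumed to be a parameter of the surrounding procedure.
INSERT INTO [procfwk].[ExecutionLog]
    ([JobId], [StageId], [PipelineId], [PipelineName], [StartDateTime], [PipelineStatus], [EndDateTime])
SELECT [JobId], [StageId], [PipelineId], [PipelineName], [StartDateTime], [PipelineStatus], [EndDateTime]
FROM   [procfwk].[LocalExecution]
WHERE  [JobId] = @JobId;

-- Clear the job's rows from the current execution table once archived.
DELETE FROM [procfwk].[LocalExecution]
WHERE [JobId] = @JobId;
```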

Metadata database changes include:


I didn't have much use for the Grand Parent concept in my work setup, so my idea was to give the ParentPipeline different triggers (schedule or tumbling window) and pass the JobId as a parameter to the pipeline.
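As a sketch, the parent pipeline's first lookup could then resolve the stages belonging to the JobId supplied by the trigger (Jobs and JobStageLink here are assumed tables, not the existing schema):

```sql
-- Sketch only: return the enabled stages linked to the job passed in by the trigger.
SELECT     s.[StageId],
           s.[StageName]
FROM       [procfwk].[Stages] AS s
INNER JOIN [procfwk].[JobStageLink] AS jsl
        ON jsl.[StageId] = s.[StageId]
WHERE      jsl.[JobId] = @JobId
  AND      s.[Enabled] = 1
ORDER BY   s.[StageId];
```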

Sometimes a worker pipeline is reused in a stage within the same job or across jobs. For example, LaunchDatabricksJob is a pipeline that gets reused a lot at my work, where I need to pass different invocation params, class names and jar names to submit different apps/jars to the Databricks cluster. To handle that, I just add a new entry to the Pipelines table mapping the Stage and Pipeline (effectively a new pipeline instance), which creates a new PipelineId and makes that particular pipeline instance unique. That PipelineId is then used in the Pipeline Parameters table to associate the parameters needed for that particular pipeline instance.
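For example, registering a second instance of LaunchDatabricksJob against another stage and attaching its own parameters might look something like this (values are invented, and column names are assumed from the current Pipelines / PipelineParameters tables):

```sql
-- Sketch only: register the same worker pipeline again under a different stage,
-- which yields a new PipelineId, then attach parameters to that instance.
-- Assumes PipelineId is an IDENTITY column; stage and parameter values are made up.
INSERT INTO [procfwk].[Pipelines] ([StageId], [PipelineName], [Enabled])
VALUES (2, 'LaunchDatabricksJob', 1);

DECLARE @NewPipelineId INT = SCOPE_IDENTITY();

INSERT INTO [procfwk].[PipelineParameters] ([PipelineId], [ParameterName], [ParameterValue])
VALUES (@NewPipelineId, 'JarName',          'marketing-egression.jar'),
       (@NewPipelineId, 'MainClassName',    'com.example.MarketingEgression'),
       (@NewPipelineId, 'InvocationParams', '--run-date 2020-07-01');
```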

Examples of Jobs:

  1. DailyTrigger-SalesIngestion => scheduled at 3 AM daily (AUS time zone) => parameter: JobId = 1
  2. DailyTrigger-MarketingEgression => tumbling window trigger scheduled at 4 AM daily (UTC) => parameter: JobId = 2
  3. WeeklyTrigger-CampaignIngestion => scheduled at 3 AM on Mondays (AUS time zone) => parameter: JobId = 3

I had this as a work-in-progress project locally for some time, but after looking at the recent conversations around this issue, I have pulled the latest changes from the master branch, applied my changes (metadata database changes only) and pushed them to my fork. I'm planning to work on the ADF pipeline changes tonight and test it end to end.

Please have a look at the changes and let me know if it helps - https://github.com/Mallik-G/ADF.procfwk/tree/feature/mallik

Thanks, Mallik

NJLangley commented 4 years ago

@mrpaulandrew I have pushed my implementation of this onto my fork here: https://github.com/NJLangley/ADF.procfwk/tree/feature/batches

There are a few outstanding bits to fix:

Let me know if you have any questions or run into issues testing it.