mrpaulandrew / procfwk

A cross tenant metadata driven processing framework for Azure Data Factory and Azure Synapse Analytics achieved by coupling orchestration pipelines with a SQL database and a set of Azure Functions.
https://mrpaulandrew.com/category/azure/data-factory/adf-procfwk/

Extend Execution Stages with a new Level - Execution Batches #72

Closed: mrpaulandrew closed this 4 years ago

mrpaulandrew commented 4 years ago

Building on community feedback in the following issues:

  1. https://github.com/mrpaulandrew/procfwk/issues/61
  2. https://github.com/mrpaulandrew/procfwk/issues/62

Allow the framework to support multiple executions of the parent pipeline using a new, higher-level concept called Batches.

Examples of Batch names:

  1. Hourly
  2. Daily
  3. Weekly
  4. Monthly

Finally, it is expected that a Trigger hitting the parent pipeline will include the batch name.
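As a rough sketch of the idea (table and column names below are illustrative only, not a confirmed design), the metadata database could gain a Batches table plus a link from batches to the existing execution stages:

```sql
-- Sketch only: a higher-level Batches table and a many-to-many link to Stages.
-- Names and data types are assumptions, not the framework's actual schema.
CREATE TABLE [procfwk].[Batches]
(
    [BatchId]          UNIQUEIDENTIFIER NOT NULL DEFAULT NEWID(),
    [BatchName]        VARCHAR(255)     NOT NULL, -- e.g. 'Hourly', 'Daily', 'Weekly', 'Monthly'
    [BatchDescription] VARCHAR(4000)    NULL,
    [Enabled]          BIT              NOT NULL DEFAULT 1,
    CONSTRAINT [PK_Batches] PRIMARY KEY ([BatchId]),
    CONSTRAINT [UQ_Batches_BatchName] UNIQUE ([BatchName])
);

-- A stage can belong to more than one batch, hence a link table.
CREATE TABLE [procfwk].[BatchStageLink]
(
    [BatchId] UNIQUEIDENTIFIER NOT NULL,
    [StageId] INT              NOT NULL,
    CONSTRAINT [PK_BatchStageLink] PRIMARY KEY ([BatchId], [StageId]),
    CONSTRAINT [FK_BatchStageLink_Batches] FOREIGN KEY ([BatchId]) REFERENCES [procfwk].[Batches] ([BatchId]),
    CONSTRAINT [FK_BatchStageLink_Stages]  FOREIGN KEY ([StageId]) REFERENCES [procfwk].[Stages] ([StageId])
);
```

The batch names listed above would then simply be rows in this table, and the trigger would pass the batch name as a parameter to the parent pipeline, which resolves its stages via the link table.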

Mallik-G commented 4 years ago

Hi @mrpaulandrew,

I too had a similar requirement to support multiple parallel executions. I approached it similarly, except that I called the higher-level concept a "Job".

A Job is made up of one or more stages, and multiple jobs can run concurrently. The JobId is passed as a parameter from the parent pipeline's trigger and is made available to all lower levels. The pipelines linked to all stages in a job are inserted into LocalExecution, and once the job succeeds, the log records for that job are moved to ExecutionLog.
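As a rough sketch of that log movement (assuming a JobId column exists on both execution tables; the column list here is illustrative, not the actual schema):

```sql
-- Sketch only: @JobId is assumed to be a parameter of the surrounding procedure.
INSERT INTO [procfwk].[ExecutionLog]
    ([JobId], [StageId], [PipelineId], [PipelineName], [StartDateTime], [PipelineStatus], [EndDateTime])
SELECT [JobId], [StageId], [PipelineId], [PipelineName], [StartDateTime], [PipelineStatus], [EndDateTime]
FROM   [procfwk].[LocalExecution]
WHERE  [JobId] = @JobId;

-- Clear the job's rows from the current execution table once archived.
DELETE FROM [procfwk].[LocalExecution]
WHERE [JobId] = @JobId;
```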

Metadata database changes include:


I didn't have much use for the Grand Parent concept in my work setup, so my idea was to give the ParentPipeline different triggers (schedule or tumbling window) and pass the JobId as a parameter to the pipeline.
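As a sketch, the parent pipeline's first lookup could then resolve the stages belonging to the JobId supplied by the trigger (Jobs and JobStageLink here are assumed tables, not the existing schema):

```sql
-- Sketch only: return the enabled stages linked to the job passed in by the trigger.
SELECT     s.[StageId],
           s.[StageName]
FROM       [procfwk].[Stages] AS s
INNER JOIN [procfwk].[JobStageLink] AS jsl
        ON jsl.[StageId] = s.[StageId]
WHERE      jsl.[JobId] = @JobId
  AND      s.[Enabled] = 1
ORDER BY   s.[StageId];
```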

Sometimes a worker pipeline is reused in a stage within the same job or across jobs. For example, LaunchDatabricksJob is a pipeline that gets reused a lot at my work, where I need to pass different invocation params, class names and jar names to submit different apps/jars to the Databricks cluster. To handle that, I just add a new entry to the Pipelines table mapping the Stage and Pipeline (effectively a new pipeline instance), which creates a new PipelineId and makes that particular pipeline instance unique. That PipelineId is then used in the Pipeline Parameters table to associate the parameters needed for that particular pipeline instance.
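For example, registering a second instance of LaunchDatabricksJob against another stage and attaching its own parameters might look something like this (values are invented, and column names are assumed from the current Pipelines / PipelineParameters tables):

```sql
-- Sketch only: register the same worker pipeline again under a different stage,
-- which yields a new PipelineId, then attach parameters to that instance.
-- Assumes PipelineId is an IDENTITY column; stage and parameter values are made up.
INSERT INTO [procfwk].[Pipelines] ([StageId], [PipelineName], [Enabled])
VALUES (2, 'LaunchDatabricksJob', 1);

DECLARE @NewPipelineId INT = SCOPE_IDENTITY();

INSERT INTO [procfwk].[PipelineParameters] ([PipelineId], [ParameterName], [ParameterValue])
VALUES (@NewPipelineId, 'JarName',          'marketing-egression.jar'),
       (@NewPipelineId, 'MainClassName',    'com.example.MarketingEgression'),
       (@NewPipelineId, 'InvocationParams', '--run-date 2020-07-01');
```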

Examples of Jobs:

  1. DailyTrigger-SalesIngestion => scheduled at 3 AM daily (AUS time zone) => parameter: JobId = 1
  2. DailyTrigger-MarketingEgression => tumbling window trigger scheduled at 4 AM daily (UTC) => parameter: JobId = 2
  3. WeeklyTrigger-CampaignIngestion => scheduled at 3 AM on Mondays (AUS time zone) => parameter: JobId = 3

I had this as a work-in-progress project locally for some time, but after looking at the recent conversations around this issue, I have pulled the latest changes from the master branch, applied my changes (metadata database changes only) and pushed them to my fork. I'm planning to work on the ADF pipeline changes tonight and test it end to end.

Please have a look at the changes and let me know if it helps - https://github.com/Mallik-G/ADF.procfwk/tree/feature/mallik

Thanks, Mallik

NJLangley commented 4 years ago

@mrpaulandrew I have pushed my implementation of this onto my fork here: https://github.com/NJLangley/ADF.procfwk/tree/feature/batches

There are a few outstanding bits to fix:

Let me know if you have any questions or run into issues testing it.