odpi / egeria

Egeria core
https://egeria-project.org
Apache License 2.0

New OM Type of Process Template #1576

Closed: cong78 closed this issue 4 years ago

cong78 commented 5 years ago

During the Egeria August workshop we discussed the need for a new asset type, Process Template. Data engines can use this new type to provide processing or transformation templates, which can then be used to design or develop data processing pipelines or ETL jobs.

Take Apache NiFi (https://nifi.apache.org/docs/nifi-docs/html/user-guide.html#building-dataflow) as an open source example: it provides many Processors as basic components for building a data flow pipeline. Each processor has its own configuration properties and functional scenarios. When designing a data flow pipeline, a user can configure an existing processor with input and output ports as part of the flow.

From an open metadata perspective, when we register a data flow pipeline from Apache NiFi, acting as a Data Engine client to Egeria, we can save its processors as Process Template entities so that they are visible for metadata visibility and management purposes. When someone designs an ETL job or data processing pipeline, he or she will use that Process Template to create one or more processes.

Screenshot 2019-09-17 at 13 41 53

I have drafted this new type, based on my understanding, within area 0217 Automated Process. My reasoning is that a Process Template can be considered part of process automation, since it serves a specific technical purpose and involves no human creation. It is a subtype of Asset with a relationship to the type Process, so that processes can be created based on the Process Templates offered by data engines.

@cmgrote @mandy-chessell @popa-raluca This is just my initial draft based on the input I received, so please let me know your thoughts and ideas. ;)

Thanks,

Cong

cmgrote commented 5 years ago

Hi Cong -- couple questions / comments:

cong78 commented 5 years ago

Hi Chris, thanks for replying. Here are some thoughts from me.

  • Another scenario for ProcessTemplate, I think, was for it to represent a sort of process (ie. composition of multiple sub-processes) that simply was not in an executable state (ie. was not "bound")...

Yes, I agree with this point, and I think it was also discussed as one of the reasons for this new type during the last workshop.

  • Assuming this is still the case, would we expect ProcessTemplate to be materially (properties-wise) different from Process? If not, I would sort of think one might extend the other type-wise: perhaps Process extends ProcessTemplate (?)

I was also thinking about this. One question I have: if we save unfinished DataStage jobs as process template entities with relationships to ports in Egeria, what happens when the job is finished (after pressing the run button, I guess)? Will a new process entity replace the old process template entity?

  • The ProcessByTemplate relationship should probably also only be 0..1 cardinality on the ProcessTemplate end (that is: a Process can presumably only be the executable form of a single ProcessTemplate rather than multiple; though indeed you could have multiple executable Processes that were derived from a single ProcessTemplate so the other end should likely remain *)

I drew the relationship's cardinality as *..* because one of the design drawings that Raluca sent me shows different levels of processes, from most granular to most abstract. So I assumed that a very high-level process could be represented by multiple (sub)processes, which would probably use the same process template entity during process design.
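
The two cardinality options discussed here can be compared with a small sketch. This is only an illustration in Python, not Egeria code: the class `ProcessByTemplateLinks`, its methods, and the process and template names are all hypothetical; only the relationship name ProcessByTemplate comes from this thread.

```python
from collections import defaultdict


class ProcessByTemplateLinks:
    """Hypothetical in-memory store for ProcessByTemplate relationships."""

    def __init__(self, max_templates_per_process=1):
        # 1 models the 0..1 ProcessTemplate end; None models an unbounded (*) end.
        self.max_templates_per_process = max_templates_per_process
        self._templates_of = defaultdict(set)  # process name -> template names
        self._processes_of = defaultdict(set)  # template name -> process names

    def link(self, process, template):
        """Record that `process` was created from `template`."""
        linked = self._templates_of[process]
        limit = self.max_templates_per_process
        if limit is not None and template not in linked and len(linked) >= limit:
            raise ValueError(f"{process} is already linked to {sorted(linked)}")
        linked.add(template)
        self._processes_of[template].add(process)

    def templates_of(self, process):
        return sorted(self._templates_of[process])

    def processes_of(self, template):
        return sorted(self._processes_of[template])


# 0..1 on the template end: many processes may still share one template.
strict = ProcessByTemplateLinks(max_templates_per_process=1)
strict.link("LoadCustomers", "FilterRows")
strict.link("LoadOrders", "FilterRows")

# *..* as drawn in the draft: a process may also reference several templates.
loose = ProcessByTemplateLinks(max_templates_per_process=None)
loose.link("LoadCustomers", "FilterRows")
loose.link("LoadCustomers", "JoinTables")
```

With the strict store, a further `strict.link("LoadCustomers", "JoinTables")` raises `ValueError`, whereas the loose store accepts it. So the open question is whether a single process should ever point at more than one template.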

cong78 commented 5 years ago

The New Type drawing from @cmgrote 's proposal?

Screenshot 2019-09-18 at 10 41 04

mandy-chessell commented 5 years ago

I think it is important to separate the idea of a process template from something that is executable - so a process should not inherit from a process template. (i.e. remove the triangle arrowhead :)

The relationship could be called ProcessTemplateImplementation.

The ProcessTemplate belongs in Area 5 - see 0575

mandy-chessell commented 5 years ago

Another thought - the ProcessTemplate should probably inherit from Referenceable rather than Asset. This is what we have done for reference data (valid value set) and model element
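
A minimal sketch of the type hierarchy this suggestion implies, assuming Process stays a subtype of Asset as in the original draft. The Python classes below are a simplified illustration, not the Egeria type definitions; the attribute names and the qualified names in the example are hypothetical.

```python
class Referenceable:
    """Anything with a unique name that other metadata can refer to."""

    def __init__(self, qualified_name):
        self.qualified_name = qualified_name


class Asset(Referenceable):
    """A Referenceable that is also managed as an asset (simplified)."""

    def __init__(self, qualified_name, display_name):
        super().__init__(qualified_name)
        self.display_name = display_name


class Process(Asset):
    """Executable process; stays on the Asset branch."""


class ProcessTemplate(Referenceable):
    """Design-time template; deliberately neither an Asset nor a Process."""


template = ProcessTemplate("nifi::templates::create-table")
process = Process("nifi::flows::load-customers", "Load customers")
```

Here `ProcessTemplate` and `Process` share only `Referenceable` as a common ancestor, so anything defined at the Asset level would not automatically apply to a template.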

cmgrote commented 5 years ago

I think it is important to separate the idea of a process template from something that is executable - so a process should not inherit from a process template.

I agree that they should be distinct concepts, but what about the large overlap that's likely there in terms of the relationships that are valid across both? I would expect a ProcessTemplate could be composed of sub-ProcessTemplates (just as Processes can be composed of sub-Processes), or even that it might be possible for a ProcessTemplate to be defined by the composition of multiple underlying Processes... I think a ProcessTemplate should also still be able to relate to ProcessPorts, too? Similarly, perhaps relations from other areas (like GovernanceProcessImplementation) should logically be able to relate to either a Process or a ProcessTemplate (?)

(So I'm perhaps leaning towards a ProcessTemplate being a specialised form of (extending from) Process?)

cong78 commented 5 years ago

Another thought - the ProcessTemplate should probably inherit from Referenceable rather than Asset. This is what we have done for reference data (valid value set) and model element

I have made a drawing based on moving ProcessTemplate to area 0575 ProcessSchema.

Screenshot 2019-09-18 at 13 48 55

To clarify my understanding, we are talking about two scenarios for this new type. In the first, ProcessTemplate is a separate type extending Referenceable. In the second, the Process type extends the new ProcessTemplate type.

Screenshot 2019-09-18 at 13 37 21

mandy-chessell commented 5 years ago

Sorry @cmgrote, I do not understand the second scenario. I remember that you talked about a partially completed process. This seems different from a ProcessTemplate. I would say that is a Process that has a status of DRAFT?

cmgrote commented 5 years ago

The scenarios I had in mind were along these lines:

  1. A process that is fully-defined (inputs, outputs and activities in the middle), that does not have variables / placeholders -- everything is "hard-coded" in the process itself. It is executable and does not matter yet if it has been executed or not, as its inputs / outputs are not changed by execution -- its representation (in a lineage sense) is the same.
  2. A process that is fully-defined (inputs, outputs and activities in the middle), but has variables / placeholders that are not defined until it is executed. It is executable but not yet executed. Once executed it may result in different representations (in a lineage sense), as execution may change inputs / outputs based on the values of the variables provided at runtime.
  3. A process that has executed. Irrespective of starting at (1) or (2) it has values for all placeholders by the very nature of it having been executed.

I thought we had discussed that for (2) this would be a ProcessTemplate, and (1) and (3) would be Processes, but maybe I'm mistaken?

(If (2) were a ProcessTemplate then I'd still need to be able to define the ProcessPorts for its inputs / outputs, potentially a hierarchy of Processes (and / or mixed with other ProcessTemplates) for more granular parts of that process, etc.)
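
To make the three scenarios concrete, here is a minimal sketch that tells them apart by two facts only: whether the definition has placeholders and whether it has been executed. The `ProcessDefinition` class, the `scenario` and `execute` functions, and the example names are hypothetical illustrations, not Egeria types.

```python
from dataclasses import dataclass, field


@dataclass
class ProcessDefinition:
    name: str
    placeholders: set = field(default_factory=set)  # variables left open at design time
    bound_values: dict = field(default_factory=dict)  # values supplied at runtime
    executed: bool = False


def scenario(definition):
    """Return 1, 2 or 3, matching the numbered scenarios above."""
    if definition.executed:
        return 3
    return 2 if definition.placeholders else 1


def execute(definition, **values):
    """Return the executed form; every placeholder must receive a value."""
    missing = definition.placeholders - set(values)
    if missing:
        raise ValueError(f"unbound placeholders: {sorted(missing)}")
    return ProcessDefinition(definition.name, set(definition.placeholders), dict(values), True)


hard_coded = ProcessDefinition("copy_customers")
parameterised = ProcessDefinition("copy_table", placeholders={"source", "target"})
finished = execute(parameterised, source="crm.customers", target="dw.customers")
```

In this sketch `scenario(hard_coded)` is 1, `scenario(parameterised)` is 2 and `scenario(finished)` is 3. Note that the same class covers all three, which is the modelling question at stake: the difference is state rather than type.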

cong78 commented 5 years ago

The discussion is probably about the definition of ProcessTemplate and how it is implemented in different types of data engines. I have never worked with DataStage, so I cannot offer much insight on that side. But Apache NiFi has a concept named Dataflow Template, defined like this:

Apache NiFi provides users the ability to build very large and complex DataFlows using NiFi. This is achieved by using the basic components: Processor, Funnel, Input/Output Port, Process Group, and Remote Process Group. These can be thought of as the most basic building blocks for constructing a DataFlow. At times, though, using these small building blocks can become tedious if the same logic needs to be repeated several times. To solve this issue, NiFi provides the concept of a Template. A Template is a way of combining these basic building blocks into larger building blocks. Once a DataFlow has been created, parts of it can be formed into a Template. This Template can then be dragged onto the canvas, or can be exported as an XML file and shared with others. Templates received from others can then be imported into an instance of NiFi and dragged onto the canvas.

https://cwiki.apache.org/confluence/display/NIFI/Example+Dataflow+Templates

Here is an example of how I think we could define ProcessTemplate:

In this picture there are four phases, and phase 2, Create Table, can be exported as a Template. The Create Table Template then becomes a reusable component that anyone with a similar task can import into their ETL or data processing workflows. It is design-oriented, with configuration options: you can change the properties of each processor. For example, for ConvertAvroToJson you can add the name and version of the Avro file that you want to consume. (image) [https://blog.pythian.com/database-migration-using-apache-nifi/]

So from my understanding, there are two definitions: ProcessTemplateSchema (my imagined name) :

For ProcessTemplateSchema, we treat it as a Referenceable in area 0575 ProcessSchema. For ProcessTemplate, we treat it as a supertype of Process with configuration possibilities.

I am not sure if this makes sense, but I find Apache NiFi a good example for thinking it through. @mandy-chessell @cmgrote Please let me know if you have any ideas.

cmgrote commented 5 years ago

I'm wondering if the first thing you're talking about (the ProcessTemplateSchema) is what we should be using as the meaning for a ProcessTemplate -- and the latter is still just a Process.

I would distinguish the two as:

I'm now seeing ProcessTemplate as being some (optional) portion of a Process, but not something that could be a Process on its own. (DataStage has a similar concept in what it calls a "shared container": re-usable block of logic, which could have various sub-blocks of logic, but none of it can be executed until it is put into a job (Process).)

This would mean my 3 scenarios outlined in my previous comment would all be Processes.

mandy-chessell commented 5 years ago

We need to distinguish between the type of something - which is true for its lifetime - and a change in state - i.e. moving from partial to complete. I would think that when a process template is used, it is copied into a process and then the process can be customized?
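
The "copied into a process and then customized" idea could look roughly like this. It is a sketch under stated assumptions: `ProcessTemplate`, `Process`, `instantiate`, the `created_from` field, and the step and property names are hypothetical and are not taken from Egeria or NiFi.

```python
import copy
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class ProcessTemplate:
    name: str
    steps: dict = field(default_factory=dict)  # step name -> configuration properties


@dataclass
class Process:
    name: str
    steps: dict
    created_from: Optional[str] = None  # loose pointer back to the template used


def instantiate(template, process_name, overrides=None):
    """Copy the template's steps into a new Process, then apply customisations."""
    steps = copy.deepcopy(template.steps)  # deep copy: the template is never mutated
    for step, properties in (overrides or {}).items():
        steps.setdefault(step, {}).update(properties)
    return Process(process_name, steps, created_from=template.name)


template = ProcessTemplate("CreateTable", {"ConvertFormat": {"schema_name": None}})
process = instantiate(template, "LoadCustomers", {"ConvertFormat": {"schema_name": "customer"}})
```

Because the steps are deep-copied, later changes to `process.steps` leave the template untouched. The two objects keep their own types for their whole lifetime and are connected only by the `created_from` pointer.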

planetf1 commented 5 years ago

As per Mandy's comment above - templates in NiFi appear to be used more to get developers started, and for export/import and sharing. So they are indeed copied and modified rather than used as a 'module', i.e. reusable logic without modification. Not that different from cut/paste or visual copy, just quicker.

So although we have a set of process templates that is useful to understand and catalog, their relationship to a process is only through the design process. It's a loose coupling ('was used to create'), like in a single git commit. The actual process definition could become completely different - delete all nodes and recreate, for example.

Maybe we therefore have two separate things here

cong78 commented 5 years ago

hi @cmgrote @mandy-chessell @planetf1 thanks for your ideas.

Combining what we have said, a process template is a reusable piece of logic, or a few pieces of sub-logic, from a data engine that helps engineers design a process. It can be copied or imported during process design, and it can be customised depending on the actual purpose of the process. It cannot be executed until it is put into a process (job).

Is this something we can agree on?

cmgrote commented 5 years ago

Is the comment above an accurate reflection of our discussion in Huizen? If not, suggest we need to urgently capture it as I think I've already forgotten 🙁

cong78 commented 5 years ago

Is the comment above an accurate reflection of our discussion in Huizen? If not, suggest we need to urgently capture it as I think I've already forgotten 🙁

Sorry, I should have written down the conclusions during the workshop itself. Here are the points I noted from the discussions and conclusions that we had:

@mandy-chessell Could you please have a final look?

cong78 commented 5 years ago

Hi @mandy-chessell, I created this issue initially thinking we might use the new type for reusable design components in data engines. After the workshop you gave us a slightly different explanation of it. I am just wondering: do we have enough knowledge to finalise this new type, or do we need more valid use cases to justify it?

Otherwise I will set the milestone tag to a later release.

github-actions[bot] commented 4 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed in 20 days if no further activity occurs. Thank you for your contributions.