Data Stage connector support for high level lineage

lpalashevski commented 3 years ago

Latest changes in DE OMAS introduce support for high level (asset) lineage mappings. Sample request:

{
    "processes": [
        {
            "qualifiedName": "(process)=CopyColumnsFlow44",
            "displayName": "CopyColumnsFlow44",
            "name": "CopyColumnsFlow44",
            "description": "CopyColumnsFlow describes high level process input and output and mappings between (sub)processes (if any).",
            "owner": "Platform User",
            "lineageMappings": [
                {
                    "sourceAttribute": "(host)=HOST::(data_file_folder)=home::(data_file_folder)=files::(data_file)=names2.csv",
                    "targetAttribute": "(process)=CopyColumnsFlow44"
                },
                {
                    "sourceAttribute": "(process)=CopyColumnsFlow44",
                    "targetAttribute": "(host)=HOST::(data_file_folder)=home::(data_file_folder)=files::(data_file)=emplname4.csv"
                }
            ],
            "updateSemantic": "REPLACE"
        }
    ],
    "externalSourceName": "(organization)=MyCompany::(project)=DataPlatform"
}

We need to investigate the possibility of adding this capability to the Data Stage connector as well.
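
To make the shape of the sample request concrete, here is a minimal Python sketch of how such a payload could be assembled from a process and its input and output assets. It is only an illustration: `build_job_level_process` and its parameters are hypothetical names, not part of DE OMAS or the connector, and only fields that appear in the sample are emitted.

```python
import json


def build_job_level_process(qualified_name, name, inputs, outputs, owner=None):
    """Build one process entry with asset-level lineage mappings.

    Hypothetical helper: each input asset maps *to* the process and the
    process maps *to* each output asset, mirroring the sample request.
    """
    mappings = [
        {"sourceAttribute": source, "targetAttribute": qualified_name}
        for source in inputs
    ]
    mappings += [
        {"sourceAttribute": qualified_name, "targetAttribute": target}
        for target in outputs
    ]
    process = {
        "qualifiedName": qualified_name,
        "displayName": name,
        "name": name,
        "lineageMappings": mappings,
        "updateSemantic": "REPLACE",
    }
    if owner is not None:
        process["owner"] = owner
    return process


# Values copied from the sample request.
FOLDER = "(host)=HOST::(data_file_folder)=home::(data_file_folder)=files"
process = build_job_level_process(
    "(process)=CopyColumnsFlow44",
    "CopyColumnsFlow44",
    inputs=[FOLDER + "::(data_file)=names2.csv"],
    outputs=[FOLDER + "::(data_file)=emplname4.csv"],
    owner="Platform User",
)
request = {
    "processes": [process],
    "externalSourceName": "(organization)=MyCompany::(project)=DataPlatform",
}
print(json.dumps(request, indent=4))
```

The point of the sketch is that no ports or schema attributes are involved: the direction of each mapping is carried entirely by which side the process's qualified name appears on.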

cmgrote commented 3 years ago

Looking through the logic of the existing connector, I'm a bit wary of creating an entirely new connector as I suspect that much of the code will overlap (the queries to retrieve the jobs, track the last sync timestamp, configuration options for how much to retrieve, which jobs to include, constructing unique names for the various objects, choosing the attributes of each object to query and retrieve, how these are mapped across to the Egeria OMAS payloads, etc, etc).

I suspect the simplest way to ensure that this overlap can be maintained just once will be to add some configuration options to the existing connector to decide in which "mode" to run it (granular or high-level). This should still allow multiple instances of the connector to be run with different configurations, just as if they were two separate connectors... But it would mean that we don't have the complexity of trying to maintain two separate code areas that have a significant amount of overlap (inevitably causing maintenance headaches and regressions).

We could also create a third module which contains the overlapping pieces, but this is likely to be the most work in the near-term as it will mean creating a new connector, creating the new module containing common code, and then modifying the existing connector to use this new module as well.

So I'd suggest we simply add a "mode" configuration option to the existing connector as a compromise? Keeps maintenance relatively simple, and means we don't have significant up-front work as well. (We can probably limit the complexity of conditional logic by wrapping the "mode"s up into different top-level methods in the connector class, and then just call that top-level method depending on the mode in which it has been configured.)

lpalashevski commented 3 years ago

Agreed, the proper way will take far more effort and time. So +1 for adding a 'mode' configuration parameter; as always, if it is not set we default to the current processing mode.
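
A rough sketch of the approach agreed here: one connector, a "mode" setting that falls back to the current behaviour when unset, and one top-level method per mode. It is written in Python purely for brevity (the connector itself is Java), and every name in it (`Mode`, `parse_mode`, `ConnectorSketch`, the `_sync_*` methods) is hypothetical. The mode values `GRANULAR` and `JOB_LEVEL` are the ones used later in this thread.

```python
from enum import Enum


class Mode(Enum):
    GRANULAR = "GRANULAR"    # current, stage-level processing
    JOB_LEVEL = "JOB_LEVEL"  # new, high-level (asset) lineage


def parse_mode(configuration_properties):
    """Read the 'mode' option, defaulting to the current behaviour."""
    raw = (configuration_properties or {}).get("mode")
    if raw is None:
        return Mode.GRANULAR
    return Mode(raw)  # raises ValueError for an unrecognised mode


class ConnectorSketch:
    """Hypothetical stand-in for the connector class."""

    def __init__(self, configuration_properties=None):
        self.mode = parse_mode(configuration_properties)

    def sync(self, job_names):
        # The only conditional: pick the top-level method for the mode.
        if self.mode is Mode.JOB_LEVEL:
            return self._sync_job_level(job_names)
        return self._sync_granular(job_names)

    def _sync_granular(self, job_names):
        # Placeholder for the existing stage-level logic.
        return [f"granular:{name}" for name in job_names]

    def _sync_job_level(self, job_names):
        # Placeholder for the new job-level logic.
        return [f"job-level:{name}" for name in job_names]


print(ConnectorSketch().sync(["flow1"]))
print(ConnectorSketch({"mode": "JOB_LEVEL"}).sync(["flow1"]))
```

Keeping the branch in a single place means the shared code (job queries, sync timestamps, name construction) stays common, while each mode's payload logic lives in its own method.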

cmgrote commented 3 years ago

Forgot to ask, but what is the plan for these aspects in this new "job-level" mode?

Sequences: do we still represent these as a higher-level abstraction on top of the jobs?
Ports:
- do we create new PortImplementations that define the inputs / outputs for the jobs (which were previously PortAliases pointing down to the underlying PortImplementation details of individual stages)?
- do we still create PortAliases for the sequence level, that would presumably now refer to the new PortImplementations (above)?
- or do we entirely leave Ports out of this mode of operation (since they really need SchemaTypes anyway, which we won't have)? And if we go this route, do we just use sequences as part of a process hierarchy, but leave out any PortAliases?

lpalashevski commented 3 years ago

I am in favour of the most minimal option, which includes neither Sequences (we do not do anything at this level in DE right now, right @popa-raluca?) nor Ports. I think we should leave them out because they will not be used.

My reasoning: in the new "job-level" mode we do not want the implementation-level details of the process (what the stages of a job are and how they interconnect at schema level), which makes PortImplementations and their related schemas unnecessary.

Similarly for PortAliases, I do not think we need them without high-level process-to-process mappings like job-to-sequence or job-to-job (I have never seen job-to-job in Data Stage, btw).

I suggest we start by outputting the minimal set, as in the sample request above. (We can always add sequence-level details later if we find there is a use case for them.)

popa-raluca commented 3 years ago

Right now DE creates the sequence level processes. They get propagated through AL and stored in OLS, but they are not used in the querying part. I don't think sequences are needed for the new "job-level" mode right now.

cmgrote commented 3 years ago

Per the auto-link above, I've added the initial logic to hopefully allow this mode to be configured. For flow1 of our minimal sample, this produces:

{
    "qualifiedName": "_(host)=INFOSVR::(transformation_project)=minimal::(dsjob)=flow1",
    "displayName": "flow1",
    "description": "",
    "owner": "Administrator IIS",
    "name": "flow1",
    "lineageMappings":
    [
        {
            "sourceAttribute": "_(host)=INFOSVR::(transformation_project)=minimal::(dsjob)=flow1",
            "targetAttribute": "(host)=INFOSVR::(data_connection)=MINIMAL::(database_schema)=db2inst1::(database_table)=EMPLNAME"
        },
        {
            "sourceAttribute": "(host)=INFOSVR::(data_file_folder)=/::(data_file_folder)=data::(data_file_folder)=files::(data_file_folder)=minimal::(data_file)=names.csv::(data_file_record)=names",
            "targetAttribute": "_(host)=INFOSVR::(transformation_project)=minimal::(dsjob)=flow1"
        }
    ],
    "collection":
    {
        "qualifiedName": "(host)=INFOSVR::(transformation_project)=minimal",
        "name": "minimal"
    },
    "updateSemantic": "REPLACE"
}
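
As a reading aid for the payload above (not code from the connector), the following Python sketch separates a job's inputs from its outputs by checking on which side of each mapping the job's own qualified name appears. `split_mappings` is a hypothetical helper, and the `flow1` dictionary reproduces only the fields it needs from the output.

```python
def split_mappings(process):
    """Return (inputs, outputs) for a job-level process payload."""
    me = process["qualifiedName"]
    inputs, outputs = [], []
    for mapping in process["lineageMappings"]:
        if mapping["targetAttribute"] == me:
            # Something flows into the job.
            inputs.append(mapping["sourceAttribute"])
        elif mapping["sourceAttribute"] == me:
            # The job writes to something.
            outputs.append(mapping["targetAttribute"])
    return inputs, outputs


JOB = "_(host)=INFOSVR::(transformation_project)=minimal::(dsjob)=flow1"
TABLE = (
    "(host)=INFOSVR::(data_connection)=MINIMAL"
    "::(database_schema)=db2inst1::(database_table)=EMPLNAME"
)
RECORD = (
    "(host)=INFOSVR::(data_file_folder)=/::(data_file_folder)=data"
    "::(data_file_folder)=files::(data_file_folder)=minimal"
    "::(data_file)=names.csv::(data_file_record)=names"
)
flow1 = {
    "qualifiedName": JOB,
    "lineageMappings": [
        {"sourceAttribute": JOB, "targetAttribute": TABLE},
        {"sourceAttribute": RECORD, "targetAttribute": JOB},
    ],
}

inputs, outputs = split_mappings(flow1)
print("inputs:", inputs)
print("outputs:", outputs)
```

For flow1 this yields the `names` file record as the single input and the `EMPLNAME` table as the single output.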

It can be configured as part of the overall connector configuration using the mode setting of JOB_LEVEL (if left out of the configuration, this defaults to the GRANULAR setting which should give the original behaviour still):

{
    "class": "DataEngineProxyConfig",
    "accessServiceRootURL": "{{baseURL}}",
    "accessServiceServerName": "omas",
    "eventsClientEnabled": true,
    "dataEngineConnection": {
        "class": "Connection",
        "connectorType": {
            "class": "ConnectorType",
            "connectorProviderClassName": "org.odpi.egeria.connectors.ibm.datastage.dataengineconnector.DataStageConnectorProvider"
        },
        "endpoint": {
            "class": "Endpoint",
            "address": "{{igc_host}}:{{igc_port}}",
            "protocol": "https"
        },
        "userId": "{{igc_user}}",
        "clearPassword": "{{igc_password}}",
        "configurationProperties": {
            "mode": "JOB_LEVEL",
            "limitToProjects": [ "minimal" ]
        }
    },
    "pollIntervalInSeconds": 60
}

I've tried to test on my end but am getting various errors back that seem to be related to events processing in the OMAS (whether using the original behaviour or this new high-level lineage), so I'm presumably not using the latest configuration or something (not sure). Would be great if you can test further and see if it needs further revision?

lpalashevski commented 3 years ago

> I've tried to test on my end but am getting various errors back that seem to be related to events processing in the OMAS (whether using the original behaviour or this new high-level lineage), so I'm presumably not using the latest configuration or something (not sure). Would be great if you can test further and see if it needs further revision?

On it. We are going to start testing with some of the samples we have in our test environments, using the latest Egeria core. The output above looks as expected, and I am looking forward to seeing how it goes with real data. Will keep you posted.

odpi / egeria-connector-ibm-information-server

Data Stage connector support for high level lineage #550