odpi / egeria-connector-ibm-information-server

IBM Information Server connectors for Egeria: repository proxy connector for IGC, data engine proxy connector for DataStage.
https://odpi.github.io/egeria-connector-ibm-information-server
Apache License 2.0

Data Stage proxy governance rule creation #432

Open AlexCostoiu opened 3 years ago

AlexCostoiu commented 3 years ago

Hello,

I am opening this issue about the Information Governance Rules that are created for synchronization in the IGC source environment. By definition, these rules are meant to govern other assets and to show that those specific assets comply with certain business requirements.

So the proxy is not using this glossary item type as it was intended. Because of this, the Glossary Author role is needed, and glossary items are created in the source catalog (which probably shouldn't happen, as they are not relevant to its metadata). Another concern is the risk that a business user might delete such a rule, in which case the proxy would lose its last synchronization time.

Considering all of the above, is there a plan to record and store the timestamp of the last proxy run for each project on the proxy side or on the target (Egeria) side? Can this be changed?

Best Regards, Alex

cmgrote commented 3 years ago

Hi @AlexCostoiu,

I'm not sure I necessarily agree with your assessment:

> As per their definitions these rules are meant to govern other assets and to show that those specific assets comply with certain business requirements. So this glossary item type is not used as it was meant to be by the proxy.

Metadata about data processing is an asset itself (to compliance teams and potentially other business areas).

  1. It is important to share this information about data processing, to inform end-to-end lineage including systems and other assets that are beyond the boundary of this particular system (IGC / DataStage)
  2. It is important to understand the breadth of information / systems / assets covered
  3. It is important to understand how "fresh" or up-to-date this information is

These governance rules therefore capture (1) that the information is shared, (2) its scope (the projects included), and (3) when it was last shared. By your definition, the rule therefore shows that this specific asset (data processing information) complies with certain business requirements (1, 2, and 3). You might even suggest that there should be a governance policy under which these rules are placed, defining something about the need to provide "end-to-end data lineage".

But of course you can disagree with this assessment. To your question about changing the behaviour: in the simplest terms, the connector code here is provided under an open source license, so you are free to fork it and do whatever you like with it 😉

With regard to changing the current design here in this repository, we'd need to determine what the revised design should be, and then there are various avenues we could pursue to achieve it (pull requests, etc).

Regarding your suggestions above: we have not considered adding such recording in the proxy itself, as the proxy itself is basically stateless (the state is all managed through the repository it proxies -- IGC itself). Adding a separate layer of state management would fundamentally change the operating characteristics of the proxy (requiring additional components and persistent storage to be available at all times the proxy itself is running). I believe this would actually change the behaviour of proxies in Egeria in general, so would be a fairly fundamental change not just to this connector but to Egeria as well.
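As a rough illustration of what "stateless" means here, the following is a minimal sketch and not the connector's actual code. All class and method names (`SyncRule`, `SourceRepository`, `StatelessProxy`, `run`) are hypothetical, and the source repository is replaced by an in-memory stand-in. The only point it shows is that the last synchronization time lives in the source repository rather than in the proxy.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Dict, List, Optional


@dataclass
class SyncRule:
    """Hypothetical stand-in for a governance rule used as a sync marker."""
    name: str
    projects: List[str]
    last_synced: Optional[datetime] = None


@dataclass
class SourceRepository:
    """Hypothetical in-memory stand-in for the source catalog (IGC in this thread)."""
    rules: Dict[str, SyncRule] = field(default_factory=dict)

    def get_rule(self, name: str) -> Optional[SyncRule]:
        return self.rules.get(name)

    def save_rule(self, rule: SyncRule) -> None:
        self.rules[rule.name] = rule

    def delete_rule(self, name: str) -> None:
        self.rules.pop(name, None)


class StatelessProxy:
    """Keeps a reference to the repository only; no sync state of its own."""

    def __init__(self, repo: SourceRepository) -> None:
        self.repo = repo

    def run(self, rule_name: str, projects: List[str], now: datetime) -> Optional[datetime]:
        # Look up the previous sync time from the source repository.
        rule = self.repo.get_rule(rule_name)
        since = rule.last_synced if rule else None
        # (Reading changes since `since` and broadcasting them is omitted here.)
        # Record scope and sync time back in the source repository.
        self.repo.save_rule(SyncRule(rule_name, list(projects), now))
        return since


repo = SourceRepository()
first = datetime(2021, 1, 1, tzinfo=timezone.utc)
second = datetime(2021, 1, 2, tzinfo=timezone.utc)

# A brand-new proxy instance per run still finds the previous sync time.
since_first = StatelessProxy(repo).run("sync-marker", ["project-a"], first)
since_second = StatelessProxy(repo).run("sync-marker", ["project-a"], second)
```

In this sketch, calling `repo.delete_rule("sync-marker")` makes the next `run` return `None` again, which corresponds to the risk raised at the top of this issue: if the rule is deleted in the source, the last synchronization time goes with it.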

I'm also not sure it makes sense to put this state management on the "target" side, as the proxy's responsibility is for reading from the source and broadcasting outwards, not how that broadcast is ultimately distributed to one (or potentially many) targets. With many potential targets, such an approach would quickly enter the territory of distributed systems challenges like quorums, split brain handling, etc.

I would suggest that both approaches would therefore add significant complexity beyond the current implementation.

So perhaps it is better to first consider general options on approaches: