odpi / egeria

Egeria core
https://egeria-project.org
Apache License 2.0

Smart repository proxy #5402

Open mandy-chessell opened 3 years ago

mandy-chessell commented 3 years ago

The smart repository proxy enhances a third-party metadata repository that does not support reference copies to allow:

In addition, the caching capability of the smart repository proxy can be used to improve the performance of the metadata repository's communication with the cohort.


Background reading:

mandy-chessell commented 3 years ago

Scenario

An organization has been using a third-party metadata repository for some time. The repository operates only in standalone mode, but it does support export/import. The organization has installed one instance of the repository for its governance team to use. This is where common definitions (glossary terms, policies, etc.) are developed.

These definitions are then loaded, using the export/import mechanism, into the instances of the repository located in each of the three business units. There is a copy of the common definitions in each repository. This way each part of the business uses the same common definitions.

image

The organization then decides to use Egeria to connect the business unit metadata repositories together to share metadata.

image

All would be well if the third-party metadata repository supported reference copies, but it does not. It treats the common definitions it has imported in the same way as any metadata defined through its UI. In Egeria terminology, all elements stored in its repository are considered part of its home metadata collection. When it shares the common definitions across the cohort, it identifies them as belonging to its metadata collection.
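To make the home versus reference copy distinction concrete, here is a minimal sketch. The `InstanceHeader` class, its field names and the collection ids are hypothetical simplifications for illustration, not Egeria's actual types.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class InstanceHeader:
    """Hypothetical, simplified stand-in for an open metadata instance header."""
    guid: str
    metadata_collection_id: str  # identifies the originating (home) repository


def is_reference_copy(header: InstanceHeader, local_collection_id: str) -> bool:
    """An instance is a reference copy when its home is some other collection."""
    return header.metadata_collection_id != local_collection_id


# The business unit repository stamps its own id on an imported glossary term,
# so the term looks like home metadata even though the governance team wrote it.
imported_term = InstanceHeader(guid="term-1", metadata_collection_id="bu1-collection")
looks_like_reference = is_reference_copy(imported_term, "bu1-collection")
```

In this sketch `looks_like_reference` is `False`, which is the problem: the cohort is told that the business unit owns the term.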

One of two things can happen when each repository shares its copy of the common definitions with the other members of the cohort - depending on how the unique identifiers (GUIDs) were assigned to the common definitions in each repository when they were imported.

It is the second situation that the smart repository proxy is aiming to support.

mandy-chessell commented 3 years ago

The smart repository proxy replaces the current repository proxy. It maintains a list of the metadata elements stored in its third-party metadata repository that should be treated as reference copies (i.e., as belonging to another metadata collection). It then performs the following services:

mandy-chessell commented 3 years ago

The smart repository proxy maintains its list of reference copies in an OMRS repository. This is supplied as a connection object passed to the repository proxy at startup. Its contents can be bootstrapped from an open metadata archive that is also loaded when the repository proxy starts. The list may be augmented with other reference copies that are received from the cohort and then passed on to the third-party metadata repository.
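A rough sketch of how such a list could be bootstrapped and then augmented is shown below. It uses a plain dictionary in place of a real OMRS repository, and the class, method and key names are all assumptions made for illustration.

```python
class ReferenceCopyList:
    """Hypothetical in-memory stand-in for the proxy's reference copy store."""

    def __init__(self):
        self._home_by_guid = {}  # guid -> metadata collection id of the true home

    def load_archive(self, archive_instances):
        """Bootstrap the list from archive content when the proxy starts."""
        for instance in archive_instances:
            self._home_by_guid[instance["guid"]] = instance["metadataCollectionId"]

    def record_from_cohort(self, instance, local_collection_id):
        """Remember an instance received from the cohort before passing it on."""
        if instance["metadataCollectionId"] != local_collection_id:
            self._home_by_guid[instance["guid"]] = instance["metadataCollectionId"]

    def home_of(self, guid):
        """Return the true home collection id, or None for home metadata."""
        return self._home_by_guid.get(guid)


reference_copies = ReferenceCopyList()
reference_copies.load_archive([{"guid": "term-1", "metadataCollectionId": "governance"}])
reference_copies.record_from_cohort({"guid": "asset-9", "metadataCollectionId": "bu2"}, "bu1")
```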

mandy-chessell commented 3 years ago

Returning to the scenario at the top of this issue...

The simplest way to create the list of reference copy instances is by creating a metadata archive for the common definitions. This can be created by:

image

The open metadata archive is then used to populate an embedded in-memory repository whenever the smart repository proxy is started. The common definition instances from the archive provide the list of instances that the smart repository proxy monitors for, and uses, when communicating with the cohort. The embedded in-memory repository acts as the cache of these metadata instances.

image

mandy-chessell commented 3 years ago

If the organization wants to store metadata from other cohort members in the third-party metadata repository, it uses a persistent repository connector for the cache so it can keep track of all reference copies that it is dynamically storing in the third-party metadata repository. Open metadata archives can still be used to represent content that is logically a set of reference copies but was added to the third-party metadata repository via other mechanisms. They can also be used to load new common definitions into the third-party metadata repository.

image

mandy-chessell commented 3 years ago

The smart repository proxy may use its store to assemble open metadata elements together before storing them in the third-party metadata repository if the third-party metadata repository has coarser-grained elements.

mandy-chessell commented 3 years ago

Implementation

The smart repository proxy can be implemented as two new connectors that run in the existing repository proxy OMAG Server.

There are two specialist connectors configured in the repository proxy that are responsible for communication with the third-party repository:

image

The smart repository proxy runs two additional connectors that wrap the third party ones.

image

These connectors (shown in grey) are completely generic and can run with any third party metadata repository connectors.
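The wrapping idea can be sketched as a simple decorator around the repository connector. Both classes below, the `get_entity` method and the dictionary layout are hypothetical stand-ins rather than the OMRS connector interfaces.

```python
class ThirdPartyConnector:
    """Hypothetical connector for a repository that claims every instance as its own."""

    def __init__(self, collection_id, instances):
        self.collection_id = collection_id
        self._instances = instances  # guid -> properties

    def get_entity(self, guid):
        return {"guid": guid,
                "metadataCollectionId": self.collection_id,
                "properties": self._instances[guid]}


class SmartRepositoryWrapper:
    """Wraps any connector exposing get_entity and corrects reference copies."""

    def __init__(self, wrapped, reference_cache):
        self._wrapped = wrapped
        self._cache = reference_cache  # guid -> correct version of the instance

    def get_entity(self, guid):
        cached = self._cache.get(guid)
        if cached is not None:
            return cached  # serve the reference copy with its true home
        return self._wrapped.get_entity(guid)  # home metadata passes through


third_party = ThirdPartyConnector(
    "bu1", {"term-1": {"name": "Customer"}, "asset-1": {"name": "Sales DB"}})
cache = {"term-1": {"guid": "term-1",
                    "metadataCollectionId": "governance",
                    "properties": {"name": "Customer"}}}
proxy = SmartRepositoryWrapper(third_party, cache)
```

Because the wrapper only relies on the wrapped object's method, the same wrapper would work in front of any connector with that shape, which is the sense in which the grey connectors are generic.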

mandy-chessell commented 3 years ago

Administration

Although the Smart Repository Proxy's connectors require no changes to the Egeria runtime to operate, it would help users if the admin services were enhanced in two ways: to build the nested connection objects required to configure the nested connectors used in the smart repository proxy, and to set up the configuration properties that control the behaviour of the Smart Repository Proxy's connectors.

cmgrote commented 3 years ago

All looks great -- just a question on this statement:

Requests to delete metadata are passed to both the Third Party Metadata Repository Connector and the cache.

Deletes, or purges? I would have thought a delete would be a protocol violation, as neither connector is the home repository and therefore should not be able to handle a soft-delete of the instance (?)

I was expecting only a purgeReferenceCopy would be handled against the cache, though perhaps this would need to be passed onwards as a purgeInstance call against the Third Party Metadata Repository Connector (?)

mandy-chessell commented 3 years ago

Good question:

There are three forms of delete in the events

In all three cases, reference copies are removed from both repositories.

However, you are right that a delete or purge of a reference copy through the API is a protocol violation.
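A minimal sketch of that distinction follows, assuming two plain dictionaries for the third-party repository and the cache. The class, method and exception names are hypothetical, and the three event forms are collapsed into a single handler.

```python
class ProtocolViolation(Exception):
    """Raised when a caller tries to change metadata it does not own."""


class SmartProxyStores:
    def __init__(self):
        self.third_party = {}  # guid -> instance held by the third-party repository
        self.cache = {}        # guid -> reference copy held by the proxy's cache

    def on_cohort_delete_event(self, guid):
        """A delete event from the cohort removes the copy from both stores."""
        self.third_party.pop(guid, None)
        self.cache.pop(guid, None)

    def delete_via_api(self, guid):
        """A local API delete is only allowed for home metadata."""
        if guid in self.cache:
            raise ProtocolViolation(
                f"{guid} is a reference copy; only its home repository may delete it")
        self.third_party.pop(guid, None)
```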

lenawoolf commented 3 years ago

@mandy-chessell Mandy, I am thinking about the scenario of three third-party repositories containing metadata instances loaded with the same import script, and therefore the same GUIDs. Each third-party repository therefore contains a subset of metadata that it owns and a subset that it was given, with fixed GUIDs. What approach would a company use to identify which one of the three repositories should have its subset of common GUIDs maintained as "owned"? In other words, how does one select which repository should own the common set rather than keep a reference? I suppose it would come back to the reasons all three were loaded with the same import. Maybe one is a Dev environment, the second is Test, and the third is Prod? In that case, should they even be part of the same cohort? Or, most likely, one is a head office (or governance repository) and the other two are subsidiaries.

You mention that the "governance third party metadata repository" would not be connected to the cohort and can be used to build the metadata archive. What happens when new instances are added to the governance third-party repository? How do the other reference repositories have their caches updated?

mandy-chessell commented 3 years ago

The rule does not change - the owner of the metadata is the originator. This is reflected in the metadata collection id in the header of the instance. In the scenario described above, the governance metadata repository is the owner of the common definitions since its metadata collection id is in the header of each element in the archive.

Until the governance metadata repository joins the cohort, no member of the cohort can change the content, since all copies imported via the archive are reference copies. Updates to the common definitions from the governance metadata repository are introduced through a new archive/import file.

If/when the governance metadata repository joins the cohort, it continues to be the owner of the common definitions. The difference is that changes to its instances are distributed to the other cohort members immediately through the cohort mechanisms.

If the governance metadata repository is to be decommissioned, then it is possible to change ownership of its elements using the rehome commands. The metadata collection id is set to the repository that is taking over responsibility for maintaining the common definitions.

This is all standard cohort operation that was defined in the original OMRS spec. All the smart repository proxy adds is mitigation when the third party repository does not support reference copies.
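The ownership rule and the effect of rehoming described above can be sketched as follows. The function names and dictionary keys are hypothetical and are not the OMRS rehome commands themselves.

```python
def can_update(instance, caller_collection_id):
    """Only the home repository, named in the instance header, may change an instance."""
    return instance["metadataCollectionId"] == caller_collection_id


def rehome(instance, new_home_collection_id):
    """Return a copy of the instance whose ownership has moved to a new repository."""
    return {**instance, "metadataCollectionId": new_home_collection_id}


term = {"guid": "term-1", "metadataCollectionId": "governance"}
rehomed_term = rehome(term, "bu1")  # e.g. governance repository decommissioned
```

Before rehoming, only the governance repository passes the `can_update` check; afterwards only the repository taking over does.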

lenawoolf commented 3 years ago

Thank you Mandy @mandy-chessell. I am trying to visualize the final deployment topology with the Smart Proxy in place. The initial setup, as per https://github.com/odpi/egeria/issues/5402#issuecomment-871526274, was four systems: one "governance" system importing common definitions into three downstream systems. The issue was that it was impossible to connect all four systems via a single Egeria cohort, and also impossible to connect the three downstream systems (without the governance system) together. Solution recap: export the common definitions from the "governance" repository, put Smart Proxies in front of the three downstream systems, and prime the proxies with the exported data. Now those three downstream systems can talk to each other via Egeria. Updates of the common definitions in the proxies are done via exports. If the "governance" system wants to join the cohort as well, all works as normal, but only if the Smart Proxy recognizes the "governance" system as the owner of the reference copies.

I wonder if there is a way to indicate to all the Smart Proxies that a particular member of the cohort is the "owner" of the duplicate GUIDs found in the source systems, and that these should therefore be treated as "reference" copies everywhere. That way we might avoid the archive import process to prime the proxies altogether.

mandy-chessell commented 3 years ago

The archive import is precisely that. It acts as the mechanism that tells the smart repository proxies which instances in their third-party metadata repository have incorrect metadata collection ids, and what the metadata collection ids should be, so that they can communicate with the other cohort members in a compliant way.

The third party metadata repository is providing incorrect metadata collection ids in its instances because it does not support reference copies and so has not stored what the metadata collection id should be.

The other aspect of reference copies that needs to be supported is that they should not be changed by the third-party metadata repositories. Having complete instances in the archive means that the smart repository proxy can send out the completely correct version of the reference copy to the cohort. If there is an event mapper, the smart repository proxy can detect updates to these reference copy instances and record the violation in the audit log. If the repository connector is sophisticated enough, it can restore the correct values in the third-party metadata repository as well.
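A small sketch of that detect-and-correct step is shown below, assuming the cache holds the complete correct instance and the audit log is a simple list. The function name, dictionary keys and log message are hypothetical.

```python
def check_reference_copy(observed, cache, audit_log):
    """Compare an instance reported by the third-party repository with the cache.

    Returns the version that should be sent to the cohort.
    """
    expected = cache.get(observed["guid"])
    if expected is None:
        return observed  # home metadata: nothing to correct
    if observed["properties"] != expected["properties"]:
        # The third-party repository changed a reference copy it does not own.
        audit_log.append(f"reference copy {observed['guid']} was modified locally")
    return expected  # always send out the correct version of a reference copy


cache = {"term-1": {"guid": "term-1",
                    "metadataCollectionId": "governance",
                    "properties": {"name": "Customer"}}}
audit_log = []
changed = {"guid": "term-1",
           "metadataCollectionId": "bu1",
           "properties": {"name": "Client"}}
to_send = check_reference_copy(changed, cache, audit_log)
```

Restoring the correct values in the third-party repository would be an extra step that depends on what the wrapped connector can do, so it is left out of this sketch.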

The archive is only needed when the common definitions are being maintained by imports. The smart repository proxy can dynamically build the list of reference copies coming from other members of the cohort. For that it needs a persistent repository. If this persistent repository supports history, then the smart repository proxy can support historical queries on behalf of its third-party metadata repository as well.
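As an illustration of what history support buys, here is a minimal sketch of an "as of" lookup over stored versions. The `VersionedStore` class and its integer timestamps are assumptions for the example, not a description of any particular repository connector.

```python
import bisect


class VersionedStore:
    """Hypothetical store that keeps every version of each instance."""

    def __init__(self):
        self._versions = {}  # guid -> list of (timestamp, instance), oldest first

    def save(self, guid, timestamp, instance):
        versions = self._versions.setdefault(guid, [])
        versions.append((timestamp, instance))
        versions.sort(key=lambda version: version[0])

    def as_of(self, guid, timestamp):
        """Return the version that was current at the given time, or None."""
        versions = self._versions.get(guid, [])
        times = [saved_at for saved_at, _ in versions]
        index = bisect.bisect_right(times, timestamp)
        return versions[index - 1][1] if index else None


store = VersionedStore()
store.save("term-1", 10, {"name": "Customer"})
store.save("term-1", 20, {"name": "Customer (retail)"})
```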

With the smart proxy, we can take a repository, such as Apache Atlas, that does not support reference copies or historical queries and enable a compliant two-way exchange of metadata with it.

github-actions[bot] commented 2 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed in 20 days if no further activity occurs. Thank you for your contributions.