Smart repository proxy

mandy-chessell commented 3 years ago

The smart repository proxy enhances a third-party metadata repository that does not support reference copies to allow:

- Reference copies from other metadata sources to be stored in the third-party metadata repository.
- Metadata that has been imported into the third-party metadata repository using proprietary mechanisms to be shared on the cohort, even when other cohort members have the same imported metadata.

In addition, the caching capability of the smart repository proxy can be used to improve the performance of the metadata repository's communication with the cohort.

Background reading:

The repository proxy - https://egeria.odpi.org/open-metadata-implementation/admin-services/docs/concepts/repository-proxy.html
Metadata provenance - https://egeria-project.org/open-metadata-publication/website/metadata-provenance/
Open metadata archives - https://egeria.odpi.org/open-metadata-resources/open-metadata-archives/

mandy-chessell commented 3 years ago

Scenario

An organization has been using a third-party metadata repository for some time. The repository only operates in standalone mode; however, it does support export/import. The organization has installed one instance of the repository for its governance team to use. This is where common definitions (glossary terms, policies etc) are developed.

These definitions are then loaded into the instances of the repository that are located in each of the 3 business units using the export/import mechanism. There is a copy of the common definitions in each repository. This way each part of the business uses the same common definitions.

The organization then decides to use Egeria to connect the business unit metadata repositories together to share metadata.

All would be well if the third-party metadata repository supported reference copies, but it does not. It treats the common definitions it has imported in the same way as any metadata defined through its UI. In Egeria terminology, all elements stored in its repository are considered part of its home metadata collection. When it shares the common definitions across the cohort, it identifies them as belonging to its metadata collection.

One of two things can happen when each repository shares its copy of the common definitions with the other members of the cohort - depending on how the unique identifiers (GUIDs) were assigned to the common definitions in each repository when they were imported.

- Each repository assigned its own unique identifier to each common definition. This is a less common situation because it makes it harder to update the imported metadata with a new version if the unique identifiers change. However, if this approach is used, the result is three copies of each of the common definitions stored in each repository. This needs to be handled through Egeria's deduplication support, which links the copies together so that they can be retrieved as if they were one entity.
- The more common situation is where the unique identifiers from the originating metadata repository are used for the copies of the common definitions in each of the business unit instances of the metadata repository. This results in a protocol error being reported by the cohort because multiple repositories are claiming ownership of the same element. In an ideal world, the common definitions would have the metadata collection id of the governance team's metadata repository (irrespective of whether the governance repository instance ever joins the cohort or not). This ensures that none of the business unit repository instances updates the common definitions.

It is the second situation that the smart repository proxy is aiming to support.
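To make the second situation concrete, here is a minimal sketch of how the conflict arises. It is only an illustration, not the cohort's actual detection logic: the function name, the GUIDs and the metadata collection ids (`bu-1`, `bu-2`, `bu-3`) are all made up for the example.

```python
from collections import defaultdict


def find_ownership_conflicts(announcements):
    """Return the GUIDs that more than one metadata collection claims as home.

    `announcements` is an iterable of (guid, metadata_collection_id) pairs,
    a hypothetical simplification of what each member shares with the cohort.
    """
    owners = defaultdict(set)
    for guid, collection_id in announcements:
        owners[guid].add(collection_id)
    # A GUID with more than one claimed home collection is a conflict.
    return {guid: sorted(ids) for guid, ids in owners.items() if len(ids) > 1}


# Each business unit repository announces the imported glossary term as its own.
announcements = [
    ("term-0001", "bu-1"),
    ("term-0001", "bu-2"),
    ("term-0001", "bu-3"),
    ("asset-42", "bu-2"),  # genuinely homed in one repository, so no conflict
]

conflicts = find_ownership_conflicts(announcements)
print(conflicts)
```

If every copy of `term-0001` carried the governance repository's metadata collection id instead, no GUID would have more than one claimed home and the conflict would disappear.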

mandy-chessell commented 3 years ago

The smart repository proxy replaces the current repository proxy. It maintains a list of the metadata elements stored in its third-party metadata repository that should be treated as reference copies (i.e. as belonging to another metadata collection). It then performs the following services:

Monitoring for changes to the copies of these metadata elements in the third party metadata repository. When a change occurs, the smart repository proxy is informed by an event and it:
- Creates an audit log message to report the violation, since these values should not change.
- If enabled, it restores the copy back to its proper values in the third party metadata repository.
Whenever the metadata elements that should be reference copies are requested from the cohort either via the event mechanism or via a federated query, the smart repository proxy provides the official copy of the element with the correct header (metadata collection id, version, times etc). The copy in the third party metadata repository is never passed to the cohort.

mandy-chessell commented 3 years ago

The smart repository proxy maintains its list of reference copies in an OMRS repository. This is supplied as a connection object passed to the repository proxy at start up. Its contents can be bootstrapped from an open metadata archive that is also loaded on start up of the repository proxy. It may be augmented with other reference copies that are received from the cohort and are then passed on to the third party metadata repository.

mandy-chessell commented 3 years ago

Returning to the scenario at the top of this issue...

The simplest way to create the list of reference copy instances is by creating a metadata archive for the common definitions. This can be created by:

- First connecting a repository proxy to the governance third party metadata repository. This repository proxy is not connected to the cohort - it is used to provide a repository services REST API for the third-party metadata repository. The configuration for this repository proxy defines the metadata collection id for the governance third party metadata repository. This value should always be used for this repository, both when creating open metadata archives and if/when this repository eventually joins the cohort.
- Writing a bespoke archive writer utility that queries the common definitions through the repository proxy connected to the governance metadata repository using the repository services REST API, and then uses the repository services archive utilities to store the results in an open metadata archive file. This archive is of type METADATA_EXPORT and all of the instances within it have the metadata collection id of the governance third-party metadata repository.

The open metadata archive is then used to populate an embedded in-memory repository whenever the smart repository proxy is started. The list of common definition instances from the archive provides the list of instances that the smart repository proxy will monitor and use when communicating with the cohort. The embedded in-memory repository acts as the cache of these metadata instances.
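As a rough sketch of the bootstrapping step, the archive contents can be indexed by GUID to form the cache. The dictionary layout below is a hypothetical simplification and is not the real open metadata archive file format; the field names, GUIDs and the `governance` collection id are assumptions for illustration only.

```python
def load_reference_copies(archive):
    """Index the instances of a (simplified, hypothetical) archive by GUID."""
    cache = {}
    for instance in archive["instances"]:
        cache[instance["guid"]] = instance
    return cache


# Hypothetical stand-in for the content of a METADATA_EXPORT archive.
archive = {
    "archive_type": "METADATA_EXPORT",
    "instances": [
        {
            "guid": "term-0001",
            "metadata_collection_id": "governance",
            "version": 3,
            "properties": {"name": "Customer"},
        },
        {
            "guid": "policy-0007",
            "metadata_collection_id": "governance",
            "version": 1,
            "properties": {"name": "Data retention"},
        },
    ],
}

cache = load_reference_copies(archive)

# The keys of the cache are the instances to monitor; the values are the
# official copies to use when communicating with the cohort.
print(sorted(cache))
```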

mandy-chessell commented 3 years ago

If the organization wants to store metadata from other cohort members in the third-party metadata repository, it uses a persistent repository connector for the cache so it can keep track of all reference copies that it is dynamically storing in the third-party metadata repository. Open metadata archives can still be used to represent content that is logically a set of reference copies but was added to the third party metadata repository via other mechanisms. They can also be used to load new common definitions into the third-party metadata repository.

mandy-chessell commented 3 years ago

The smart repository proxy may use its store to assemble open metadata elements together before storing them in the third party metadata repository if the third party metadata repository has coarser-grained elements.

mandy-chessell commented 3 years ago

Implementation

The smart repository proxy can be implemented as two new connectors that run in the existing repository proxy OMAG Server.

There are two specialist connectors configured in the repository proxy that are responsible for communication with the third-party repository:

The repository connector acts as a wrapper around the third-party metadata repository's proprietary API.
The event mapper intercepts events from the third-party metadata repository whenever its metadata changes and converts them into OMRS events to send on the cohort.

The smart repository proxy runs two additional connectors that wrap the third party ones.

These connectors (shown in grey) are completely generic and can run with any third party metadata repository connectors.

The Smart Proxy Repository Connector wraps the Repository Connector for the third party metadata repository and an Egeria Repository Connector that holds the list of reference copy instances (labelled cache in the diagram). It intercepts requests for metadata instances from the cohort.
- Request to retrieve specific instances by GUID: It first tries the cache and returns the instance(s) if they are found there. If not found in the cache, it passes the request on to the Third Party Metadata Repository Connector and returns the results.
- Other retrieval requests are passed to the Third Party Metadata Repository Connector. Any results returned are then scrutinized for matches with the elements in the cache - the cache versions replace the ones from third party metadata repository when a match is found. The modified results are passed to the cohort.
- Requests to create new metadata are passed to the Third Party Metadata Repository Connector
- Requests to update metadata are checked to ensure the instance is not in the cache and then are passed to the Third Party Metadata Repository Connector. If they are found in the cache then a protocol violation is reported.
- Requests to delete metadata elements via events are passed to both the Third Party Metadata Repository Connector and the cache. (Requests to delete reference copies via the API are a protocol violation - see comments below for clarification.)
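The routing rules in the list above could be sketched roughly as follows. This is a simplified illustration, not the Egeria connector API: the class and method names, the `InMemoryRepository` stand-in and the sample instances are all hypothetical.

```python
class ProtocolViolation(Exception):
    """Raised when a request tries to change a reference copy."""


class InMemoryRepository:
    """Hypothetical stand-in for both the third-party connector and the cache."""

    def __init__(self, instances=()):
        self.instances = {i["guid"]: dict(i) for i in instances}

    def get(self, guid):
        return self.instances.get(guid)

    def find(self, predicate):
        return [i for i in self.instances.values() if predicate(i)]

    def save(self, instance):
        self.instances[instance["guid"]] = dict(instance)

    def remove(self, guid):
        self.instances.pop(guid, None)


class SmartProxyRepositoryConnector:
    def __init__(self, third_party, cache):
        self.third_party = third_party
        self.cache = cache

    def get_by_guid(self, guid):
        # Try the cache first, then fall back to the third-party repository.
        cached = self.cache.get(guid)
        return cached if cached is not None else self.third_party.get(guid)

    def find(self, predicate):
        # Query the third-party repository, then swap in cached versions.
        results = []
        for instance in self.third_party.find(predicate):
            cached = self.cache.get(instance["guid"])
            results.append(cached if cached is not None else instance)
        return results

    def create(self, instance):
        self.third_party.save(instance)

    def update(self, instance):
        # Reference copies must not be updated locally.
        if self.cache.get(instance["guid"]) is not None:
            raise ProtocolViolation(instance["guid"])
        self.third_party.save(instance)

    def delete_from_event(self, guid):
        # Deletes arriving as events are applied to both stores.
        self.third_party.remove(guid)
        self.cache.remove(guid)


third_party = InMemoryRepository([
    {"guid": "term-0001", "metadata_collection_id": "bu-1", "name": "Customer"},
    {"guid": "asset-42", "metadata_collection_id": "bu-1", "name": "Orders table"},
])
cache = InMemoryRepository([
    {"guid": "term-0001", "metadata_collection_id": "governance", "name": "Customer"},
])
proxy = SmartProxyRepositoryConnector(third_party, cache)

print(proxy.get_by_guid("term-0001")["metadata_collection_id"])
print(proxy.get_by_guid("asset-42")["metadata_collection_id"])
```

In this sketch the cohort only ever sees the cached header for `term-0001`, while `asset-42`, a genuine home instance, is served from the third-party repository unchanged.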

The Smart Proxy Event Mapper wraps the Third Party Metadata Event Mapper and acts as the event publisher that the Third Party Metadata Event Mapper is given at start up. The Third Party Metadata Event Mapper monitors changes in the third-party metadata repository and pushes OMRS events to its event publisher. That publisher is the Smart Proxy Event Mapper, which works with the Smart Proxy Repository Connector to determine whether each event describes a change to a reference copy or to a valid home instance. Events for valid home instances are passed to the cohort. Events for reference copies result in the audit log message and possible corrective action in the third party metadata repository.
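A minimal sketch of that event flow is shown below. It assumes a much simplified event shape and uses plain dictionaries for the stores; the class name matches the description above, but the method names, the `restore_enabled` flag and the event fields are assumptions rather than the real OMRS interfaces.

```python
class SmartProxyEventMapper:
    """Acts as the event publisher handed to the third-party event mapper."""

    def __init__(self, reference_copies, third_party_store, cohort_publisher,
                 restore_enabled=False):
        self.reference_copies = reference_copies    # guid -> official copy
        self.third_party_store = third_party_store  # guid -> local copy
        self.cohort_publisher = cohort_publisher    # callable taking an event
        self.restore_enabled = restore_enabled
        self.audit_log = []

    def publish(self, event):
        guid = event["instance"]["guid"]
        official = self.reference_copies.get(guid)
        if official is None:
            # A valid home instance: pass the event on to the cohort.
            self.cohort_publisher(event)
            return "forwarded"
        # A reference copy was changed locally: record the violation.
        self.audit_log.append(f"Reference copy {guid} changed in third-party repository")
        if self.restore_enabled:
            # Put the official property values back in the third-party store.
            self.third_party_store[guid]["properties"] = dict(official["properties"])
            return "restored"
        return "audited"


sent_to_cohort = []
reference_copies = {
    "term-0001": {"guid": "term-0001", "metadata_collection_id": "governance",
                  "properties": {"name": "Customer"}},
}
third_party_store = {
    "term-0001": {"guid": "term-0001", "properties": {"name": "Client"}},
}
mapper = SmartProxyEventMapper(reference_copies, third_party_store,
                               sent_to_cohort.append, restore_enabled=True)

home_result = mapper.publish({"type": "update", "instance": {"guid": "asset-42"}})
copy_result = mapper.publish({"type": "update", "instance": {"guid": "term-0001"}})
print(home_result, copy_result)
```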

mandy-chessell commented 3 years ago

Administration

Although the Smart Repository Proxy's connectors require no changes to the Egeria runtime to operate, it would help users if the admin services were enhanced to build the nested connection objects required to configure the nested connectors used in the smart repository connector, as well as to set up the configuration properties that control the behaviour of the Smart Repository Proxy's connectors.

cmgrote commented 3 years ago

All looks great -- just a question on this statement:

Requests to delete metadata are passed both the Third Party Metadata Repository Connector and the cache.

Deletes, or purges? I would have thought a delete would be a protocol violation, as neither connector is the home repository and therefore should not be able to handle a soft-delete of the instance (?)

I was expecting only a purgeReferenceCopy would be handled against the cache, though perhaps this would need to be passed onwards as a purgeInstance call against the Third Party Metadata Repository Connector (?)

mandy-chessell commented 3 years ago

Good question:

There are three forms of delete in the events

soft delete
purge
delete purge

In all three cases, reference copies are removed from both repositories.

However, you are right that a delete or purge of a reference copy through the API is a protocol violation.
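The distinction can be sketched as two separate paths. The event type labels below are simply the three names listed above, not the real OMRS event type identifiers, and the function names and dictionary stores are hypothetical.

```python
DELETE_EVENT_TYPES = {"soft delete", "purge", "delete purge"}


class ProtocolViolation(Exception):
    """Raised when the API is used to delete or purge a reference copy."""


def handle_delete_event(event_type, guid, cache, third_party):
    """Delete arriving as a cohort event: remove the copy from both stores."""
    if event_type not in DELETE_EVENT_TYPES:
        raise ValueError(f"not a delete event: {event_type}")
    cache.pop(guid, None)
    third_party.pop(guid, None)


def handle_api_delete(guid, cache, third_party):
    """Delete or purge requested through the API."""
    if guid in cache:
        # The instance is a reference copy, so this repository is not its home.
        raise ProtocolViolation(guid)
    third_party.pop(guid, None)


cache = {"term-0001": {"name": "Customer"}}
third_party = {"term-0001": {"name": "Customer"}, "asset-42": {"name": "Orders table"}}

try:
    handle_api_delete("term-0001", cache, third_party)
    api_delete_rejected = False
except ProtocolViolation:
    api_delete_rejected = True

handle_delete_event("purge", "term-0001", cache, third_party)
print(api_delete_rejected, sorted(third_party))
```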

lenawoolf commented 3 years ago

@mandy-chessell Mandy, Thinking about scenario of 3 third-party repositories containing metadata instances loaded with the same import script and therefore the same GUIDs. Each third-party repository therefore contains subset of metadata that it owns and subset that it was given, with fixed GUIDs. What approach would company use to identify which one of the three repositories should have its subset of common GUIDs maintained as "owned"? In other words, how does one select which repo should be owning common set vs keeping a reference? I suppose it would come back to the reasons all three were loaded with same import. Maybe one is a Dev env, second one is Test, and third is a Prod? In that case, should they even be part of the same cohort? Or, most likely, one is a head office (or governance repository) and the other two are subsidiary.

You mention that " governance third party metadata repository" would not be connected to cohort and can be used to build metadata archive. What happens when new instances are added to governance third party repository? How do other reference repositories have their cache updated?

mandy-chessell commented 3 years ago

The rule does not change - the owner of the metadata is the originator. This is reflected in the metadata collection id in the header of the instance. In the scenario described above, the governance metadata repository is the owner of the common definitions since its metadata collection id is in the header of each element in the archive.

At the time before the governance metadata repository joins the cohort, no member of the cohort can change the content since all copies imported via the archive are reference copies. Updates to the common definitions from the governance metadata repository are introduced through a new archive/import file.

If/when the governance metadata repository joins the cohort, it continues to be the owner of the common definitions. The difference is that changes to its instances are distributed to the other cohort members immediately through the cohort mechanisms.

If the governance metadata repository is to be decommissioned, then it is possible to change ownership of its elements using the rehome commands. The metadata collection id is set to the repository that is taking over responsibility for maintaining the common definitions.

This is all standard cohort operation that was defined in the original OMRS spec. All the smart repository proxy adds is mitigation when the third party repository does not support reference copies.

lenawoolf commented 3 years ago

Thank you Mandy @mandy-chessell I am trying to visualize final deployment topology with Smart Proxy in place. Initial set up, as per https://github.com/odpi/egeria/issues/5402#issuecomment-871526274, was 4 systems, one is "governance", importing common definitions into downstream three systems. Issue was - impossible to connect all four systems via single Egeria cohort. Also impossible to connect 3 systems (without governance system) together. Solution recap: export common definitions from "governance" repository, put Smart Proxies in front of three downstream systems and prime proxies with exported data. Now those three downstream systems can talk to each other via Egeria. Updates of common definitions in Proxies are done via exports. If the "governance" system wants to join cohort as well, all works as normal, but only if Smart Proxy recognizes instances from "governed" system as owner of reference copies.

I wonder if there is a way to indicate to all the Smart Proxies that a particular member of the cohort is the "owner" of duplicate GUIDs that can be found in the source system, and therefore they should be treated as "reference" copies everywhere. That way we might avoid the archive import process to prime the proxies altogether.

mandy-chessell commented 3 years ago

The archive import is precisely that. It is acting as the mechanism to tell the smart repository proxies which instances in their third party metadata repository have incorrect metadata collection ids and what the metadata collection ids should be so that it can communicate with the other cohort members in a compliant way.

The third party metadata repository is providing incorrect metadata collection ids in its instances because it does not support reference copies and so has not stored what the metadata collection id should be.

The other aspect of reference copies that needs to be supported is that they should not be changed by the third party metadata repositor(y/ies). Having complete instances in the archive means that the smart repository proxy can send out the completely correct version of the reference copy to the cohort. If there is an event mapper, the smart repository proxy can detect updates to these reference copy instances and record the violation in the audit log. If the repository connector is sophisticated enough, it can restore the correct values in the third party metadata repository as well.

The archive is only needed when the common definitions are being maintained by imports. The smart repository proxy can dynamically build the list of reference copies coming from other members of the cohort. For that it needs a persistent repository. If this persistent repository supports history then the smart repository proxy can support historical queries on behalf of its third party metadata repository as well.

With the smart proxy, we can take a repository, such as Apache Atlas, that does not support reference copies or historical queries and enable a compliant two-way exchange of metadata with it.

odpi / egeria

Smart repository proxy #5402