Category: Enhancement
Scope: Planning

Open-Catalog - Planning (metadata catalog)

General

The purpose of this document is to create a high-level plan for a metadata catalog capability, as specified in issue 685.

The work will be incremental: the external API will be created first, to fulfill the requirements arising from existing metadata needs such as:

PPL connectivity metadata discovery API described in 561
PPL query validation API described in 561
Query Cost based Optimizer statistics API described in 612

Setup Stage

In this stage we need to create a new OpenSearch project named open-knowledge.

Stage 1:

This stage will create the external API that will be used by PPL and additional clients. To stay agile (and allow simple evolution) in the API spec definition, and to share a common language, we will use a generic DSL.

Here we have a few alternatives:

Smithy DSL (https://github.com/awslabs/smithy) will allow us to decouple the specification from the actual implementation.
GraphQL DSL (https://graphql.org/learn/schema/) will also allow us to create an independent specification from the actual implementation and allow it to evolve according to requirements, see RFC 698
YangDB's DSL
- https://github.com/YANG-DB/yang-db/blob/develop/docs/tutorial/sample/dragons/create-ontology.md
- https://github.com/YANG-DB/yang-db/blob/dev-opensearch/docs/info/components/ontology.md

This proprietary DSL has support for both logical entity-relation topology and for lower level aspects of storage usability concerns.

If we select this option to represent the API we need to implement an API generation mechanism since it currently has no API auto-generation capability.

This step will result in three (or more) DSL specifications, one per need:

validation API
statistics API
connectivity API

The resulting artifact will also be able to auto-generate a Java implementation of the client/server API spec, to serve as a stub for external usage in system tests.
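To make the idea of a stub more concrete, here is a minimal hand-written sketch of what a generated Java stub for the three APIs might look like. Every name in it (`ValidationApi`, `StatisticsApi`, `ConnectivityApi`, `StubCatalog`, and the method signatures) is a hypothetical placeholder; the real shapes would come from the chosen DSL and from the requirements in 561 and 612.

```java
import java.util.List;
import java.util.Map;
import java.util.Optional;

// Hypothetical result of validating a query's table and field references.
record ValidationResult(boolean valid, List<String> errors) {}

// Hypothetical statistics a cost-based optimizer might ask for.
record TableStatistics(String table, long rowCount) {}

// One interface per API; names and signatures are assumptions.
interface ValidationApi {
    ValidationResult validate(String table, List<String> fields);
}

interface StatisticsApi {
    Optional<TableStatistics> statistics(String table);
}

interface ConnectivityApi {
    List<String> listTables();
}

// In-memory stub implementing all three interfaces, for use in system tests.
final class StubCatalog implements ValidationApi, StatisticsApi, ConnectivityApi {
    private final Map<String, List<String>> fieldsByTable;
    private final Map<String, Long> rowCounts;

    StubCatalog(Map<String, List<String>> fieldsByTable, Map<String, Long> rowCounts) {
        this.fieldsByTable = fieldsByTable;
        this.rowCounts = rowCounts;
    }

    @Override
    public ValidationResult validate(String table, List<String> fields) {
        List<String> known = fieldsByTable.get(table);
        if (known == null) {
            return new ValidationResult(false, List.of("unknown table: " + table));
        }
        // Collect one error per field that the table does not declare.
        List<String> errors = fields.stream()
                .filter(f -> !known.contains(f))
                .map(f -> "unknown field: " + table + "." + f)
                .toList();
        return new ValidationResult(errors.isEmpty(), errors);
    }

    @Override
    public Optional<TableStatistics> statistics(String table) {
        return Optional.ofNullable(rowCounts.get(table))
                .map(count -> new TableStatistics(table, count));
    }

    @Override
    public List<String> listTables() {
        // Sorted so that test expectations are deterministic.
        return fieldsByTable.keySet().stream().sorted().toList();
    }
}
```

Keeping the three interfaces separate mirrors the three DSL specifications, while a single stub class keeps system-test setup small.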

references

Stage 2:

Once the external API is created with some level of confidence, the next phase will be to implement the internal storage catalog Ontology, which will represent these aspects with the ability to evolve.

We will take advantage of knowledge graph concepts to create a generic Ontology that reflects the essential entities common to the APIs.

Example proposed Entities

Dataset: The DATASET entity represents collections of data, typically Tables or Views, Streams in a stream-processing environment, or bundles of data stored as Files or Folders in data lake systems.
Table (Index): The TABLE entity represents a collection of columns that forms a logical unit with business meaning or significance.
Dashboard: The DASHBOARD entity represents a collection of Tables or Queries for visualization.
Role: The ROLE entity represents a logical action that can be performed on another asset (resource).

This entity list is partial and should be considered an example only.
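As an illustration of how such entities could be captured in a generic, knowledge-graph-style model, the sketch below declares entity types and relation types as plain Java records. The property names, the relation names (`contains`, `visualizes`, `appliesTo`), and the version string are assumptions made for this example only; they are not taken from RFC 698 or from YangDB's DSL.

```java
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

// A typed property of an entity, e.g. name:string.
record Property(String name, String type) {}

// A node type in the ontology.
record EntityType(String name, List<Property> properties) {}

// A directed edge type between two entity types.
record RelationType(String name, String source, String target) {}

record Ontology(String version, List<EntityType> entities, List<RelationType> relations) {
    // Reject relations that reference entity types the ontology does not declare.
    Ontology {
        Set<String> names = entities.stream()
                .map(EntityType::name)
                .collect(Collectors.toSet());
        for (RelationType r : relations) {
            if (!names.contains(r.source()) || !names.contains(r.target())) {
                throw new IllegalArgumentException("relation '" + r.name()
                        + "' references an undeclared entity type");
            }
        }
    }
}

final class ExampleOntology {
    private ExampleOntology() {}

    // Builds the four example entities with hypothetical properties and relations.
    static Ontology build() {
        List<Property> named = List.of(new Property("name", "string"));
        return new Ontology(
                "0.1", // hypothetical version label
                List.of(
                        new EntityType("Dataset", named),
                        new EntityType("Table", named),
                        new EntityType("Dashboard", named),
                        new EntityType("Role", named)),
                List.of(
                        new RelationType("contains", "Dataset", "Table"),
                        new RelationType("visualizes", "Dashboard", "Table"),
                        new RelationType("appliesTo", "Role", "Dataset")));
    }
}
```

Validating relations at construction time is one simple way to keep an evolving ontology internally consistent; a real implementation would likely also check property types.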

See suggested GraphQL partial schema RFC 698

Ontology

This Ontology will support versioning for backward and forward compatibility and will be maintained in a dedicated location. Since we intend the data store holding the metadata registry and content to be OpenSearch, this topology must have an index generator capability that allows schema creation in the underlying OpenSearch engine.

Currently, the only available option for representing such a general-purpose topology (with index generation) is YangDB's DSL. Depending on which DSL we choose for generating the API, we may have to implement a DSL-to-DSL translator from the chosen API DSL to YangDB's Ontology.
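To show roughly what index generation involves, the sketch below turns a list of ontology fields into the JSON body of an OpenSearch create-index request with explicit mappings. `FieldDef` and `IndexMappingGenerator` are hypothetical names, and this is not how YangDB's generator works; it only illustrates the ontology-to-mapping step under the assumption that each property already carries an OpenSearch field type such as `keyword` or `long`.

```java
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical ontology property: a field name plus an OpenSearch field type.
record FieldDef(String name, String openSearchType) {}

final class IndexMappingGenerator {
    private IndexMappingGenerator() {}

    // Builds a create-index request body of the form
    // {"mappings":{"properties":{"<field>":{"type":"<type>"}, ...}}}.
    // Field names are assumed to be plain identifiers, so no JSON escaping is done.
    static String mappingFor(List<FieldDef> fields) {
        String properties = fields.stream()
                .map(f -> "\"" + f.name() + "\":{\"type\":\"" + f.openSearchType() + "\"}")
                .collect(Collectors.joining(","));
        return "{\"mappings\":{\"properties\":{" + properties + "}}}";
    }
}
```

For example, `IndexMappingGenerator.mappingFor(List.of(new FieldDef("name", "keyword")))` yields a body that could be sent when creating the index that stores one entity type.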

The resulting artifacts will be as follows:

Ontology representation
API DSL to Ontology Converter (currently one direction only)
- This tool will need to support multi-API conversion in the future
Ontology Index-Generator Support

Stage 3:

This stage will focus on a general-purpose query language that allows asking cross-domain questions on the metadata catalog. We will select a cross-domain, general-purpose (graph) query language that lets us reflect the multiple aspects we need to collect in a single query.

Currently, the most likely language will be open-cypher or GQL with its preliminary release.

In contrast with the specific API, this query is untyped and will return a collection of rows that may later be transformed into meaningful logical entities. Supporting a general-purpose graph language is a very large task, so we will take advantage of YangDB's existing query translators to convert the test query into an OpenSearch-specific query.
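The step from untyped rows to logical entities can be sketched as follows. The example assumes a hypothetical graph query whose result columns are named `dataset` and `table`; the `TableRef` record, the `RowMapper` class, and the column names are illustrative only and do not reflect YangDB's actual result format.

```java
import java.util.List;
import java.util.Map;

// Hypothetical typed view over one result row.
record TableRef(String dataset, String table) {}

final class RowMapper {
    private RowMapper() {}

    // Converts untyped rows, such as those a query like
    //   MATCH (d:Dataset)-[:contains]->(t:Table) RETURN d.name AS dataset, t.name AS table
    // might return, into typed TableRef values. The query text is illustrative only.
    static List<TableRef> toTableRefs(List<Map<String, Object>> rows) {
        return rows.stream().map(RowMapper::toTableRef).toList();
    }

    private static TableRef toTableRef(Map<String, Object> row) {
        Object dataset = row.get("dataset");
        Object table = row.get("table");
        if (dataset == null || table == null) {
            throw new IllegalArgumentException("row is missing 'dataset' or 'table': " + row);
        }
        return new TableRef(dataset.toString(), table.toString());
    }
}
```

Failing fast on a missing column keeps the untyped boundary narrow: everything past the mapper can rely on well-formed entities.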

This stage will be heavily dependent on the progress made on the open-graph feature and will in fact be a joint effort.