opensearch-project / sql

Query your data using familiar SQL or intuitive Piped Processing Language (PPL)
https://opensearch.org/docs/latest/search-plugins/sql/index/
Apache License 2.0

[PLANNING] Open-Catalog project #699

Closed by YANG-DB 1 year ago

YANG-DB commented 2 years ago

Open-Catalog - Planning (metadata catalog)

General

The purpose of this document is to create a high-level plan for a metadata catalog capability, as specified in issue 685.

The work will be incremental: the external API will be created first, to fulfill the different requirements arising from existing metadata needs such as:

Setup Stage

In this stage we need to create a new OpenSearch project named open-knowledge.

Stage 1:

This part will include creating the external API that will be used by PPL and additional clients. To stay agile (and allow simple evolution) in the API spec definition, and to share a common language, we will use a generic DSL.

Here we have a few alternatives:

This proprietary DSL has support for both logical entity-relation topology and for lower level aspects of storage usability concerns.

If we select this option to represent the API, we will need to implement an API generation mechanism, since it currently has no API auto-generation capability.


This step will result in 3 (or more) DSLs for different needs:

The resulting artifact will also be able to auto-generate a Java implementation of the client/server API spec, to function as a stub for external usage in system tests.


Stage 2:

Once the external API is created with some level of confidence, the next phase will be to implement the internal storage catalog Ontology, which will allow representing these aspects with the ability to evolve.

We will take advantage of knowledge graph concepts to create a generic Ontology that reflects the essential entities common to the API:

Example proposed Entities

This entity list is partial and is to be considered an example only.

See the suggested partial GraphQL schema in RFC 698.


Ontology

This Ontology will support versioning for backward and forward compatibility and will be maintained in a dedicated location. Since we intend OpenSearch to be the data store holding the metadata registry and content, this topology must have an index generator capability that allows schema creation in the underlying OpenSearch engine.
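To make the index generator idea concrete, here is a minimal Python sketch of how a versioned ontology could be turned into OpenSearch index definitions. It is an illustration only, not YangDB's DSL or generator: the `Ontology` and `EntityType` classes, the type table, and the index naming scheme are all hypothetical assumptions. Only the `mappings` / `properties` / `_meta` layout of the index body follows the standard OpenSearch mapping format.

```python
from dataclasses import dataclass, field

# Hypothetical table: ontology property type -> OpenSearch field type.
FIELD_TYPES = {
    "string": "keyword",
    "text": "text",
    "int": "long",
    "float": "double",
    "date": "date",
    "bool": "boolean",
}


@dataclass
class EntityType:
    name: str
    properties: dict[str, str]  # property name -> ontology type


@dataclass
class Ontology:
    name: str
    version: str
    entities: list[EntityType] = field(default_factory=list)


def index_name(ontology: Ontology, entity: EntityType) -> str:
    # Assumed naming scheme; OpenSearch index names must be lowercase.
    return f"{ontology.name}-{entity.name}-v{ontology.version}".lower()


def generate_indices(ontology: Ontology) -> dict[str, dict]:
    """Return {index name: index body} for every entity type in the ontology."""
    indices = {}
    for entity in ontology.entities:
        props = {p: {"type": FIELD_TYPES[t]} for p, t in entity.properties.items()}
        indices[index_name(ontology, entity)] = {
            "mappings": {
                # Record the ontology version so readers can check compatibility.
                "_meta": {"ontology": ontology.name, "version": ontology.version},
                "properties": props,
            }
        }
    return indices


# Example entity (hypothetical, not taken from the proposed entity list).
catalog = Ontology("catalog", "1", [EntityType("Table", {"name": "string", "created": "date"})])
indices = generate_indices(catalog)
```

Each generated body could then be sent as the request body when creating the index. Putting the version in the index name is one possible way to let old and new schema versions coexist while clients migrate.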

Currently, the only available option for representing such a general-purpose topology (and index generation) is YangDB's DSL. Depending on which DSL we choose for generating the API, we may have to implement a DSL-to-DSL translator from the chosen API DSL to YangDB's Ontology.

The resulting artifact will also be as follows:

Stage 3:

This stage will focus on the general-purpose query language that allows asking cross-domain questions on the metadata catalog. We will select a cross-domain, general-purpose (graph) query language that lets us reflect the multiple aspects we need to collect in a single query.

Currently, the most likely language is openCypher, or GQL in its preliminary release.

In contrast with the specific API, this query is untyped and will return a collection of rows that may later be transformed into meaningful logical entities. Supporting a general-purpose graph language is a very large task, so we will take advantage of YangDB's existing query translators to convert the test query into an OpenSearch-specific query.
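The step from untyped rows to logical entities can be sketched as follows. This is a minimal Python illustration under stated assumptions: the `Column` entity, the row keys, and the openCypher query in the comment are hypothetical and not part of any agreed schema or of YangDB's translators.

```python
from dataclasses import dataclass

# A hypothetical openCypher query that could produce rows of this shape:
#   MATCH (t:Table)-[:HAS_COLUMN]->(c:Column)
#   RETURN t.name AS table, c.name AS name, c.type AS data_type


@dataclass(frozen=True)
class Column:
    table: str
    name: str
    data_type: str


REQUIRED_KEYS = ("table", "name", "data_type")


def rows_to_columns(rows: list[dict]) -> list[Column]:
    """Map untyped result rows onto Column entities, skipping incomplete rows."""
    return [
        Column(*(row[key] for key in REQUIRED_KEYS))
        for row in rows
        if all(key in row for key in REQUIRED_KEYS)
    ]


rows = [
    {"table": "orders", "name": "id", "data_type": "long"},
    {"table": "orders", "name": "total"},  # missing data_type, so it is skipped
]
columns = rows_to_columns(rows)
```

Keeping this mapping outside the query layer means the graph query can stay generic, while each client decides which logical entities it wants to build from the rows.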

This stage will be heavily dependent on the progress made on the open-graph feature and will in fact be a joint effort.

The resulting artifact will also be as follows:

YANG-DB commented 1 year ago

Use the existing project for these concepts: https://github.com/opensearch-project/opensearch-catalog