
[RFC] Metadata Catalog API #685

Closed. YANG-DB closed this issue 1 year ago.

YANG-DB commented 2 years ago

Catalog

The purpose of this feature (enhancement) is to explain and discuss the need for a generalized metadata knowledge catalog.

Why is a Catalog needed?

Modern data stores and engines experience extreme growth in the volume, variety, and velocity of data, much of which arrives as 'on demand' data sets that are partially structured (or not structured at all). While these systems tend to offer some data discovery method for navigating the large amount of diverse data, they often neglect the common practice of generating on-the-fly data structures (projections and materializations) that are used alongside the raw data sets.


Definition

One of the simplest definitions for a data catalog goes as follows:

“A data catalog is an organized inventory of data assets in the organization. It uses metadata to help organizations manage their data. It also helps data professionals collect, organize, access, and enrich metadata to support data discovery and governance.”

In a modern enterprise, there is a dazzling array of different types of data assets: tables in relational databases or NoSQL stores, streams in different workflows, features in the AI system, metrics in the observability platform, dashboards in the visualization tool, etc. The modern data catalog is expected to contain an inventory of all of these kinds of data assets.

Use Cases

An important question concerns the kind of metadata we want to store in the data catalog, because that influences the use cases we can enable.

Some common use cases:

Every use case brings its own special metadata needs, and also requires connectivity to existing metadata brought in by other use cases.

In Practice

Metadata can typically be ingested using the following methods:

This ingestion process usually runs about once a day. During ingestion there is often some transformation of the raw metadata into the semantic metadata model, since the data is rarely in the exact form the catalog wants. Typically, this transformation is embedded directly into the ingestion job.

This type of metadata ingestion typically leads to batch-like, non-incremental workloads (nightly refreshes from the source), which affects (meta)data freshness.
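For illustration, a nightly batch job of this kind might upsert raw RDBMS metadata into the catalog's semantic model; a minimal sketch (all labels, relationship types, and properties here are assumptions, not a committed schema):

```cypher
// Minimal sketch: upsert a dataset and one of its tables from a raw
// metadata record pulled during the nightly batch run.
MERGE (d:Dataset {name: 'inventory'})
MERGE (t:Table {name: 'products'})
MERGE (d)-[:contains]->(t)
SET t.sourceSystem = 'rdbms-inventory',
    t.lastIngestedAt = datetime()
```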


Providing a (push-based) schematic interface immediately creates a contract between producers of metadata and the “central metadata” store.

Domain Model

After ingestion of the (meta)data, the collected data must be accessible in a common way that represents the different domain attributes of the collected data models (the knowledge).

For example: collecting RDBMS metadata for some inventory system, which contains details of the different tables and views in the schema, and in addition collecting metadata about the business process (pipeline) that is applied on top of these tables (such as executed queries and job schedules).

Example

In this case every table will be represented as a domain entity with relations to other entities.
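A minimal sketch of such a graph (the labels and relationship types are illustrative assumptions):

```cypher
// The 'products' table as a domain entity, related to the dataset that
// contains it and to a view derived from it.
CREATE (d:Dataset {name: 'inventory'})
CREATE (t:Table {name: 'products'})
CREATE (v:View {name: 'products_in_stock'})
CREATE (d)-[:contains]->(t)
CREATE (v)-[:derivedFrom]->(t)
```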

Example

TODO add another example for pipeline domain schema
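Until that example is written, a hedged sketch of what a pipeline domain schema might look like, with a job related to the tables it reads and writes and to its schedule (all names are assumptions):

```cypher
// A pipeline job that reads the 'products' table, writes a derived
// table, and runs on a nightly schedule.
CREATE (j:Job {name: 'refresh-stock-report'})
CREATE (src:Table {name: 'products'})
CREATE (dst:Table {name: 'stock_report'})
CREATE (sch:Schedule {cron: '0 2 * * *'})
CREATE (j)-[:reads]->(src)
CREATE (j)-[:writes]->(dst)
CREATE (j)-[:scheduledBy]->(sch)
```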


In general, a logical abstraction is needed to represent any type of domain, and this logical representation must be capable of evolving without causing any physical changes.

Knowledge Topology

A domain-oriented metadata model enables extensible, strongly-typed metadata models and relationships to be defined collaboratively across the enterprise. Strong typing is important because without it, only the least common denominator of generic property-bags can be stored in the metadata store, which makes it impossible for programmatic consumers of metadata to process it with any guarantee of backwards compatibility.

The metadata model graph uses the terminology of Entity Types, Relationships, and Aspects to reflect these concerns.

Suggested Topology High Level Entities

The basic knowledge is described as a graph with default types of high-level abstract entities: Datasets, Roles, and Groups. Different aspects, such as Ownership, Physical-Schema, and Profile, can be attached (as a typed relationship) to these entities.

In practice this topology is represented as a property graph that is generic enough to allow any general-purpose domain semantics and can evolve freely with (almost) no constraints. In such a graph the basic elements are typed entities and relations, labeled with some kind of type identification; these elements also have properties that represent the fields of the domain elements.
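For instance, a typed entity and typed relation with properties might be created as follows (a sketch; the concrete labels and properties are assumptions):

```cypher
// A typed Dataset entity with field-level properties, owned by a Role
// entity through a typed 'ownedBy' relationship.
CREATE (d:Dataset {name: 'inventory', createdAt: datetime()})
CREATE (r:Role {name: 'data-platform-team'})
CREATE (d)-[:ownedBy {since: date('2022-01-01')}]->(r)
```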

This topology allows asking complex questions, enabling the caller to declare "schema on read" style queries:

Example

TODO add example for 'schema on read'
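As a placeholder, such a 'schema on read' question could be: which tables in datasets owned by a given role contain a 'price' field? It is declared as a pattern at query time (labels and properties are assumptions):

```cypher
// Find tables whose physical schema contains a 'price' field, scoped
// to datasets owned by the data-platform team.
MATCH (r:Role {name: 'data-platform-team'})<-[:ownedBy]-(d:Dataset)
MATCH (d)-[:contains]->(t:Table)-[:aspect]->(p:Physical)
WHERE 'price' IN p.fields
RETURN d.name, t.name
```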

Entities & Aspects

The purpose of Aspects is to give an entity the capability to express additional features without changing its internal fields.

To allow a generic knowledge structure (which can contain multiple business domains in the same place), we predefine three default high-level entity types that are the heart of the generic knowledge domain: Role, Group, and Dataset.

These entity types are always present no matter which business domain they represent.

On top of these entities we will create additional aspects that reflect different points of view on these elements, as sketched below:
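A sketch of attaching such an aspect (here an Ownership aspect on a Dataset; the names are assumptions) without touching the entity's own fields:

```cypher
// Attach an Ownership aspect to the 'inventory' dataset.
MATCH (d:Dataset {name: 'inventory'})
CREATE (o:Ownership {owner: 'data-platform-team', since: date('2022-01-01')})
CREATE (d)-[:aspect]->(o)
```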

A 'Table' Entity example

Here are some aspects for a 'table' type entity:

Each aspect may be of interest to a different point of view, and therefore a specific query is used to fetch only the needed aspects:

A graph-query API (the Cypher graph query language) is used for expressing the query:

MATCH (d:Dataset {name: 'inventory'})-[:contains]->(t:Table {name: 'products'})-[:aspect]->(a) WHERE a:Physical OR a:Statistics RETURN t, a

This query fetches the inventory.products table together with two connected aspects: the physical schema details and the statistics details.

Validation, Execution & Performance

For the query execution use case, the catalog can be used for the following tasks:

Examine the 'products' table (from the previously presented inventory-system ER diagram) as an example; its columns are represented as fields:

Validation: given any query that references this specific table, we need to verify:
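For illustration, one such check is that every column the query references actually exists in the table's physical schema; a minimal sketch, assuming the Physical aspect stores the fields as a list property:

```cypher
// Does the 'products' table expose a 'unit_price' column?
MATCH (t:Table {name: 'products'})-[:aspect]->(p:Physical)
RETURN 'unit_price' IN p.fields AS columnExists
```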

Planning

When an execution plan is created, one of the basic steps is to give that plan a score, which allows the planner to choose the most performant plan.

In a heuristic planner, each step of the plan is estimated according to statistics that are kept as metadata. These cost measurements include:

All of these statistics are part of the table entity's Statistics aspect.
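A sketch of how the planner might fetch these costing inputs from the Statistics aspect (property names such as rowCount are assumptions):

```cypher
// Fetch costing statistics for the 'products' table.
MATCH (t:Table {name: 'products'})-[:aspect]->(s:Statistics)
RETURN s.rowCount, s.avgRowSize, s.distinctValuesPerColumn
```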

Graph Storage Layer

Some storage repositories cannot handle indexing internally the way an RDBMS does; in such (NoSQL) cases, indices must be created explicitly to improve query performance and data-retrieval efficiency.

Knowledge of these indices (or, for that matter, materialized views) will also be kept in the catalog, and the execution planner will be able to take advantage of them to create a cost-effective plan.

For example, when using OpenSearch to store the inventory tables' information, there will be cases where a query joins data from two or more tables.

The execution engine may translate such joins into materialized views to improve query performance. The catalog must be aware of the existence of these views and expose this information through an external API; the query planner is one consumer of such an API.

This information will be saved in the Physical Aspect:
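A sketch of registering such a materialized view so the planner can discover it (labels and properties are assumptions):

```cypher
// Register a materialized view that pre-joins two tables, and link it
// to the Physical aspect of the tables it covers.
MATCH (p:Table {name: 'products'})-[:aspect]->(pp:Physical)
MATCH (o:Table {name: 'orders'})-[:aspect]->(op:Physical)
CREATE (mv:MaterializedView {name: 'products_orders_mv', backingIndex: 'products_orders_mv_idx'})
CREATE (mv)-[:covers]->(pp)
CREATE (mv)-[:covers]->(op)
```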

API

The catalog will support two types of APIs:

Additional Dependencies

External References

penghuo commented 2 years ago

https://github.com/opensearch-project/sql/issues/561 also defines a catalog. Are we proposing a more generic version of the catalog?

YANG-DB commented 2 years ago

Yes: the purpose will be to create a common access point that represents many aspects of the metadata API, such as:

YANG-DB commented 1 year ago

Use the https://github.com/opensearch-project/opensearch-catalog project to utilize this API.