
[RFC] Metadata Catalog API #685

Closed. YANG-DB closed this issue 1 year ago.

YANG-DB commented 2 years ago

Catalog

The purpose of this feature (enhancement) is to explain and discuss the need for a generalized metadata knowledge catalog.

Why is a Catalog needed?

Modern data stores and engines experience extreme growth in the volume, variety, and velocity of data, much of which arrives as 'on demand' data sets that are partially structured (or not structured at all). While these systems tend to offer some data discovery method for navigating the large amount of diverse data, they often neglect the common practice of generating on-the-fly data structures (projections and materializations) that are used alongside the raw data sets.


Definition

One of the simplest definitions for a data catalog goes as follows:

“A data catalog is an organized inventory of data assets in the organization. It uses metadata to help organizations manage their data. It also helps data professionals collect, organize, access, and enrich metadata to support data discovery and governance.”

In a modern enterprise, there is a dazzling array of different types of data assets: tables in relational databases or NoSQL stores, streams in different workflows, features in the AI system, metrics in the observability platform, dashboards in the visualization tool, etc. The modern data catalog is expected to contain an inventory of all of these kinds of data assets.

Use Cases

An important question concerns the kind of metadata we want to store in the data catalog, because that influences the use cases we can enable.

Some common use cases:

Every use case brings its own special metadata needs, and also requires connectivity to existing metadata brought in by other use cases.

In Practice

Metadata can typically be ingested using the following methods:

This ingestion process usually runs about once a day. During ingestion there is often some transformation of the raw metadata into the semantic metadata model, since the data is rarely in the exact form the catalog wants. Typically, this transformation is embedded directly into the ingestion job.

This type of metadata ingestion typically leads to batch-like, non-incremental workloads (nightly refreshes from the source), which affects (meta)data freshness.
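For illustration, a nightly batch job of this kind might upsert raw RDBMS metadata into the catalog's semantic model; a minimal sketch (all labels, relationship types, and properties here are assumptions, not a committed schema):

```cypher
// Minimal sketch: upsert a dataset and one of its tables from a raw
// metadata record pulled during the nightly batch run.
MERGE (d:Dataset {name: 'inventory'})
MERGE (t:Table {name: 'products'})
MERGE (d)-[:contains]->(t)
SET t.sourceSystem = 'rdbms-inventory',
    t.lastIngestedAt = datetime()
```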


Providing a (push-based) schematic interface immediately creates a contract between producers of metadata and the “central metadata” store.

Domain Model

After ingestion of the (meta)data, the collected data must be accessible in a common way that represents the different domain attributes of the collected data models (the knowledge).

For example: collecting RDBMS metadata for some inventory system, which contains details of the different tables and views in the schema, and in addition collecting metadata about the business process (pipeline) that is applied on top of these tables (such as executed queries and job schedules).

Example

In this case every table will be represented as a domain entity with relations to other entities.
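A minimal sketch of such a graph (the labels and relationship types are illustrative assumptions):

```cypher
// The 'products' table as a domain entity, related to the dataset that
// contains it and to a view derived from it.
CREATE (d:Dataset {name: 'inventory'})
CREATE (t:Table {name: 'products'})
CREATE (v:View {name: 'products_in_stock'})
CREATE (d)-[:contains]->(t)
CREATE (v)-[:derivedFrom]->(t)
```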

Example

TODO add another example for pipeline domain schema
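Until that example is written, a hedged sketch of what a pipeline domain schema might look like, with a job related to the tables it reads and writes and to its schedule (all names are assumptions):

```cypher
// A pipeline job that reads the 'products' table, writes a derived
// table, and runs on a nightly schedule.
CREATE (j:Job {name: 'refresh-stock-report'})
CREATE (src:Table {name: 'products'})
CREATE (dst:Table {name: 'stock_report'})
CREATE (sch:Schedule {cron: '0 2 * * *'})
CREATE (j)-[:reads]->(src)
CREATE (j)-[:writes]->(dst)
CREATE (j)-[:scheduledBy]->(sch)
```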


In general, a logical abstraction is needed to represent any type of domain, and this logical representation must be capable of evolving without causing any physical changes.

Knowledge Topology

A domain-oriented metadata model enables extensible, strongly-typed metadata models and relationships to be defined collaboratively across the enterprise. Strong typing is important because without it, only the least common denominator of generic property-bags can be stored in the metadata store, which makes it impossible for programmatic consumers of metadata to process it with any guarantee of backwards compatibility.

The metadata model graph uses the terminology of Entity Types, Relationships, and Aspects to reflect these concerns.

Suggested Topology High Level Entities

The basic knowledge is described as a graph with default types of high-level abstract entities: Datasets, Roles, and Groups. Different aspects, such as Ownership, Physical-Schema, and Profile, can be attached (as a typed relationship) to these entities.

In practice this topology is represented as a property graph that is generic enough to allow any general-purpose domain semantics and can evolve freely with (almost) no constraints. In such a graph the basic elements are typed entities and relations, labeled with some kind of type identification; these elements also have properties that represent the fields of the domain elements.
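For instance, a typed entity and typed relation with properties might be created as follows (a sketch; the concrete labels and properties are assumptions):

```cypher
// A typed Dataset entity with field-level properties, owned by a Role
// entity through a typed 'ownedBy' relationship.
CREATE (d:Dataset {name: 'inventory', createdAt: datetime()})
CREATE (r:Role {name: 'data-platform-team'})
CREATE (d)-[:ownedBy {since: date('2022-01-01')}]->(r)
```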

This topology allows asking complex questions, enabling the caller to declare "schema on read" style queries:

Example

TODO add example for 'schema on read'
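As a placeholder, such a 'schema on read' question could be: which tables in datasets owned by a given role contain a 'price' field? It is declared as a pattern at query time (labels and properties are assumptions):

```cypher
// Find tables whose physical schema contains a 'price' field, scoped
// to datasets owned by the data-platform team.
MATCH (r:Role {name: 'data-platform-team'})<-[:ownedBy]-(d:Dataset)
MATCH (d)-[:contains]->(t:Table)-[:aspect]->(p:Physical)
WHERE 'price' IN p.fields
RETURN d.name, t.name
```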

Entities & Aspects

The purpose of Aspects is to give an entity the capability to express additional features without changing its internal fields.

To allow a generic knowledge structure (which can contain multiple business domains in the same place), we predefine three default high-level entity types that are the heart of the generic knowledge domain: Role, Group, and Dataset.

These entity types are always present no matter which business domain they represent.

On top of these entities we will create additional aspects that reflect different points of view on these elements, as sketched below:
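A sketch of attaching such an aspect (here an Ownership aspect on a Dataset; the names are assumptions) without touching the entity's own fields:

```cypher
// Attach an Ownership aspect to the 'inventory' dataset.
MATCH (d:Dataset {name: 'inventory'})
CREATE (o:Ownership {owner: 'data-platform-team', since: date('2022-01-01')})
CREATE (d)-[:aspect]->(o)
```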

A 'Table' Entity example

Here are some aspects for a 'table' type entity:

Each aspect may be of interest to a different point of view, and therefore a specific query is used to fetch only the needed aspects:

A graph-query API (the Cypher graph query language) is used for expressing the query:

MATCH (d:Dataset {name: 'inventory'})-[:contains]->(t:Table {name: 'products'})-[:aspect]->(a) WHERE a:Physical OR a:Statistics RETURN t, a

This query fetches the inventory.products table together with two connected aspects: the physical schema details and the statistics details.

Validation, Execution & Performance

For the query execution use case, the catalog can be used for the following tasks:

Examine the 'products' table (from the previously presented inventory-system ER diagram) as an example; its columns are represented as fields:

Validation: given any query that references this specific table, we need to verify:
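For illustration, one such check is that every column the query references actually exists in the table's physical schema; a minimal sketch, assuming the Physical aspect stores the fields as a list property:

```cypher
// Does the 'products' table expose a 'unit_price' column?
MATCH (t:Table {name: 'products'})-[:aspect]->(p:Physical)
RETURN 'unit_price' IN p.fields AS columnExists
```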

Planning

When an execution plan is created, one of the basic steps is to give that plan a score, which allows the planner to choose the most performant plan.

In a heuristic planner, each step of the plan is estimated according to statistics that are kept as metadata. These cost measurements include:

All of these statistics are part of the table entity's Statistics aspect.
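A sketch of how the planner might fetch these costing inputs from the Statistics aspect (property names such as rowCount are assumptions):

```cypher
// Fetch costing statistics for the 'products' table.
MATCH (t:Table {name: 'products'})-[:aspect]->(s:Statistics)
RETURN s.rowCount, s.avgRowSize, s.distinctValuesPerColumn
```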

Graph Storage Layer

Some storage repositories cannot handle indexing internally the way an RDBMS does; in such (NoSQL) cases, indices must be created explicitly to improve query performance and data-retrieval efficiency.

Knowledge of these indices (or, for that matter, materialized views) will also be kept in the catalog, and the execution planner will be able to take advantage of them to create a cost-effective plan.

For example, when using OpenSearch to store the inventory tables' information, there will be cases where a query joins data from two or more tables.

The execution engine may translate such joins into materialized views to improve query performance. The catalog must be aware of the existence of these views and expose this information through an external API; the query planner is one consumer of such an API.

This information will be saved in the Physical Aspect:
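A sketch of registering such a materialized view so the planner can discover it (labels and properties are assumptions):

```cypher
// Register a materialized view that pre-joins two tables, and link it
// to the Physical aspect of the tables it covers.
MATCH (p:Table {name: 'products'})-[:aspect]->(pp:Physical)
MATCH (o:Table {name: 'orders'})-[:aspect]->(op:Physical)
CREATE (mv:MaterializedView {name: 'products_orders_mv', backingIndex: 'products_orders_mv_idx'})
CREATE (mv)-[:covers]->(pp)
CREATE (mv)-[:covers]->(op)
```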

API

The catalog will support two types of APIs:

Additional Dependencies

External References

penghuo commented 2 years ago

https://github.com/opensearch-project/sql/issues/561 also defines a catalog. Are we proposing a more generic version of the catalog?

YANG-DB commented 2 years ago

Yes: the purpose will be to create a common access point that represents many aspects of the metadata API, such as:

YANG-DB commented 1 year ago

Use the https://github.com/opensearch-project/opensearch-catalog project to utilize this API.