varchar-io / nebula

A distributed block-based data storage and compute engine
https://nebula.bz
Apache License 2.0

prototyping Nebula Ingestion DDL #62

Closed chenqin closed 3 years ago

chenqin commented 3 years ago

Nebula Ingestion DDL

YAML is a powerful way to express configuration: it is easy for people to understand and change. At the same time, remembering all the different configuration options and concepts imposes a high tax once we start supporting functions and preprocessing, indexing, or consistent hashing (a possible concept for expanding/shrinking the storage cluster). This can lead to inventing yet another set of configuration options and concepts that only experts can remember.

Moreover, an OLAP system works as part of the big data ecosystem; being able to transform and pre-process data at ingestion time would give Nebula an edge over other OLAP engines and make it easier for users to adopt.

Consider a motivating example that is not yet supported by Nebula.

A user has a Hive table and a Kafka stream ingested into Nebula. The Hive table has hourly partitions keeping the last 60 days' moving average of business spend per account; the Kafka stream contains each account's business transactions in foreign currencies. The user wants to investigate account spending status in near real time in their home currency (e.g. USD).

The complexity of this use case is threefold.

If the user were to write this as an RDBMS query, it would look like:

OPTION 1 Materialized View with schema as part of config

```sql
create view nebula.transaction_analytic as (
  select hive.accountid, avg(hive.spend), kafka.transactionid, TO_USD(kafka.transaction_amount)
  from hive right join kafka on hive.accountid = kafka.accountid
  where <all configs on hive, kafka>
);
```
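To make the view-based shape of OPTION 1 concrete, here is a quick illustration using SQLite as a stand-in (SQLite is not Nebula; the schemas, data, and pre-converted `amount_usd` column are assumptions for illustration only):

```python
# Option 1 shape: define a view once over the two sources, query it later.
# SQLite stand-in; schemas and sample data are illustrative assumptions.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE account (accountid INTEGER, spend REAL)")
conn.execute("CREATE TABLE txn (transactionid INTEGER, accountid INTEGER, amount_usd REAL)")
conn.execute("INSERT INTO account VALUES (1, 110.0)")
conn.execute("INSERT INTO txn VALUES (10, 1, 62.5)")

# The view is defined once; every query against it sees current base data
conn.execute("""
    CREATE VIEW transaction_analytic AS
    SELECT a.accountid, a.spend, t.transactionid, t.amount_usd
    FROM account a JOIN txn t ON a.accountid = t.accountid
""")

view_rows = conn.execute("SELECT * FROM transaction_analytic").fetchall()
print(view_rows)
```

Note the design difference this glosses over: a plain SQL view is recomputed on every query, whereas a materialized view persists its results; a Nebula implementation would presumably maintain the result incrementally as the Kafka stream arrives.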

Alternatively, we can support a two-statement flow like:

OPTION 2 Full Table with schema inference

DDL

```sql
-- mapping of the hive table synced into a nebula table
create table hive.account (
  accountid bigint PRIMARY KEY,
  spend double,
  dt varchar(20) UNIQUE NOT NULL
) with ();

create table kafka.transaction (
  transactionid bigint PRIMARY KEY,
  accountid bigint NOT NULL,
  transaction_amount double,
  _time timestamp
) with ();

create table transaction_analytic (
  accountid bigint PRIMARY KEY,
  avg_transaction double,
  transaction_amount_in_usd double,
  _time timestamp
) with ();
```

DML

```sql
insert into transaction_analytic
select transaction.accountid, avg(account.spend), TO_USD(transaction.transaction_amount), transaction._time
from hive.account right join kafka.transaction on account.accountid = transaction.accountid;
```
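The two-statement flow above can be exercised end to end with SQLite as a stand-in; everything here is an illustrative sketch, not Nebula itself (the simplified schemas, the sample data, and the fixed EUR rate inside `TO_USD` are all assumptions):

```python
# End-to-end sketch of the DDL + DML flow using SQLite as a stand-in.
# Schemas are simplified; TO_USD uses a made-up fixed EUR->USD rate.
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# DDL: simplified stand-ins for hive.account, kafka.transaction, and the target
cur.execute("CREATE TABLE account (accountid INTEGER, spend REAL, dt TEXT)")
cur.execute("CREATE TABLE txn (transactionid INTEGER, accountid INTEGER, amount REAL, currency TEXT)")
cur.execute("CREATE TABLE transaction_analytic (accountid INTEGER, avg_spend REAL, amount_usd REAL)")

# Sample data: per-account moving-average spend plus streamed foreign-currency events
cur.executemany("INSERT INTO account VALUES (?, ?, ?)",
                [(1, 100.0, "2021-01-01"), (1, 120.0, "2021-01-02")])
cur.executemany("INSERT INTO txn VALUES (?, ?, ?, ?)",
                [(10, 1, 50.0, "EUR"), (11, 1, 30.0, "EUR")])

# TO_USD as a user-defined SQL function (fixed, illustrative rate)
conn.create_function("TO_USD", 2, lambda amt, ccy: amt * 1.25 if ccy == "EUR" else amt)

# DML: probe the moving-average table per stream event and convert to USD
cur.execute("""
    INSERT INTO transaction_analytic
    SELECT t.accountid,
           (SELECT AVG(spend) FROM account a WHERE a.accountid = t.accountid),
           TO_USD(t.amount, t.currency)
    FROM txn t
""")

rows = cur.execute("SELECT * FROM transaction_analytic").fetchall()
print(rows)
```

A correlated subquery computes the moving average per event here; in a real streaming engine the aggregate would be maintained incrementally rather than recomputed per probe.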

chenqin commented 3 years ago

Support DDL-based table ingestion declaration: https://github.com/varchar-io/nebula/pull/63

shawncao commented 3 years ago

FYI. This is interesting, we should talk about it later - https://materialize.com/docs/get-started/

chenqin commented 3 years ago

> FYI. This is interesting, we should talk about it later - https://materialize.com/docs/get-started/

Indeed an interesting project, especially the declarative MV syntax.

Here is the foundational dataflow programming framework it runs on: https://github.com/TimelyDataflow

A related industry-oriented thought: a robust, full-featured stream-stream join is tricky and complicated at scale.

chenqin commented 3 years ago

A somewhat related topic: from an adoption point of view, having a drag-and-drop UI would be significantly easier for end users compared to SQL, and orders of magnitude easier compared to even Python.

My old team open-sourced a flow-builder UI framework: https://github.com/chenqin/react-digraph

chenqin commented 3 years ago

Following up on the stream-stream join point above, a pragmatic path could be to:

  • start with a common use case, such as a stream to materialized-table join, that users can run out of the box with minimal learning needed;
  • leverage the built-in columnar storage to filter efficiently in a one-pass hash join;
  • use OpenMP to take advantage of parallel performance gains on heterogeneous hardware architectures.
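The stream to materialized-table join in the first bullet can be sketched in a few lines. This is a minimal illustration, not Nebula code: the `avg_spend` build table, the `fx_to_usd` rate map, and the tuple layout are all hypothetical, and a real implementation would probe columnar blocks in parallel rather than Python tuples:

```python
# One-pass hash join of a stream against a small materialized table.
# All names and data are hypothetical; inner-join semantics for simplicity.

# Build side: materialized dimension (60-day average spend per account)
avg_spend = {1: 110.0, 2: 80.0}

# Hypothetical fixed FX table standing in for TO_USD
fx_to_usd = {"EUR": 1.25, "USD": 1.0}

def join_stream(transactions):
    """Single pass over the stream, probing the hash table per event."""
    out = []
    for txn_id, account, amount, ccy in transactions:
        build_row = avg_spend.get(account)
        if build_row is None:
            continue  # unmatched stream events are dropped (inner join)
        out.append((account, build_row, amount * fx_to_usd[ccy]))
    return out

stream = [(10, 1, 50.0, "EUR"), (11, 2, 30.0, "USD"), (12, 3, 9.0, "EUR")]
print(join_stream(stream))  # account 3 has no materialized row and is dropped
```

Because the build side is bounded (one row per account) while the probe side streams, only the probe pass touches the unbounded data, which is what makes this shape much simpler than a general stream-stream join.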

Something related from Splunk: https://twitter.com/esammer/status/1343675579850113024

chenqin commented 3 years ago

I spent a bit more time on ingestion SQL during vacation. There is a good amount of prototyping and experimental work needed before proposing a complete solution. I plan to take a long shot and write a standalone prototype.