Use Cases
https://vision.cloudera.com/apache-kudu-top-use-cases-for-real-time-analytics/
Time Series Data
Data that arrives sequentially and delivers a discrete point-in-time measurement is known as time series data. By continuously appending the latest measurement to the historic set of points, we can see trends emerge in the data. Kudu enables that data to be appended in real time and gives us the ability to run analytics on it. That analytical capability can help pivot time series data sets from post-mortem data – analyzing what went wrong after it happened – to predictive data that enables action before an adverse event occurs.
Examples: Market data streaming, internet of things (IoT), connected cars, fraud detection/prevention, risk monitoring
Machine Data Analytics
Machine data analytics refers to the data that your network, computers, and users are generating as they go about their daily business. In the best of times, this information is mundane. However, in stormier weather, it can create a map that leads to bad actors, bottlenecks in your infrastructure, and potential problems with your enterprise apps. With Kudu, real-time analytics puts this map in your hands early, providing a guide to “what’s happening” as opposed to “what’s happened”.
Examples: Network threat detection, network health monitoring, advanced persistent threats (APT), cybersecurity, application performance monitoring
Online Reporting
Online reporting – such as an operational data store – has traditionally been limited by data volume and analytic capability. Keeping long histories of data was prohibitively expensive, and analytical capabilities were the domain of data warehouses. However, with Kudu, online reporting can now be real-time, store all historic data with complete fidelity, and support analytical queries.
Example: Operational Data Store
In summary, Kudu expands the functionality of the Hadoop ecosystem by providing relational storage for fast analytics on fast data. This makes specific use cases easier to implement and more broadly applicable. Furthermore, it rounds out the full set of storage options available from Cloudera, which now includes HDFS, Apache HBase (NoSQL), Kudu (relational), and cloud-based object storage. This enables clients to move easily between the storage types their use cases demand, without the need to retrain users on the platform where the data resides.
History
Lambda Architecture
All data entering the system is dispatched to both the batch layer and the speed layer for processing.
The batch layer has two functions: (i) managing the master dataset (an immutable, append-only set of raw data), and (ii) pre-computing the batch views.
The serving layer indexes the batch views so that they can be queried in a low-latency, ad-hoc way.
The speed layer compensates for the high latency of updates to the serving layer and deals with recent data only.
Any incoming query can be answered by merging results from the batch views and the real-time views.
Xiaomi's data processing architecture link
Kappa Architecture
link
Other Systems
hive
A SQL engine built on top of the Hadoop ecosystem. Hive does not store data itself; it only supports importing data in specific formats into HDFS. There has long been debate over whether to analyze data with MapReduce or with an RDBMS.
Industry is now gradually moving back from MR toward RDBMS-style systems. Pig is a product similar to Hive.
Kudu also supports SQL queries through Hive.
bigquery / dremel
Underneath is Dremel, which is read-only. Dremel's storage format lives on as a standalone project, apache/parquet.
mesa
link Mesa's requirements are also strong consistency, near-real-time updates, an emphasis on query performance, and scalability – essentially the same as Kudu's. On top of it, F1 and Dremel can be used for SQL queries.
To meet the strong-consistency requirement, both Kudu and Mesa maintain multi-versioned data. Mesa's data, however, lives in Colossus.
Kudu implements a fairly complex compaction policy, while Mesa uses two-level compaction: data is first written as singletons, every 10 singletons are aggregated into a cumulative delta, and the cumulative deltas are periodically (daily) merged into the base data.
Palo link: an open-source implementation of Mesa.
druid
For near-real-time querying, Druid targets almost the same audience as Kudu.
A comparison of Druid and Kudu: link
Druid forces timestamp to be the key: for example, data from the last 5 minutes lands on the real-time nodes, and after 5 minutes it is flushed to the historical nodes. Since data older than 5 minutes will never change again, the data on the historical nodes can be treated as immutable; this property enables a query cache, and scans can skip the delta store, which also improves performance.
Updates are slow, but since Druid targets time-series data, update cost is largely a non-issue.
Kudu is only a storage engine and can be paired with different compute frameworks (hive, spark, MR, impala), whereas Druid ships with its own query engine (though not necessarily a faster one?).
Kudu's deployment is far simpler than Druid's (kudu master + replicas vs. zookeeper, mysql, broker, coordinator, real time node, historic node).
phoenix
Why Column Store?
Paper: Column-Stores vs. Row-Stores: How Different Are They Really?
Q: Why not implement a column store using rocksdb/HBase column families?
Q: Are these performance gains due to something fundamental about the way column-oriented DBMSs are internally architected, or would such gains also be possible in a conventional system that used a more column-oriented physical design?
The contribution of this paper is a pair of experiments:
Storing the data column-by-column inside a row store (System X), to test whether a row store can match column-store performance through physical-design optimizations alone.
Removing the optimizations from a column store (C-Store) one by one, to measure how much each optimization contributes.
T is traditional, T(B) is traditional (bitmap), MV is materialized views, VP is vertical partitioning, and AI is all indexes.
There are four tables here, each joined on one attribute. The figure shows that splitting each row into per-attribute columns and brute-forcing pairwise hash joins does not necessarily improve performance; a plain traditional row store actually holds up well. The best scheme is the materialized view, which drops the unneeded columns and is the most efficient.
The VP (vertical partitioning) scheme stores each column as primary key → column data. The paper concludes that the primary keys account for a large share of the cost (tuple overhead). Moreover, each column is still sorted by primary key rather than by column data, which is unfriendly to joins: if columns were sorted by column data, joins could use a merge join instead of a hash join.
T=tuple-at-a-time processing, t=block processing; I=invisible join enabled, i=disabled; C=compression enabled, c=disabled; L=late materialization enabled, l=disabled.
As the figure shows, compression and late materialization are the two most significant optimizations.
Lazy Materialization
Paper: Materialization Strategies in a Column-Oriented DBMS
In Kudu, predicates are pushed down to the scanner, which enables this kind of filtering before rows are materialized. From the C++ client API (the constants are added here to match the values in the comments):

const int kLowerBound = 5;
const int kUpperBound = 600;
// Add a predicate: WHERE key1 >= 5
KuduPredicate* p = table->NewComparisonPredicate(
    "key1", KuduPredicate::GREATER_EQUAL, KuduValue::FromInt(kLowerBound));
KUDU_RETURN_NOT_OK(scanner.AddConjunctPredicate(p));
// Add a predicate: WHERE key2 <= 600
p = table->NewComparisonPredicate(
    "key2", KuduPredicate::LESS_EQUAL, KuduValue::FromInt(kUpperBound));
KUDU_RETURN_NOT_OK(scanner.AddConjunctPredicate(p));
Originally, as each row is read, the relevant columns are extracted and assembled into an intermediate tuple before the pairwise joins; this approach is called early materialization (EM).
Late materialization (LM) instead filters on the predicates first, discarding unneeded records before performing the join.
The advantages are:
Columns are stored compressed, whereas EM has to decompress them.
Only the needed tuples are constructed.
The drawback is that each column has to be scanned twice.
Compression
Paper: Integrating compression and execution in column-oriented database systems
Further Optimization
PAX
Kudu's on-disk data format closely resembles Parquet, with a few differences to support efficient random access as well as updates. The underlying data is not directly queryable without using the Kudu client APIs. The Kudu developers have worked hard to keep scans fast, focusing on storing data efficiently rather than making the trade-offs that would be required to allow direct access to the data files.
Tuple construction
Transaction
Kudu itself supports only single-tablet transactions, and since it does not support get operations, it essentially supports only read-only or write-only transactions, not read-write transactions. Kudu's "transaction" can therefore be thought of as a write batch, so named to be easier for its audience to understand.
HLC