risingwavelabs / risingwave-docs

The official user documentation of RisingWave
https://docs.risingwave.com
Apache License 2.0
33 stars 26 forks source link

A comparison between streaming ingestion vs batch ingestion #2012

Open lmatz opened 3 months ago

lmatz commented 3 months ago

After we supported batch source read and reuse several concepts in RW, it is crucial that users can easily differentiate them and understand the different semantics of various SQL statements, i.e. streaming read, batch source read

link: https://github.com/risingwavelabs/risingwave-docs/issues/2013

Draft

Risingwave supports two different options for ingesting data from external sources. One is streaming ingestion for unbounded data, i.e. an ever-growing, essentially infinite data set. The other one is batch ingestion for bounded data, i.e. an finite data set.

Streaming Ingestion

As a streaming database, it's only natural to support streaming ingestion. This is because, when developing a streaming application, we anticipate the data from external sources to flow endlessly. It's also the primary reason users opt for a streaming database in the first place.

Whether users intend to read from a message queue like Kafka or synchronize change data captured from an OLTP database such as PostgreSQL/MySQL, Risingwave offers two SQL primitives for defining streaming ingestion: (1) create table T (...) with (...) and (2) create source S (...) with (...).

When users issue create table T (...) with (...), RW immediately polls the data from the specified source in the with clause and stores it in table T. As it's a table, users can freely insert records into table T, delete records from table T, and update records stored in table T.

On the other hand, when users issue create source S (...) with (...), RW doesn't take action immediately. Only when users define a materialized view or a sink involving S will the data from the specified source in the with clause be ingested. However, none of this data gets stored; it's solely utilized for computing the SQL defining the materialized view or sink.

Batch Ingestion

While streaming ingestion remains the core focus in Risingwave, there are occasions when we require a one-time retrieval of a static snapshot from external sources. For instance, we might need to fetch files from a local filesystem or object storage services like S3, or perhaps ingest a snapshot of files from a datalake such as Iceberg. This often happens when users have both a batch processing pipeline and a streaming processing pipeline in their technology stack

To maintain simplicity without adding unnecessary complexity, we repurpose one of our primitives, namely create source S (...) with (...), to enable users to define batch ingestion. This process involves three steps:

  1. Use create source S (...) with (...); to specify the external sources for batch ingestion.
  2. Employ create table T; to establish a table for storing the ingested data. It's important to note that this table is plain and lacks any associated with clause.
  3. Execute insert into T select * from S; to retrieve data from external source S and insert it into table T. This batch query is crucial for triggering the batch ingestion process.

At present, Risingwave facilitates batch ingestion from three types of external sources:

  1. Posix file system
  2. S3-compatible object storage
  3. Iceberg

One question that may arise is: how do we determine which files will be ingested when issuing the insert into T select command, especially if new files continue to be added to these sources and we wish to ingest a specific subset of files?

lmatz commented 3 months ago

This is the current draft, there are still some missing pieces in terms of the funciotnality, let me complete it later when they are done