Now that we support batch source reads and reuse several existing concepts in RisingWave, it is crucial that users can easily differentiate them and understand the different semantics of the various SQL statements, i.e. streaming reads versus batch source reads.
RisingWave supports two different options for ingesting data from external sources. One is streaming ingestion for unbounded data, i.e. an ever-growing, essentially infinite data set. The other is batch ingestion for bounded data, i.e. a finite data set.
Streaming Ingestion
As a streaming database, RisingWave naturally supports streaming ingestion. When developing a streaming application, we anticipate the data from external sources to flow endlessly, and this is the primary reason users opt for a streaming database in the first place.
Whether users intend to read from a message queue like Kafka or synchronize change data captured from an OLTP database such as PostgreSQL/MySQL, RisingWave offers two SQL primitives for defining streaming ingestion: (1) create table T (...) with (...) and (2) create source S (...) with (...).
When users issue create table T (...) with (...), RisingWave immediately starts polling data from the source specified in the WITH clause and stores it in table T. Since T is a table, users can freely insert, delete, and update the records stored in it.
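For illustration, here is a minimal sketch of such a table backed by Kafka; the table name, topic, broker address, and schema are placeholders rather than prescribed values:

    -- Continuously ingests from a Kafka topic into a table (hypothetical names).
    CREATE TABLE orders (
        order_id BIGINT,
        amount DOUBLE PRECISION
    ) WITH (
        connector = 'kafka',
        topic = 'orders',
        properties.bootstrap.server = 'broker:9092',
        scan.startup.mode = 'earliest'
    ) FORMAT PLAIN ENCODE JSON;

    -- Because orders is a table, ordinary DML also works:
    INSERT INTO orders VALUES (1, 9.99);
    DELETE FROM orders WHERE order_id = 1;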
On the other hand, when users issue create source S (...) with (...), RisingWave doesn't take action immediately. Only when users define a materialized view or a sink involving S will the data from the source specified in the WITH clause be ingested. However, none of this data gets stored; it is consumed solely to compute the SQL defining the materialized view or sink.
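The counterpart sketch with create source, again with placeholder names; note that the source definition alone triggers no ingestion:

    -- A source by itself ingests nothing and stores nothing.
    CREATE SOURCE order_events (
        order_id BIGINT,
        amount DOUBLE PRECISION
    ) WITH (
        connector = 'kafka',
        topic = 'orders',
        properties.bootstrap.server = 'broker:9092'
    ) FORMAT PLAIN ENCODE JSON;

    -- Ingestion starts only once something consumes the source:
    CREATE MATERIALIZED VIEW order_totals AS
    SELECT order_id, SUM(amount) AS total
    FROM order_events
    GROUP BY order_id;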
Batch Ingestion
While streaming ingestion remains the core focus of RisingWave, there are occasions when we require a one-time retrieval of a static snapshot from external sources. For instance, we might need to fetch files from a local filesystem or an object storage service like S3, or ingest a snapshot of files from a data lake such as Iceberg. This often happens when users have both a batch processing pipeline and a streaming processing pipeline in their technology stack.
To maintain simplicity without adding unnecessary complexity, we repurpose one of our primitives, namely create source S (...) with (...), to enable users to define batch ingestion. The process involves three steps (a full SQL sketch follows the list):
Use create source S (...) with (...); to specify the external sources for batch ingestion.
Employ create table T (...); to establish a table for storing the ingested data. It's important to note that this table is plain and has no associated WITH clause.
Execute insert into T select * from S; to retrieve data from external source S and insert it into table T. This batch query is what actually triggers the batch ingestion.
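Putting the three steps together, here is a sketch using the S3 connector; the bucket, region, credentials, match pattern, and schema are placeholders, so the exact options should be checked against the connector documentation:

    -- Step 1: declare the external files as a source (nothing is read yet).
    CREATE SOURCE file_source (
        id BIGINT,
        name VARCHAR
    ) WITH (
        connector = 's3',
        s3.bucket_name = 'my-bucket',       -- placeholder
        s3.region_name = 'us-east-1',       -- placeholder
        s3.credentials.access = 'xxx',      -- placeholder
        s3.credentials.secret = 'xxx',      -- placeholder
        match_pattern = '*.json'
    ) FORMAT PLAIN ENCODE JSON;

    -- Step 2: a plain table with no WITH clause.
    CREATE TABLE file_table (
        id BIGINT,
        name VARCHAR
    );

    -- Step 3: the batch query that actually triggers the one-time ingestion.
    INSERT INTO file_table SELECT * FROM file_source;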
At present, RisingWave supports batch ingestion from three types of external sources:
POSIX file system
S3-compatible object storage
Iceberg
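For the file-system case, only the source definition in step 1 changes. A hedged sketch, assuming the posix_fs connector with a posix_fs.root option pointing at a local directory (the option names here are recalled from the connector and may differ):

    -- Same three-step flow as above; only the connector options differ.
    CREATE SOURCE local_files (
        id BIGINT,
        name VARCHAR
    ) WITH (
        connector = 'posix_fs',
        posix_fs.root = '/data/incoming',  -- placeholder directory
        match_pattern = '*.csv'
    ) FORMAT PLAIN ENCODE CSV (without_header = 'false', delimiter = ',');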
One question that may arise is: how do we determine which files will be ingested when issuing the insert into T select command, especially if new files continue to be added to these sources and we wish to ingest a specific subset of files?