risingwavelabs / risingwave


Discussion: Provide a WAL in RisingWave for source and DML #16772

Open kwannoel opened 2 months ago

kwannoel commented 2 months ago

Currently, as much as possible, we depend on the upstream system's properties to ensure exactly-once processing. Take MySQL / PG for instance: we depend on their WAL.

If we provide our own WAL, we no longer depend on external systems for the exactly-once part. This reduces complexity and improves maintainability.

For DML, we can also provide exactly-once processing. Currently, after a DML statement runs, its changes may not be committed immediately; if a recovery happens before they are committed, we lose the records.
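
To make the durability gap concrete, here is a minimal sketch of the append-then-fsync-then-ack discipline a DML WAL would follow. This is not RisingWave code; `DmlWal` and the on-disk layout are made up for illustration.

```rust
use std::fs::{File, OpenOptions};
use std::io::{Result, Write};

/// Hypothetical per-node DML write-ahead log.
struct DmlWal {
    file: File,
}

impl DmlWal {
    fn open(path: &str) -> Result<Self> {
        let file = OpenOptions::new().create(true).append(true).open(path)?;
        Ok(Self { file })
    }

    /// Append an encoded DML chunk (length-prefixed) and fsync before
    /// returning, so the OK sent to the user implies durability even if
    /// a recovery happens before the next checkpoint.
    fn append(&mut self, encoded_chunk: &[u8]) -> Result<()> {
        self.file.write_all(&(encoded_chunk.len() as u32).to_le_bytes())?;
        self.file.write_all(encoded_chunk)?;
        self.file.sync_data()?;
        Ok(())
    }
}
```

On recovery, the node would replay any WAL entries newer than the last committed checkpoint before serving traffic.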

fuyufjh commented 2 months ago

I just realized that the log store doesn't seem like a good choice. The log store, as part of Hummock, relies on the once-per-second checkpoint to commit (i.e. to make sure data is persisted and won't be lost in any way). In this case, however, we want to commit DML changes as fast as possible, and a 1-second commit latency sounds too long.


Then it comes back to the very early discussion: shall we set up a Kafka in front of RW to hold the DML requests? That is, when the frontend node accepts a DML statement, it writes the statement into Kafka and returns OK to the user as soon as the Kafka producer acknowledges the write.

BEFORE:  DML statements -> frontend node -> compute nodes
AFTER:   DML statements -> frontend node -> Kafka WAL -> compute nodes

Of course, Kafka itself is not required; anything that can provide this ability would work.
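
For the record, the frontend-to-Kafka flow could look roughly like this, assuming the `rdkafka` crate; the topic name `rw_dml_wal` and the error handling are placeholders, not a settled design:

```rust
use std::time::Duration;

use rdkafka::config::ClientConfig;
use rdkafka::producer::{FutureProducer, FutureRecord};

/// `acks=all` makes a record durable across replicas before the broker
/// acknowledges, which is what lets the frontend return OK safely.
fn make_producer(brokers: &str) -> rdkafka::error::KafkaResult<FutureProducer> {
    ClientConfig::new()
        .set("bootstrap.servers", brokers)
        .set("acks", "all")
        .create()
}

/// Return OK to the user only after the Kafka producer acknowledges.
async fn ack_dml_via_kafka(producer: &FutureProducer, payload: &[u8]) -> anyhow::Result<()> {
    producer
        .send(
            FutureRecord::<(), _>::to("rw_dml_wal").payload(payload),
            Duration::from_secs(5),
        )
        .await
        .map_err(|(e, _msg)| anyhow::anyhow!(e))?;
    Ok(())
}
```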

StrikeW commented 2 months ago

> Take MySQL / PG for instance: we depend on their WAL. If we provide our own WAL, we no longer depend on external systems for the exactly-once part. This reduces complexity and improves maintainability.

Even if we have our own WAL, we are still a consumer of the upstream WAL, which means we still depend on the WAL of the external system. For example, when a recovery occurs, we still need to reset the upstream offset and resume consumption from it.
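
In other words, the upstream offset has to be checkpointed and replayed no matter what, as in this illustrative sketch (the trait and types below are hypothetical, not RisingWave's actual source interface):

```rust
/// What we persist atomically with each committed epoch.
struct CheckpointedOffset {
    epoch: u64,
    /// e.g. a MySQL binlog position or a Postgres LSN.
    upstream_offset: String,
}

trait SourceReader {
    /// On recovery, seek the *external* system's WAL back to the offset
    /// of the last committed checkpoint and re-consume from there; our
    /// own WAL does not remove this dependency.
    fn seek(&mut self, offset: &CheckpointedOffset);
}
```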

kwannoel commented 1 month ago

I think the most important piece is the DML WAL. Using Kafka as the WAL means our deployments (e.g. cloud) would also need to package it, so I don't think that will work. We need to either build our own or find an off-the-shelf alternative.

Any ideas @chenzl25 @wenym1?

chenzl25 commented 1 month ago

Because of our decoupled compute-storage architecture with shared storage, the latency of a single DML statement will inevitably be high if we implement the WAL in the shared storage layer. I also feel that introducing Kafka is too heavyweight. Databases such as AWS Aurora, which share a similar architecture with ours, do not specifically advertise the latency of their DML operations but emphasize throughput instead.
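
One way to read that emphasis on throughput: every commit to shared storage pays roughly one object-store round trip (often tens of milliseconds on S3-like storage), so a group commit that folds many concurrent DML statements into a single write can recover throughput even though it cannot lower any single statement's latency. A channel-based sketch of that idea, with hypothetical names:

```rust
use tokio::sync::{mpsc, oneshot};

struct WalEntry {
    payload: Vec<u8>,
    /// Resolved once the batch containing this entry is durable.
    done: oneshot::Sender<()>,
}

/// Fold all queued DML entries into one object-store write, then ack
/// each statement: one PUT's latency is amortized across the batch.
async fn group_committer(mut rx: mpsc::Receiver<WalEntry>) {
    while let Some(first) = rx.recv().await {
        let mut batch = vec![first];
        while let Ok(entry) = rx.try_recv() {
            batch.push(entry);
        }
        let blob: Vec<u8> = batch.iter().flat_map(|e| e.payload.clone()).collect();
        put_to_object_store(blob).await;
        for entry in batch {
            let _ = entry.done.send(());
        }
    }
}

/// Placeholder for a single shared-storage write (e.g. an S3 PUT).
async fn put_to_object_store(_blob: Vec<u8>) {}
```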