risingwavelabs / risingwave

Best-in-class stream processing, analytics, and management. Perform continuous analytics, or build event-driven applications, real-time ETL pipelines, and feature stores in minutes. Unified streaming and batch. PostgreSQL compatible.
https://go.risingwave.com/slack
Apache License 2.0
7.04k stars 579 forks source link

test: recovery test for sources #16356

Open tabVersion opened 7 months ago

tabVersion commented 7 months ago

In our previous tests, we did not thoroughly validate whether each source connector could resume consumption from any specific record position while ensuring exactly-once processing. The primary challenges in achieving this were the need to restart clusters within the CI environment to perform recovery operations and the inability to control the consumption rate to manage progress.

The absence of specific tests led to unintentional breaches of the exactly-once semantics when building new sources. This has been a critical issue, especially since we identified that the fs source connector could duplicate reading the current message during recovery, thanks to @stdrc . By implementing these tests, we aim to strengthen the guarantees around exactly-once semantics across all connectors.

We now support triggering recovery via SQL commands (https://github.com/risingwavelabs/risingwave/pull/16259), controlling read speeds with rate limits (https://github.com/risingwavelabs/risingwave/pull/15948), and supporting truncation at any position for chunks. Our new support for bash commands in the slt (https://github.com/risingwavelabs/risingwave/issues/12451#issuecomment-2051861048) makes it easier to control external components. The inclusion of 'key as' syntax allows clear marking of each message's offset to check for overlaps (https://github.com/risingwavelabs/risingwave/pull/13707).

To test recovery, we can use a smaller dataset, for example, 20 messages in Kafka, setting stream parallelism to 1 and streaming_rate_limit to 1. We will trigger recovery at any time between 0-20 seconds to ensure that all 20 messages are read correctly and to verify that there are no duplicates in the offsets. The same test applies to the fs source; if a data line is read twice, the recorded offset will be the same.

This test will help ensure that all our connectors meet the exactly-once requirements, safeguarding the integrity of our data processing systems.

kwannoel commented 7 months ago

Agree! Besides e2e unit tests, I also would like to have fuzzing similar to that of e2e deterministic recovery test.

kwannoel commented 6 months ago

We can use inline system commands like ./risedev k and ./risedev d to start and stop the cluster inline. After running the recovery part, sleep for some duration, and check that the records should have been ingested still.

Test Variables:

  1. DDL: create sink, create source + mv, create table with source.
  2. recovery: crash loop scenario (keep triggering restarts), long recovery (20mins).
xxchan commented 6 months ago

I've managed to using the RECOVER command to write recovery tests https://github.com/risingwavelabs/risingwave/pull/16733/files#diff-b83fa16ce469bbf54f92cf95f9d778e4cd0acf6a20489589c8bb636b6162da02

xxchan commented 6 months ago

We can use inline system commands like ./risedev k and ./risedev d to start and stop the cluster inline.

To do this, sqllogictest need to be enhanced, because the connection to fe will be disconnected (https://github.com/risingwavelabs/risingwave/issues/12451#issuecomment-2060816396)

tabVersion commented 6 months ago

We can use inline system commands like ./risedev k and ./risedev d to start and stop the cluster inline.

To do this, sqllogictest need to be enhanced, because the connection to fe will be disconnected (#12451 (comment))

IIRC, the recovery process just happens in meta inner, any connection to the fe will not be influenced.

st1page commented 5 months ago

Motivation The absence of specific tests led to unintentional breaches of the exactly-once semantics when building new sources. This has been a critical issue, especially since we identified that the fs source connector could duplicate reading the current message during recovery, thanks to @stdrc . By implementing these tests, we aim to strengthen the guarantees around exactly-once semantics across all connectors.

In addition to the exactly-once semantic guarantee, it is also meaningful to simply test whether re-creating the subscription to the upstream system after recovery will cause problems, such as the issue in https://github.com/risingwavelabs/risingwave/pull/17112