risingwavelabs / risingwave

Best-in-class stream processing, analytics, and management. Perform continuous analytics, or build event-driven applications, real-time ETL pipelines, and feature stores in minutes. Unified streaming and batch. PostgreSQL compatible.
https://go.risingwave.com/slack
Apache License 2.0
7.06k stars 581 forks source link

feat(meta): decouple barrier collect and sync in global barrier manager #19475

Open wenym1 opened 2 days ago

wenym1 commented 2 days ago

I hereby agree to the terms of the RisingWave Labs, Inc. Contributor License Agreement.

What's changed and what's your intention?

Previously in global barrier manager, barriers are handled in 2 phases. In phase 1, a barrier is injected to CNs, and waits for the BarrierCompleteResponse from the injected CNs. When the response is received, the barrier has been collected from all actors in the CN, and the epoch of the barrier has been synced and the SST files are included in the response. In phase 2, the SST files will be committed to hummock manager, and the barrier command will commit the change to fragment and catalog manager if there is any.

In partial checkpoint, barriers are injected to multiple partial graphs, and then multiple partial graphs can be synced and then committed together. Therefore, in this PR, we will decouple the barrier collection and sync in global barrier manager. In general, a barrier will have 3 phase, collect, complete, and commit. A barrier will be handled in the following steps:

  1. The barrier is injected to CNs in multiple partial graphs and enters the collect phase to wait for the BarrierCollectResponse. The barrier injection and collection is independent in different partial graphs.
  2. For all the partial graphs and epochs that have collected the BarrierCollectResponse from all CNs, the global barrier worker will generate a CompleteBarrierTask that includes them together. These partial graphs and epochs will enter the complete phase. BarrierCompleteRequest will be sent to CNs and wait for the responses. When CN receives the request, multiple partial graphs will be synced together.
  3. When the BarrierCompleteResponse is received from all CNs, the partial graphs enters the commit phase and will be committed together to hummock, fragment manager and catalog manager.

Note:

Checklist

Documentation

Release note

If this PR includes changes that directly affect users or other significant modifications relevant to the community, kindly draft a release note to provide a concise summary of these changes. Please prioritize highlighting the impact these changes will have on users.

stdrc commented 4 hours ago

Is this ready for review? Please add some description.

wenym1 commented 4 hours ago

Is this ready for review? Please add some description.

Not yet. Still running some tests. Will ping reviewers when it's ready for review

wenym1 commented 4 hours ago

This stack of pull requests is managed by Graphite. Learn more about stacking.