risingwavelabs / risingwave


Support scaling of backfill jobs #19142

Open kwannoel opened 3 weeks ago

kwannoel commented 3 weeks ago

We don't support scaling backfill jobs at the moment. So when we scale backfill jobs running under background ddl, they trigger an assertion panic, as mentioned in https://github.com/risingwavelabs/risingwave/issues/19138.

This is a prerequisite for serverless backfill and for an improved UX of background_ddl, so we should support it.
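For context, a minimal illustration (not the actual assertion in the RisingWave codebase) of the kind of all-or-nothing check that trips during scaling, as explained later in this thread and in #19138:

```rust
// Illustrative only: before a rescale, a backfill executor's vnodes are
// either all finished or all unfinished. After the vnode distribution is
// updated, an executor can own a partially finished mix, and a check of
// this shape panics.
fn check_backfill_state(finished_vnodes: usize, total_vnodes: usize) {
    assert!(
        finished_vnodes == 0 || finished_vnodes == total_vnodes,
        "expected all or none of the executor's vnodes to have finished backfill"
    );
}

fn main() {
    check_backfill_state(0, 4); // fine: nothing finished yet
    check_backfill_state(4, 4); // fine: fully finished
    check_backfill_state(2, 4); // panics: the post-rescale mixed case
}
```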

kwannoel commented 3 weeks ago

cc @shanicky, in case you already have ideas about supporting it, or have already started working on it, so we avoid duplicating work.

xxchan commented 3 weeks ago

This is a little surprising, since I thought it was already supported. I thought we disallow foreground jobs from being scaled, but allow background DDLs to be scaled.

kwannoel commented 3 weeks ago

This is a little surprising, since I thought it was already supported. I thought we disallow foreground jobs from being scaled, but allow background DDLs to be scaled.

Unfortunately not. It should be supported, but at that point in time several pieces of logic needed to be changed in both the etcd backend and the SQL backend, and background_ddl wasn't a high priority, so it was deferred until now.

The priority has increased since the logic will possibly be used for serverless backfill as well.

fuyufjh commented 1 week ago

Talked with @shanicky yesterday, and he mentioned that the major blocking point is that the barrier manager must maintain the set of under-creating actors, so that these actors can be treated specially, i.e. excluded from the normal barrier collection process.

This reminds me of #17735, where @wenym1 proposed a partial-checkpoint-based backfilling approach called snapshot backfill.

From my understanding, after #17735, there will be a dedicated checkpointing partition for the under-creating actors (a subgraph). Thus, "the set of under-creating actors" is naturally maintained by it, so the DDL process only needs to wait for it to complete.

Please correct me if anything's wrong.

wenym1 commented 1 week ago

the barrier manager must maintain the set of under-creating actors, so that these actors can be treated specially, i.e. excluded from the normal barrier collection process.

For a backfilling job, the barrier manager currently has a CreateMviewProgressTracker to track the progress and mark the streaming job as created in the catalog afterward. The tracker holds the set of under-creating actors to track the backfill progress, and this set can even be restored during recovery.
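A minimal sketch of the kind of per-actor state such a tracker keeps; the names here (`ActorId`, `BackfillProgress`, `CreateJobProgress`) are hypothetical placeholders, not RisingWave's actual types:

```rust
// Hypothetical sketch of the state a progress tracker for an under-creating
// streaming job might keep; names are illustrative, not RisingWave's API.
use std::collections::HashMap;

type ActorId = u32;

/// Progress reported by each backfill actor through barriers.
enum BackfillProgress {
    /// Rows consumed from the upstream snapshot so far.
    ConsumingSnapshot(u64),
    /// The actor has finished backfilling all vnodes assigned to it.
    Done,
}

/// Tracks every actor of a streaming job that is still being created.
struct CreateJobProgress {
    actor_progress: HashMap<ActorId, BackfillProgress>,
}

impl CreateJobProgress {
    /// The job can be marked as created in the catalog once every actor
    /// has reported `Done`.
    fn is_finished(&self) -> bool {
        self.actor_progress
            .values()
            .all(|p| matches!(p, BackfillProgress::Done))
    }
}
```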

In the original issue https://github.com/risingwavelabs/risingwave/issues/19138, the panic is caused by the assumption that for a backfill executor, either all vnodes have finished backfill or none of them have. This does not hold during scaling, where the vnode distribution is updated: some parallelisms of the backfill executor may have finished backfill while others haven't, and after scaling, a backfill executor may see that some of its vnodes have finished backfill while others haven't. If we can update our code to drop this assumption, I think we can at least support scaling backfill jobs via offline scaling.
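A rough sketch of how the state could become per-vnode instead, so that an executor owning a mix of finished and unfinished vnodes after a rescale simply keeps backfilling the unfinished ones; all names below are hypothetical, not the actual executor types:

```rust
// Hypothetical per-vnode backfill state; names are illustrative only.
use std::collections::HashMap;

type VirtualNode = u16;

enum VnodeBackfillState {
    /// Still consuming the snapshot; remembers the last primary key processed
    /// so backfill can resume from it after recovery or rescale.
    InProgress { current_pos: Vec<u8> },
    /// This vnode has fully caught up with the upstream.
    Finished,
}

struct PerVnodeBackfillState {
    states: HashMap<VirtualNode, VnodeBackfillState>,
}

impl PerVnodeBackfillState {
    /// After a rescale the executor may own a mix of finished and unfinished
    /// vnodes; it keeps backfilling only the unfinished ones instead of
    /// asserting that all vnodes share the same state.
    fn unfinished_vnodes(&self) -> Vec<VirtualNode> {
        self.states
            .iter()
            .filter(|(_, s)| matches!(s, VnodeBackfillState::InProgress { .. }))
            .map(|(vnode, _)| *vnode)
            .collect()
    }

    /// The executor is done only when every vnode it currently owns is finished.
    fn is_all_finished(&self) -> bool {
        self.states
            .values()
            .all(|s| matches!(s, VnodeBackfillState::Finished))
    }
}
```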

wenym1 commented 1 week ago

so that these actors can be treated specially, i.e. excluded from the normal barrier collection process.

I think this is not correct. Currently the normal backfilling jobs (not including snapshot backfill) are treated the same as created jobs: they are part of the global streaming graph, and the barriers are injected and collected together as a whole.

kwannoel commented 1 week ago

the barrier manager must maintain the set of under-creating actors, so that these actors can be treated specially, i.e. excluded from the normal barrier collection process.

For a backfilling job, the barrier manager currently has a CreateMviewProgressTracker to track the progress and mark the streaming job as created in the catalog afterward. The tracker holds the set of under-creating actors to track the backfill progress, and this set can even be restored during recovery.

In the original issue #19138, the panic is caused by the assumption that for a backfill executor, either all vnodes have finished backfill or none of them have. This does not hold during scaling, where the vnode distribution is updated: some parallelisms of the backfill executor may have finished backfill while others haven't, and after scaling, a backfill executor may see that some of its vnodes have finished backfill while others haven't. If we can update our code to drop this assumption, I think we can at least support scaling backfill jobs via offline scaling.

I have a PR open which can support that in the arrangement backfill executor: https://github.com/risingwavelabs/risingwave/pull/15823/files.

I think offline scaling is possible. Online scaling is not really supported at the moment, because updating a CN's resources on the cloud side recreates the CN node, which triggers a recovery; for example, upgrading 1 CN from 2c8g to 4c16g. So to me it only makes sense to support scaling under background_ddl in an offline way.