Closed stdrc closed 1 month ago
What's more, with the committed watermark information in StateTable, executors no longer need to maintain their own (inaccurate) table_cleaning_watermark state
Any reason why it's inaccurate, or why getting the per-vnode table watermark can be more accurate than reading the state table?
I think the in terms of visibility or freshness of watermark from other parallelisms, writing to a state table should be the same as writing a new per-vnode watermark.
StateTabke.state_cleaning_watermark
and HummockReadVersion.table_watermarks
, because executors don't know whether table.update_watermark
take effect or not.If we can simply table.get_watermark()
, life will be much easierš¤”š„¹
I think the only benefit we can get from this get_watermark
is that we don't need an extra state table to store the watermark, and the rest is the same.
The per-vnode table watermark can actually be treated as a special table. For any specified visibility (committed, or current, or whatever), the watermark we can get from the per-vnode table watermark should be consistent to the watermark we can read from the current watermark state table. So for the issue In this period, executor doesn't know whether it can write changes (may come from other input side than the watermark side) to state table
, the two are the same.
The only inconsistency between them may come from executors don't know whether table.update_watermark take effect or not.
, which is caused by the logic of buffering watermark in state table and delay writing to storage. This logic was previously a workaround for tombstone based range deletes to reduce the range tombstones we write. It can actually be removed when we support per-vnode table watermark now.
In brief, I think it's still doable if it can simplify the code of watermark filter, but it doesn't seem to be a solution to any of the related issues.
The per-vnode table watermark can actually be treated as a special table. For any specified visibility (committed, or current, or whatever), the watermark we can get from the per-vnode table watermark should be consistent to the watermark we can read from the current watermark state table. So for the issue
In this period, executor doesn't know whether it can write changes (may come from other input side than the watermark side) to state table
, the two are the same.The only inconsistency between them may come from
executors don't know whether table.update_watermark take effect or not.
, which is caused by the logic of buffering watermark in state table and delay writing to storage. This logic was previously a workaround for tombstone based range deletes to reduce the range tombstones we write. It can actually be removed when we support per-vnode table watermark now.
The problem is not executors don't know whether table.update_watermark take effect or not.
actually, because we can easily add a method to get state_clean_watermark
from StateTable
. The key problem is about recovery. When recovery happens, executors cannot know the last watermark updated to the state table, so that some decisions cannot be made before receiving the first watermark message.
This is a blocking issue in any executor that uses watermarks one input side to state-clean a state table that materializes another side input.
So to clarify, last updated watermark
or last committed watermark
is part of the executors' state, the lost of the state on recovery is actually a bug, not an inconvenience.
The key problem is about recovery. When recovery happens, executors cannot know the last watermark updated to the state table, so that some decisions cannot be made before receiving the first watermark message.
But things is the same for the per-vnode table watermark. The table watermark and watermark state table are always consistent. They are written to shared buffer, and committed to the LSM manifest (we call in HummockVersion
) at the same time, like in a transaction. After recovery, if we can't know the latest watermark updated to the state table, there is no way to get the latest per table watermark either.
As discussed in https://github.com/risingwavelabs/risingwave/issues/17711, we found that things can be a lot easier if
StateTable
can persist watermark (seems already done as we can query it after #15344) and restore it on recovery.What's more, with the committed watermark information in
StateTable
, executors no longer need to maintain their own (inaccurate)table_cleaning_watermark
state, and they can just blindly write any changes to state table and the latter will help ignore any changes that are below watermark.This can actually gives executors a new coherent view of states, that is, state tables now logically hold all intermediate data generated by executors, and watermarks received on executor level are just hints for the storage to clean states. Executors don't need to care about whether
update_watermark
will really delete entries or not.What do you think? cc @wenym1