risingwavelabs / risingwave

ch-benchmark q6 MV still has continuous throughput after the source stopped producing data #18055

Closed cyliu0 closed 1 week ago

cyliu0 commented 1 month ago

Describe the bug

https://grafana.test.risingwave-cloud.xyz/d/EpkBw5W4k/risingwave-dev-dashboard?orgId=1&var-datasource=P2453400D1763B4D9&var-namespace=ch-benchmark-pg-cdc-daily-20240814&var-instance=benchmark-risingwave&var-pod=All&var-component=All&var-table=All&from=now-12h&to=now&viewPanel=28

https://grafana.test.risingwave-cloud.xyz/explore?schemaVersion=1&panes=%7B%22uvc%22:%7B%22datasource%22:%22PE59595AED52CF917%22,%22queries%22:%5B%7B%22refId%22:%22A%22,%22expr%22:%22%7Bnamespace%3D%5C%22ch-benchmark-pg-cdc-daily-20240814%5C%22%7D%20%7C%3D%20%60%60%22,%22queryType%22:%22range%22,%22datasource%22:%7B%22type%22:%22loki%22,%22uid%22:%22PE59595AED52CF917%22%7D,%22editorMode%22:%22builder%22%7D%5D,%22range%22:%7B%22from%22:%221723653436544%22,%22to%22:%221723738391728%22%7D%7D%7D&orgId=1

https://buildkite.com/risingwave-test/ch-benchmark-pg-cdc/builds/356#019152e8-ea6e-4ee6-83dc-b404ea230cd6

We have hit this before with nightly-20240808. It's 100% reproducible. https://buildkite.com/risingwave-test/ch-benchmark-pg-cdc/builds/352

Error message/log

No response

To Reproduce

No response

Expected behavior

No response

How did you deploy RisingWave?

No response

The version of RisingWave

nightly-20240814

Additional context

No response

cyliu0 commented 1 month ago

This was probably introduced by one of the commits in https://github.com/risingwavelabs/rw-commits-history?tab=readme-ov-file#nightly-20240808

StrikeW commented 3 weeks ago

@KeXiangWang could you help take a look? The problem is that q6 keeps reporting continuous throughput (2 rows/sec) even after the CDC source has finished loading all data from upstream, so the test job cannot finish. The test script expects the throughput of every MV to drop to 0.
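
For illustration, the completion check described above might look roughly like the sketch below. This is not the actual test script: `wait_until_idle`, the `sample` callback, and the `required_idle_polls` parameter are hypothetical names, and a real script would presumably read the per-MV rates from the metrics backend and sleep between polls.

```python
from typing import Callable, Dict


def wait_until_idle(
    sample: Callable[[], Dict[str, float]],
    max_polls: int,
    required_idle_polls: int = 3,
) -> bool:
    """Return True once every MV reports zero throughput for
    `required_idle_polls` consecutive polls, else False after `max_polls`.

    `sample` is a hypothetical callback returning {mv_name: rows_per_sec}.
    """
    idle_streak = 0
    for _ in range(max_polls):
        throughputs = sample()
        if all(rate == 0 for rate in throughputs.values()):
            idle_streak += 1
            if idle_streak >= required_idle_polls:
                return True
        else:
            # Any MV still emitting rows resets the streak.
            idle_streak = 0
    return False


# q6 never drops below 2 rows/sec, so the check never passes.
print(wait_until_idle(lambda: {"q1": 0.0, "q6": 2.0}, max_polls=10))
```

With a check of this shape, a single MV that keeps emitting rows at every barrier is enough to keep the job from ever finishing, which matches the symptom reported here.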

cyliu0 commented 2 weeks ago

This bug still exists in the upgrade test for v2.0.0-rt.2 https://buildkite.com/risingwave-test/upgrade/builds/59#0191a228-107d-4a5f-9305-ca32872835b0

lmatz commented 1 week ago

Is the problem solved by https://github.com/risingwavelabs/risingwave/pull/18307 and https://github.com/risingwavelabs/risingwave/pull/18303?

Closing the issue for now; we can reopen it if that turns out to be wrong.

KeXiangWang commented 1 week ago

I have figured out the root cause. Starting from #17945, RW outputs no-op rows at every barrier in some cases. The fix is #18292, which is included in v2.0.0-rc.2. So if you run the ch-benchmark-pg-cdc q1-12 daily test directly with v2.0.0-rc.2, it will succeed.

However, the fix works by adding noop_update_hint: true to the physical graph. As a result, if a user created an MV with an older version (< v2.0.0), its physical graph is the old one without noop_update_hint: true. When the user upgrades the cluster to v2.0.0, that MV will output no-op rows, because the old physical graph now runs with the changes from #17945. cc @stdrc @kwannoel any ideas for fixing this? It may also be fine to leave it as is, since the extra rows are just no-ops.
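
To make the upgrade problem concrete, here is a toy model of the behavior described above. It is not RisingWave's implementation: `emit_updates` is a hypothetical function, and the assumption is that the hint lets an operator drop updates whose old and new rows are identical. Only the flag name `noop_update_hint` comes from this thread.

```python
from typing import List, Tuple

Row = Tuple
Update = Tuple[Row, Row]  # (old_row, new_row)


def emit_updates(updates: List[Update], noop_update_hint: bool) -> List[Update]:
    """Toy model: with the hint set, updates that change nothing are dropped.

    Assumption for illustration only; the real operator logic may differ.
    """
    if not noop_update_hint:
        # Graph planned before the fix: every update is forwarded downstream.
        return list(updates)
    return [(old, new) for old, new in updates if old != new]


# An unchanged result re-emitted at a barrier.
per_barrier = [((100,), (100,))]
print(len(emit_updates(per_barrier, noop_update_hint=False)))  # 1
print(len(emit_updates(per_barrier, noop_update_hint=True)))   # 0
```

Under this model, an MV whose stored graph lacks the flag keeps forwarding one unchanged row per barrier after the upgrade, so its throughput never reaches 0 even though no data is changing.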