risingwavelabs / risingwave

Best-in-class stream processing, analytics, and management. Perform continuous analytics, or build event-driven applications, real-time ETL pipelines, and feature stores in minutes. Unified streaming and batch. PostgreSQL compatible.
https://go.risingwave.com/slack
Apache License 2.0
6.97k stars 575 forks source link

nightly-20240130 nexmark-q5-many-windows perf degradation #14990

Closed cyliu0 closed 3 months ago

cyliu0 commented 8 months ago

Describe the bug

image

https://buildkite.com/risingwave-test/nexmark-benchmark/builds/2944#018d5b1a-81cd-4004-b5c1-21cf9024a263

https://grafana.test.risingwave-cloud.xyz/d/liz0yRCZz1/log-search-dashboard?orgId=1&var-data_source=Logging:%20test-useast1-eks-a&from=1706652288000&to=1706654091000&var-namespace=nexmark-bs-0-14-daily-20240130

Error message/log

No response

To Reproduce

No response

Expected behavior

No response

How did you deploy RisingWave?

No response

The version of RisingWave

nightly-20240130

Additional context

nightly-20240130

lmatz commented 8 months ago

The throughput increased likely because of 0cd9ff11f0ac5eb10f06a1c2dde2b806334c261c https://github.com/risingwavelabs/risingwave/pull/14558

SCR-20240205-f40 https://github.com/risingwavelabs/rw-commits-history?tab=readme-ov-file#nightly-20240115

now drops because of 9417409957dbdc047cc758b6652ae5134a68a9b4 https://github.com/risingwavelabs/risingwave/pull/14855

14588 vs 14855, what a coincidence

TennyZhuang commented 8 months ago

Interesting, but I can't explain this phenomenon. IMO this PR should only have an effect on UDF :)

fuyufjh commented 8 months ago

Interesting +1, Worth investigating the cause, I think

cyliu0 commented 5 months ago

Any update? The perf never came back to the original level http://metabase.risingwave-cloud.xyz/question/1304-nexmark-q5-many-windows-blackhole-medium-1cn-avg-source-output-rows-per-second-rows-s-history-thtb-266?start_date=2024-01-22

fuyufjh commented 5 months ago

@TennyZhuang Any updates?

fuyufjh commented 5 months ago
CREATE SINK nexmark_q5_many_windows
AS
SELECT
    AuctionBids.auction, AuctionBids.num
FROM (
    SELECT
        bid.auction,
        count(*) AS num,
        window_start AS starttime
    FROM
        HOP(bid, date_time, INTERVAL '5' SECOND, INTERVAL '5' MINUTE)
    GROUP BY
        bid.auction,
        window_start
) AS AuctionBids
JOIN (
    SELECT
        max(CountBids.num) AS maxn,
        CountBids.starttime_c
    FROM (
        SELECT
            count(*) AS num,
            window_start AS starttime_c
        FROM
            HOP(bid, date_time, INTERVAL '5' SECOND, INTERVAL '5' MINUTE)
        GROUP BY
            bid.auction,
            window_start
        ) AS CountBids
    GROUP BY
        CountBids.starttime_c
    ) AS MaxBids
ON
    AuctionBids.starttime = MaxBids.starttime_c AND
    AuctionBids.num >= MaxBids.maxn
WITH ( connector = 'blackhole', type = 'append-only', force_append_only = 'true');

Plan:

 StreamSink { type: append-only, columns: [auction, num, window_start(hidden), window_start#1(hidden)] }
 └─StreamProject { exprs: [$expr1, count, window_start, window_start] }
   └─StreamFilter { predicate: (count >= max(count)) }
     └─StreamHashJoin { type: Inner, predicate: window_start = window_start }
       ├─StreamExchange { dist: HashShard(window_start) }
       │ └─StreamShare { id: 7 }
       │   └─StreamHashAgg [append_only] { group_key: [$expr1, window_start], aggs: [count] }
       │     └─StreamExchange { dist: HashShard($expr1, window_start) }
       │       └─StreamHopWindow { time_col: $expr2, slide: 00:00:05, size: 00:05:00, output: [$expr1, window_start, _row_id] }
       │         └─StreamProject { exprs: [Field(bid, 0:Int32) as $expr1, Field(bid, 5:Int32) as $expr2, _row_id] }
       │           └─StreamFilter { predicate: IsNotNull(Field(bid, 5:Int32)) AND (event_type = 2:Int32) }
       │             └─StreamRowIdGen { row_id_index: 4 }
       │               └─StreamSource { source: nexmark, columns: [event_type, person, auction, bid, _row_id] }
       └─StreamProject { exprs: [window_start, max(count)] }
         └─StreamHashAgg { group_key: [window_start], aggs: [max(count), count] }
           └─StreamExchange { dist: HashShard(window_start) }
             └─StreamShare { id: 7 }
               └─StreamHashAgg [append_only] { group_key: [$expr1, window_start], aggs: [count] }
                 └─StreamExchange { dist: HashShard($expr1, window_start) }
                   └─StreamHopWindow { time_col: $expr2, slide: 00:00:05, size: 00:05:00, output: [$expr1, window_start, _row_id] }
                     └─StreamProject { exprs: [Field(bid, 0:Int32) as $expr1, Field(bid, 5:Int32) as $expr2, _row_id] }
                       └─StreamFilter { predicate: IsNotNull(Field(bid, 5:Int32)) AND (event_type = 2:Int32) }
                         └─StreamRowIdGen { row_id_index: 4 }
                           └─StreamSource { source: nexmark, columns: [event_type, person, auction, bid, _row_id] }
fuyufjh commented 5 months ago

It's indeed caused by https://github.com/risingwavelabs/risingwave/pull/14558, but the reason is unknown. Will continue to investigate.

CPU flamegraph: profile results.zip

fuyufjh commented 3 months ago

It's indeed caused by #14558, but the reason is unknown. Will continue to investigate.

No results.

Let's close the issue as the problem was already solved.