risingwavelabs / risingwave


2024-02-18 `Nexmark` performance regression `Source Throughput Imbalance` #15142

Open · lmatz opened this issue 7 months ago

lmatz commented 7 months ago

[Screenshot: SCR-20240220-e03]

https://risingwave-labs.slack.com/archives/C04R6R5236C/p1708300808010129

nexmark-q0-blackhole-4x-medium-1cn-affinity scaling up 4X: http://metabase.risingwave-cloud.xyz/question/12347-nexmark-q0-blackhole-4x-medium-1cn-affinity-avg-source-output-rows-per-second-rows-s-history-thtb-2730?start_date=2024-01-21

nexmark-q0-blackhole-medium-4cn-1node-affinity scaling out 4X: http://metabase.risingwave-cloud.xyz/question/11043-nexmark-q0-blackhole-medium-4cn-1node-affinity-avg-source-output-rows-per-second-rows-s-history-thtb-2543?start_date=2024-01-11

nexmark-q7-blackhole-medium-1cn-affinity baseline: http://metabase.risingwave-cloud.xyz/question/1502-nexmark-q7-blackhole-medium-1cn-affinity-avg-source-output-rows-per-second-rows-s-history-thtb-190?start_date=2023-09-08

nexmark-q17-blackhole-4x-medium-1cn-affinity scaling up 4X: http://metabase.risingwave-cloud.xyz/question/9270-nexmark-q17-blackhole-4x-medium-1cn-affinity-avg-source-output-rows-per-second-rows-s-history-thtb-2767?start_date=2024-01-04

q0 is a stateless query that does no computation.

These are all affinity settings.

lmatz commented 7 months ago

https://risingwave-labs.slack.com/archives/C04R6R5236C/p1708300803851039

[Screenshot: SCR-20240220-fmd]

These Nexmark runs use one machine for the CN and one machine for the compactor.

lmatz commented 7 months ago

For nexmark-q0-blackhole-4x-medium-1cn-affinity, i.e. the scaling-up 4X setting: http://metabase.risingwave-cloud.xyz/question/11046-nexmark-q0-blackhole-4x-medium-1cn-affinity-avg-source-output-rows-per-second-rows-s-history-thtb-2730?start_date=2024-01-11

02-10: [Screenshot: SCR-20240220-ify]

02-17: [Screenshot: SCR-20240220-ig7]

For nexmark-q0-blackhole-medium-4cn-1node-affinity, i.e. the scaling-out setting: http://metabase.risingwave-cloud.xyz/question/11043-nexmark-q0-blackhole-medium-4cn-1node-affinity-avg-source-output-rows-per-second-rows-s-history-thtb-2543?start_date=2024-01-11

02-10: [Screenshot: SCR-20240220-iru]

02-17: [Screenshot: SCR-20240220-is1]

Scaling out shows an additional problem: CPU usage across the compute nodes is uneven. Looking into it...

But anyway, for the baseline 1cn setting: http://metabase.risingwave-cloud.xyz/question/36-nexmark-q0-blackhole-medium-1cn-affinity-avg-source-output-rows-per-second-rows-s-history-thtb-169?start_date=2023-08-28

02-10 and 02-17 show no difference.
Besides, q0 is a stateless query that does no computation.

Therefore, considering these two factors, I suspect that something in the testing environment is causing this regression. cc: @huangjw806

That said, we cannot completely rule out the possibility that the root cause is in the kernel. Looking into it.

lmatz commented 7 months ago

Just triggered a test with nightly-20240210 to verify whether this is a kernel problem or an environment problem:

https://buildkite.com/risingwave-test/nexmark-benchmark/builds/3084

This is the scaling-out setting: http://metabase.risingwave-cloud.xyz/question/11043-nexmark-q0-blackhole-medium-4cn-1node-affinity-avg-source-output-rows-per-second-rows-s-history-thtb-2543?start_date=2024-01-11

It does not seem to be a kernel issue. cc: @huangjw806

Both nightly-20240217 and the new ad-hoc nightly-20240210 run that I triggered are slower than before: about 3M rows/s versus 3.7M rows/s previously.
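
For scale, the gap between those two figures works out to roughly a 19% drop; a minimal check of the arithmetic (using the rounded numbers quoted above):

```rust
// Rough size of the regression: ~3.0M rows/s now versus ~3.7M rows/s before.
fn main() {
    let before = 3.7e6_f64; // rows/s, previous stable number
    let after = 3.0e6_f64;  // rows/s, nightly-20240217 and the ad-hoc nightly-20240210 run
    println!("regression: {:.1}%", (before - after) / before * 100.0); // ~18.9%
}
```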

lmatz commented 7 months ago

However, I just triggered another test with nightly-20240210 for the scaling-up setting: http://metabase.risingwave-cloud.xyz/question/12347-nexmark-q0-blackhole-4x-medium-1cn-affinity-avg-source-output-rows-per-second-rows-s-history-thtb-2730?start_date=2024-01-21

https://buildkite.com/risingwave-test/nexmark-benchmark/builds/3086

The throughput goes up to the previous stable number again.

I am confused......

But I think both the environment and the kernel are worth investigating.

We note that both settings once reached an even higher number and then fell back.

huangjw806 commented 7 months ago

> if there is anything in the testing environment that leads to this regression?

It looks like there is no difference in the test environment.

st1page commented 7 months ago

> For nexmark-q0-blackhole-4x-medium-1cn-affinity, aka scaling up 4X setting:

It is because of imbalanced consumption across the splits. As you can see in the left graph, some splits have higher throughput and their events are consumed very early, so in the second half of the test there are not enough active splits left to reach the peak throughput.

[image]

Maybe with more CN resources, Kafka's bottleneck (AWS EBS) becomes more significant.
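
A toy model of the split-exhaustion effect described above (hypothetical rates and sizes, not measurements from the actual run): once the fast splits drain, the aggregate source throughput is capped by whatever the remaining splits can deliver.

```rust
// Toy model of split exhaustion: a few splits are consumed faster, drain
// early, and the aggregate throughput in the second half of the test drops.
// All numbers below are hypothetical, chosen only to illustrate the shape.
fn main() {
    let events_per_split: f64 = 10_000_000.0;
    // 16 splits with uneven per-split consumption rates (rows/s).
    let rates: Vec<f64> = (0..16)
        .map(|i| if i < 4 { 400_000.0 } else { 150_000.0 })
        .collect();

    // Time at which each split runs out of events.
    let drain_times: Vec<f64> = rates.iter().map(|r| events_per_split / r).collect();

    // Aggregate throughput while all splits are active vs. after the fast ones drain.
    let full: f64 = rates.iter().sum();
    let after_fast_drained: f64 = rates
        .iter()
        .zip(&drain_times)
        .filter(|(_, t)| **t > drain_times[0]) // splits that outlive the fast ones
        .map(|(r, _)| r)
        .sum();

    println!("all splits active:       {:.1} M rows/s", full / 1e6);
    println!("after fast splits drain: {:.1} M rows/s", after_fast_drained / 1e6);
}
```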

Maybe related. https://github.com/risingwavelabs/risingwave/issues/5214

lmatz commented 7 months ago

Yeah, does the uneven CPU usage across the compute nodes under the scaling-out setting imply that the number of splits is uneven across the compute nodes?
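
For intuition, here is a hypothetical round-robin split assignment (not necessarily how RisingWave actually distributes splits) showing how uneven per-node split counts, and hence uneven CPU, naturally arise whenever the split count does not divide evenly by the node count:

```rust
// Hypothetical round-robin assignment of Kafka splits to compute nodes,
// only to illustrate how splits % nodes != 0 yields uneven per-node load.
fn assign_round_robin(num_splits: usize, num_nodes: usize) -> Vec<Vec<usize>> {
    let mut per_node = vec![Vec::new(); num_nodes];
    for split in 0..num_splits {
        per_node[split % num_nodes].push(split);
    }
    per_node
}

fn main() {
    // e.g. 6 splits over 4 compute nodes: two nodes get 2 splits, two get 1.
    for (node, splits) in assign_round_robin(6, 4).iter().enumerate() {
        println!("node {node}: {} splits {:?}", splits.len(), splits);
    }
}
```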

Marking this as high-priority, as it may make a lot of other evaluations difficult to reason about.

lmatz commented 7 months ago

Just discussed with @huangjw806 this afternoon.

The current machine setting: RW's CN and compactor run on 32c64g (c6i.8xlarge), while Kafka runs on 8c16g (c6i.2xlarge). [Screenshot: SCR-20240221-g7f]

Note that the network bandwidth of the Kafka machine is "up to 12.5 Gbps" rather than a sustained 12.5 Gbps. Per my understanding, there are certain limitations on peak bandwidth: https://stackoverflow.com/questions/71443685/meaning-of-up-to-10-gbps-bandwidth-in-ec2-instances. The peak can only be sustained for a limited number of minutes, or under other opaque rules.

Considering that the peak throughput we see from RW on the dashboard is around 1500 MB/s (sometimes a little over), i.e. about 12 Gbps, we want to rule out the possibility that the imbalance is due to the "up to 12.5 Gbps" limitation.

@huangjw806 is helping get new numbers by upgrading the Kafka machine from c6i.2xlarge to c6i.8xlarge as well.

github-actions[bot] commented 3 months ago

This issue has been open for 60 days with no activity. Could you please update the status? Feel free to continue discussion or close as not planned.