lmatz opened 7 months ago
https://risingwave-labs.slack.com/archives/C04R6R5236C/p1708300803851039
These Nexmark runs use a setting with 1 machine for the CN and 1 machine for the compactor.
For nexmark-q0-blackhole-4x-medium-1cn-affinity, aka the 4X scaling-up setting: http://metabase.risingwave-cloud.xyz/question/11046-nexmark-q0-blackhole-4x-medium-1cn-affinity-avg-source-output-rows-per-second-rows-s-history-thtb-2730?start_date=2024-01-11
02-10:
02-17:
For nexmark-q0-blackhole-medium-4cn-1node-affinity, aka the scaling-out setting: http://metabase.risingwave-cloud.xyz/question/11043-nexmark-q0-blackhole-medium-4cn-1node-affinity-avg-source-output-rows-per-second-rows-s-history-thtb-2543?start_date=2024-01-11
02-10:
02-17:
Scaling out shows an additional problem: CPU usage across the compute nodes is uneven. Looking into it.
But anyway, for the baseline 1cn setting: http://metabase.risingwave-cloud.xyz/question/36-nexmark-q0-blackhole-medium-1cn-affinity-avg-source-output-rows-per-second-rows-s-history-thtb-169?start_date=2023-08-28
there is no difference between 02-10 and 02-17.
Besides, q0 is a stateless query that does no computation.
Therefore, considering these two factors, I suspect something in the testing environment is causing this regression. cc: @huangjw806
That said, we cannot completely rule out the kernel as the root cause. Looking into it.
Just triggered a test with nightly-20240210 to verify whether it is a kernel or environment problem:
https://buildkite.com/risingwave-test/nexmark-benchmark/builds/3084
This is the scaling-out setting: http://metabase.risingwave-cloud.xyz/question/11043-nexmark-q0-blackhole-medium-4cn-1node-affinity-avg-source-output-rows-per-second-rows-s-history-thtb-2543?start_date=2024-01-11
It does not seem to be a kernel issue, cc: @huangjw806.
Both nightly-20240217 and the new ad-hoc nightly-20240210 run I triggered are slower than before: both around 3M rows/s versus 3.7M rows/s previously.
However, I just triggered another test with nightly-20240210 for the scaling-up setting:
http://metabase.risingwave-cloud.xyz/question/12347-nexmark-q0-blackhole-4x-medium-1cn-affinity-avg-source-output-rows-per-second-rows-s-history-thtb-2730?start_date=2024-01-21
https://buildkite.com/risingwave-test/nexmark-benchmark/builds/3086
The throughput goes up to the previous stable number again.
I am confused...
But I think both the environment and the kernel are worth investigating.
We note that both settings once reached an even higher number and then fell back.
> I suspect something in the testing environment is causing this regression

It looks like the test environment is unchanged.
For nexmark-q0-blackhole-4x-medium-1cn-affinity, aka the 4X scaling-up setting:
It is because of the imbalance of consumption across splits. As the left graph shows, some splits have higher throughput and their events are consumed very early, so in the second half of the test there are not enough active splits left to reach peak throughput.
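A toy model of the effect described above (all split counts and rates here are hypothetical, not taken from the actual run): if splits are drained at uneven rates, the fast ones run dry early, and aggregate throughput in the second half of the run is capped by the few splits still active.

```python
# Toy model: 4 Kafka splits with equal event counts but uneven consumption rates.
# Once the fast splits are drained, fewer active splits remain, capping throughput.
EVENTS_PER_SPLIT = 1_000_000
rates = [400_000, 300_000, 200_000, 100_000]  # events/s per split (hypothetical)

remaining = [EVENTS_PER_SPLIT] * len(rates)
dt = 0.1  # simulation step in seconds
t = 0.0
throughput_log = []
while any(r > 0 for r in remaining):
    consumed = 0.0
    for i, rate in enumerate(rates):
        take = min(remaining[i], rate * dt)  # a dry split contributes nothing
        remaining[i] -= take
        consumed += take
    throughput_log.append(consumed / dt)
    t += dt

print(f"peak throughput:  {max(throughput_log):,.0f} events/s")
print(f"final throughput: {throughput_log[-2]:,.0f} events/s")
print(f"total wall time:  {t:.1f}s")
```

The aggregate starts at the sum of all split rates but degrades stepwise as each split is exhausted, ending at the rate of the single slowest split, which is the shape seen in the throughput graph.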
Maybe with more CN resources, Kafka's bottleneck (AWS EBS) becomes more significant.
Maybe related. https://github.com/risingwavelabs/risingwave/issues/5214
Yeah, does the uneven CPU usage across the compute nodes under the scaling-out setting imply that the number of splits is uneven across the compute nodes?
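A back-of-the-envelope illustration of that suspicion (the split and node counts are assumptions, not the actual benchmark values): whenever the number of splits is not a multiple of the number of compute nodes, a simple round-robin assignment necessarily gives some nodes more splits, and their source-side CPU usage scales accordingly.

```python
# Hypothetical round-robin assignment of Kafka splits to compute nodes.
NUM_SPLITS = 6  # assumed, not the actual benchmark value
NUM_NODES = 4

assignment = {node: [] for node in range(NUM_NODES)}
for split in range(NUM_SPLITS):
    assignment[split % NUM_NODES].append(split)

for node, splits in assignment.items():
    print(f"CN-{node}: {len(splits)} split(s) -> {splits}")
```

With 6 splits over 4 nodes, two nodes get 2 splits and two get 1, so the busier nodes do roughly twice the source work, which would show up as the kind of uneven CPU usage observed.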
Marking it high-priority, as it may make a lot of other evaluations difficult to reason about.
Just discussed with @huangjw806 this afternoon.
The current machine setting: RW's CN and compactor run on 32c64g (c6i.8xlarge), while Kafka runs on 8c16g (c6i.2xlarge).
Note that the network bandwidth of the Kafka machine is "up to 12.5 Gbps" rather than a sustained 12.5 Gbps. Per my understanding, there are certain limitations on peak bandwidth: https://stackoverflow.com/questions/71443685/meaning-of-up-to-10-gbps-bandwidth-in-ec2-instances. The peak can only be sustained for a certain number of minutes, or under some other opaque rules.
Consider that the peak throughput we get from RW on the dashboard is around 1500 MB/s (sometimes a little over), i.e. 12 Gbps.
We want to rule out the possibility that the imbalance is due to the "up to 12.5 Gbps" limitation.
@huangjw806 is helping get the new numbers by also upgrading the Kafka machine from c6i.2xlarge to c6i.8xlarge.
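As a sanity check on that arithmetic (a unit conversion only; the 1500 MB/s figure is the dashboard peak stated above):

```python
# Convert the observed dashboard throughput to Gbps and compare it
# against the instance's "up to 12.5 Gbps" burst network limit.
observed_mb_per_s = 1500    # MB/s, peak seen on the dashboard
burst_limit_gbps = 12.5     # c6i.2xlarge "up to" network bandwidth

observed_gbps = observed_mb_per_s * 8 / 1000  # 1 MB/s = 8 Mb/s; 1000 Mb = 1 Gb
print(f"observed: {observed_gbps} Gbps, burst limit: {burst_limit_gbps} Gbps")
print(f"headroom: {burst_limit_gbps - observed_gbps} Gbps")
```

At 12 Gbps observed against a 12.5 Gbps burst ceiling, the headroom is only 0.5 Gbps, so the source throughput is plausibly sitting right at the Kafka machine's network limit.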
This issue has been open for 60 days with no activity. Could you please update the status? Feel free to continue discussion or close as not planned.
https://risingwave-labs.slack.com/archives/C04R6R5236C/p1708300808010129
- nexmark-q0-blackhole-4x-medium-1cn-affinity, scaling up 4X: http://metabase.risingwave-cloud.xyz/question/12347-nexmark-q0-blackhole-4x-medium-1cn-affinity-avg-source-output-rows-per-second-rows-s-history-thtb-2730?start_date=2024-01-21
- nexmark-q0-blackhole-medium-4cn-1node-affinity, scaling out 4X: http://metabase.risingwave-cloud.xyz/question/11043-nexmark-q0-blackhole-medium-4cn-1node-affinity-avg-source-output-rows-per-second-rows-s-history-thtb-2543?start_date=2024-01-11
- nexmark-q7-blackhole-medium-1cn-affinity, baseline: http://metabase.risingwave-cloud.xyz/question/1502-nexmark-q7-blackhole-medium-1cn-affinity-avg-source-output-rows-per-second-rows-s-history-thtb-190?start_date=2023-09-08
- nexmark-q17-blackhole-4x-medium-1cn-affinity, scaling up 4X: http://metabase.risingwave-cloud.xyz/question/9270-nexmark-q17-blackhole-4x-medium-1cn-affinity-avg-source-output-rows-per-second-rows-s-history-thtb-2767?start_date=2024-01-04

q0 is a stateless query that does no computation. These are affinity settings.