risingwavelabs / risingwave

SQL stream processing, analytics, and management. We decouple storage and compute to offer instant failover, dynamic scaling, speedy bootstrapping, and efficient joins.
https://www.risingwave.com/slack
Apache License 2.0

nightly-20240518 weekly test perf degradation #16817

Open cyliu0 opened 1 month ago

cyliu0 commented 1 month ago

Describe the bug

Perf degradation in the weekly test. These two SKUs now run only in the weekly test.

https://buildkite.com/risingwave-test/tpch-benchmark/builds/1074

https://buildkite.com/risingwave-test/nexmark-benchmark/builds/3708

http://metabase.risingwave-cloud.xyz/question/4966-tpch-q8-bs-medium-1cn-affinity-avg-source-output-rows-per-second-rows-s-history-thtb-371?start_date=2023-09-24

http://metabase.risingwave-cloud.xyz/question/9112-nexmark-q7-rewrite-blackhole-4x-medium-1cn-affinity-avg-source-output-rows-per-second-rows-s-history-thtb-2808?start_date=2023-12-21

+----------------------------------------------------------+--------------+------------+-----------------------------------+---------------------+-----------------------------+-------------------------------+
| BENCHMARK NAME                                           | EXECUTION ID | STATUS     | KEY METRICS                       | FLUCTUATION OF BEST | FLUCTUATION OF LAST 10 DAYS | FLUCTUATION OF LAST EXECUTION |
+----------------------------------------------------------+--------------+------------+-----------------------------------+---------------------+-----------------------------+-------------------------------+
| nexmark-q7-rewrite-blackhole-4x-medium-1cn-affinity      |        28800 | Negative   | avg-source-output-rows-per-second | -28.78%             | -15.22%                     | -26.42%                       |
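| tpch-q8-bs-medium-1cn-affinity                           |        28826 | Negative   | avg-source-output-rows-per-second | -49.35%             | -29.35%                     | -43.13%                       |
+----------------------------------------------------------+--------------+------------+-----------------------------------+---------------------+-----------------------------+-------------------------------+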
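
For readers unfamiliar with the fluctuation columns, here is a minimal sketch of how such a percentage could be computed, assuming it is the relative change of the current throughput against a reference value (best run, last-10-days average, or last execution). The exact formula used by the benchmark tooling is not shown in this issue, so `fluctuation_pct`, `status`, the 5% tolerance, and the "Positive"/"Neutral" labels are all hypothetical.

```python
def fluctuation_pct(current: float, reference: float) -> float:
    """Relative change of `current` against `reference`, in percent."""
    if reference <= 0:
        raise ValueError("reference throughput must be positive")
    return (current - reference) / reference * 100.0


def status(pct: float, tolerance_pct: float = 5.0) -> str:
    """Classify a run. The tolerance is an assumed threshold, not the
    one used by the real benchmark pipeline."""
    if pct < -tolerance_pct:
        return "Negative"
    if pct > tolerance_pct:
        return "Positive"
    return "Neutral"


# Made-up throughput values (rows/s), not taken from the benchmark runs.
best, current = 200_000.0, 150_000.0
pct = fluctuation_pct(current, best)
print(f"{pct:+.2f}% -> {status(pct)}")  # prints "-25.00% -> Negative"
```

Under this reading, the -49.35% in the "FLUCTUATION OF BEST" column would mean the run reached roughly half of the best recorded source throughput.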

Error message/log

No response

To Reproduce

No response

Expected behavior

No response

How did you deploy RisingWave?

No response

The version of RisingWave

No response

Additional context

No response

st1page commented 1 month ago

For nexmark-q7-rewrite-blackhole-4x-medium-1cn-affinity, it seems to be the same issue as https://github.com/risingwavelabs/risingwave/issues/15142: there is greater imbalance on 5.18. The tpch-q8 degradation could be due to other issues.

5.12: image

5.18: image

cc @lmatz

st1page commented 1 month ago

No conclusion has been reached regarding the reason for the performance degradation of TPCH Q8. The current phenomena:

  1. From the perspective of backpressure, the bottleneck occurs at a certain append-only hash join
  2. Both operator cache and block cache show higher miss rates on the 18th compared to the 12th.

rerun 0518 (slow): https://buildkite.com/risingwave-test/tpch-benchmark/builds/1075
test 0514 (fast): https://buildkite.com/risingwave-test/tpch-benchmark/builds/1076
test 0515 ( ): https://buildkite.com/risingwave-test/tpch-benchmark/builds/1077
test 0516 ( ): https://buildkite.com/risingwave-test/tpch-benchmark/builds/1078

lmatz commented 1 month ago

For nexmark q7, the network bandwidth between RW and Kafka is not the same:

Previously: SCR-20240520-l4q

This time: SCR-20240520-l4t

https://buildkite.com/risingwave-test/nexmark-benchmark/builds/3708


Name:             benchmark-kafka-0
Namespace:        nexmark-ht-4x-1cn-affinity-weekly-20240518
Command:
  /scripts/setup.sh
State:          Running
  Started:      Sat, 18 May 2024 17:04:06 +0000
Ready:          True
Restart Count:  0
Limits:
  cpu:     8
  memory:  13Gi
Requests:
  cpu:      7
  memory:   13Gi

Hmm, there is a slight chance that Kafka does not have enough resources, although Kafka should be I/O bound rather than CPU bound.

Alternatively, the machine may not be large enough, so only an "unstable" "up to 12.5Gbps" bandwidth can be achieved; see https://github.com/risingwavelabs/risingwave/issues/15142#issuecomment-1956060259

st1page commented 1 month ago

Overall, there has been some fluctuation in the performance of the images from the 15th and 16th, but I believe the main performance drop is due to a change in nightly-20240517. https://github.com/risingwavelabs/rw-commits-history?tab=readme-ov-file#nightly-20240517

http://metabase.risingwave-cloud.xyz/question/4966-tpch-q8-bs-medium-1cn-affinity-avg-source-output-rows-per-second-rows-s-history-thtb-371?start_date=2024-05-20

st1page commented 1 month ago

Unfortunately, the degradation appears randomly... It can happen with 0516's image yet not with 0517's image... http://metabase.risingwave-cloud.xyz/question/4966-tpch-q8-bs-medium-1cn-affinity-avg-source-output-rows-per-second-rows-s-history-thtb-371?start_date=2024-05-20
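
Since the drop shows up on some runs of an image and not on others, one way to separate a real regression from run-to-run noise is to repeat each image several times and compare the gap between the means against the spread within each image. The following is only a rough sketch of that idea: `looks_like_regression`, the factor `k`, and all throughput numbers are hypothetical and not taken from the builds linked above.

```python
from statistics import mean, stdev


def looks_like_regression(baseline: list[float], candidate: list[float],
                          k: float = 2.0) -> bool:
    """Return True if the candidate's mean throughput is lower than the
    baseline's by more than k times the larger within-image standard
    deviation. This is a crude heuristic, not a formal statistical test."""
    if len(baseline) < 2 or len(candidate) < 2:
        raise ValueError("need at least two runs per image")
    noise = max(stdev(baseline), stdev(candidate))
    return mean(baseline) - mean(candidate) > k * noise


# Made-up throughput samples (rows/s, scaled down) for illustration only.
stable_old = [100.0, 98.0, 102.0]
steady_drop = [70.0, 72.0, 68.0]    # every run is slow
random_drop = [60.0, 99.0, 101.0]   # only one run is slow

print(looks_like_regression(stable_old, steady_drop))  # True
print(looks_like_regression(stable_old, random_drop))  # False
```

With a pattern like `random_drop`, bisecting by nightly image would be unreliable, which matches what is observed here between the 0516 and 0517 images.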

st1page commented 1 month ago

The unstable degradation of q8 happens on nightly-20240512 too:

image