perf: optimize insert perfomance

lmatz commented 2 months ago

We used to run sysbench on 3 8C16G CN: dashboard: http://metabase.risingwave-cloud.xyz/dashboard/278-sysbench-3cn-rw-qps example: https://buildkite.com/risingwave-test/sysbench/builds/698#018e061f-1719-4907-9c2c-15bddb895327

But lately we have been running sysbench on 1 8C16G CN: dashboard: http://metabase.risingwave-cloud.xyz/dashboard/1133-sysbench-1cn-rw-qps example: https://buildkite.com/risingwave-test/sysbench/builds/746#018ee38f-d9de-4f54-b698-7e7518a4a829

In Sysbench, we have two insert benchmarks:

OLTP-insert, which is included in the vanilla sysbench: https://github.com/risingwavelabs/sysbench/blob/master/src/lua/oltp_insert.lua
Bulk-insert, which is added by us: https://github.com/risingwavelabs/sysbench/blob/master/src/lua/bulk_insert.lua

There are feedback from users that the insert performance is not desirable. We want to improve it.

One observation is that 3CN(128 sysbench threads) vs 1CN(128 sysbench threads) vs 1CN (256 sysbench threads) do not differ from each other too much.

Some other common observations:

The frontend is CPU-intensive, i.e. 600+%, but not utilizing all 8 CPUs.
The CPU utilization of compute node is low.

We enabled CPU profiling by default. The CPU flamegraph is generated and uploaded in buildkite pipelines, e.g. https://buildkite.com/risingwave-test/sysbench/builds/746#018ee38f-d9de-4f54-b698-7e7518a4a829 under artifacts tab.

lmatz commented 2 months ago

One example: https://buildkite.com/risingwave-test/sysbench/builds/755#018ef126-d6ee-41e0-a309-606a9c89119d

CN and FN FlameGraph under the artifacts tab.

oltp-insert workload. no checks in MV executor (disabled index): https://github.com/risingwavelabs/sysbench/blob/master/src/lua/oltp_insert.lua#L50-L61

Grafana: https://grafana.test.risingwave-cloud.xyz/d/EpkBw5W4k/risingwave-dev-dashboard?orgId=1&var-datasource=Prometheus:%20test-useast1-eks-a&from=1713443508000&to=1713443906000&var-namespace=sysbench-lmatz-test

Frontend is using close to 7CPUs out of 8CPUs in total. Frontend CPU usage is much higher than Compute CPU usage.

Probably we optimize the CPU usage of Frontend.

SCR-20240418-tg5 SCR-20240418-tcd

Considering that the insert statement is the same, I suppose FN does not need to spend so much time on gen_batch_plan_by_statement? Need something like a plan cache? cc: @chenzl25

chenzl25 commented 2 months ago

Considering that the insert statement is the same, I suppose FN does not need to spend so much time on gen_batch_plan_by_statement? Need something like a plan cache? cc: @chenzl25

Theoretically, yes. Plan cache can improve the performance in this scenario, but it has some shortcomings as well, i.e. it needs to normalize the SQL and parameterize it which would introduce an additional overhead, furthermore, from the optimizer view, during optimization we can't see the actual parameter anymore which would introduce a huge refactoring to RisingWave. With a plan cache, people might introduce a new optimization without considering how it will affect the optimization time anymore, so I think we'd better do not introduce plan cache.

risingwavelabs / risingwave

perf: optimize insert perfomance #16347