risingwavelabs / risingwave

Best-in-class stream processing, analytics, and management. Perform continuous analytics, or build event-driven applications, real-time ETL pipelines, and feature stores in minutes. Unified streaming and batch. PostgreSQL compatible.

q16 has high data cache miss rate but the state is small and memory is not fully utilized #10777

Closed lmatz closed 1 year ago

lmatz commented 1 year ago

Metabase: http://risingwave-perf-test-dashboard-metabase.us-west-2.elasticbeanstalk.com/question/3272-nexmark-q16-blackhole-medium-1cn-affinity-avg-source-output-rows-per-second-rows-s-history-thtb-179?start_date=2023-05-19

Grafana Dashboard: https://grafana.test.risingwave-cloud.xyz/d/EpkBw5W4k/risingwave-dev-dashboard?orgId=1&var-datasource=Prometheus:%20test-useast1-eks-a&from=1688576918000&to=1688578721000&var-namespace=nexmark-bs-15-105-affinity-daily-20230705

This is a 16 GB machine

(Screenshots: SCR-20230706-ezk, SCR-20230706-eyu, SCR-20230706-ez8)

  1. The 16 GB of memory is not entirely used: the CN and the Compactor together (they are deployed on the same machine) still use under 10 GB.
  2. The state (all the SST file sizes added together) is smaller than the current memory usage, so it is unclear why the cache miss rate is so high.

1 and 2 seem to contradict each other

Link: #7271. RW is worse than Flink for this query, e.g. only around 1/2 of Flink's throughput.

lmatz commented 1 year ago

Is it because the memory is split between the CN and the compactor node? cc @huangjw806: does the CN take 13 GB and the compactor node 3 GB?

It seems we can further reduce the memory allocation for the compactor node; it actually uses much less than 3 GB.

lmatz commented 1 year ago

But this suggests that if we want to co-locate components in the future to further reduce cost and improve cost-performance efficiency, static memory allocation is not ideal.

huangjw806 commented 1 year ago

CN takes 13 GB and compactor node takes 3 GB?

CN takes 12 GB and compactor node takes 3 GB.

lmatz commented 1 year ago

I see, so the total memory usage being well below 16 GB is expected,

but the high data cache miss rate is still unexplained 🤔

yuhao-su commented 1 year ago

I noticed the data cache size in Grafana shows 1.15 GiB, but I'm not sure where that value comes from (I noticed some calculation in storage_memory_config).
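For context, here is a minimal Rust sketch of the kind of derivation that could turn a double-digit-GiB total budget into a ~1 GiB data cache. All the proportions below are made-up illustrative values, not the actual defaults; the real logic is whatever storage_memory_config does.

```rust
// Hypothetical sketch only: the proportions are assumed for illustration,
// not taken from RisingWave's actual storage_memory_config.
const RESERVED_MEMORY_PROPORTION: f64 = 0.20; // assumed: memory held back for allocator/runtime overhead
const STORAGE_MEMORY_PROPORTION: f64 = 0.30;  // assumed: share of usable memory given to storage
const BLOCK_CACHE_PROPORTION: f64 = 0.30;     // assumed: share of storage memory given to the data (block) cache

fn block_cache_capacity_bytes(total_memory_bytes: usize) -> usize {
    // Peel off the reserved portion, then take the storage share, then the block-cache share.
    let usable = total_memory_bytes as f64 * (1.0 - RESERVED_MEMORY_PROPORTION);
    let storage = usable * STORAGE_MEMORY_PROPORTION;
    (storage * BLOCK_CACHE_PROPORTION) as usize
}

fn main() {
    let total: usize = 13 * (1 << 30); // 13 GiB compute node total memory
    let cache = block_cache_capacity_bytes(total);
    println!(
        "data (block) cache ≈ {:.2} GiB",
        cache as f64 / (1u64 << 30) as f64
    );
    // With these assumed proportions: 13 GiB * 0.8 * 0.3 * 0.3 ≈ 0.94 GiB.
    // The data cache ends up being a small fraction of total memory even
    // though the node as a whole looks under-utilized.
}
```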

lmatz commented 1 year ago

@huangjw806 let's try one more time, with the Compactor at 1 GB and the Compute Node at 14 GB, and reserved memory = 10% instead of 20%.
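(Back-of-the-envelope, assuming reserved memory is a flat fraction of the CN's total: 14 GiB × 0.9 ≈ 12.6 GiB usable, versus roughly 12 GiB × 0.8 ≈ 9.6 GiB before, i.e. on the order of 3 GiB more headroom for caches and operators.)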

lmatz commented 1 year ago

14GB CN and reserved memory = 10%: https://buildkite.com/risingwave-test/nexmark-benchmark/builds/1398#01892a0d-1879-4c36-8d43-fc675590a367

Grafana: https://grafana.test.risingwave-cloud.xyz/d/EpkBw5W4k/risingwave-dev-dashboard?from=1688630109000&orgId=1&to=1688631912000&var-datasource=Prometheus%3A+test-useast1-eks-a&var-namespace=nexmark-bs-15-105-affinity-reserve-10-percents-memory-test

(Screenshots: SCR-20230707-j2c, SCR-20230707-j2i, SCR-20230707-j2r, SCR-20230707-lrs)

If the object total size is the size of the current state, I cannot understand how such a small state (smaller than the memory usage of the compute node) can induce a 30%+ cache miss rate.

lmatz commented 1 year ago

(Screenshots: SCR-20230707-w9c, SCR-20230707-w90, SCR-20230707-w96)

Why is the distinct agg cache miss rate so high?
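(One possible explanation, to be verified: a distinct aggregation keeps a deduplication table per group key and distinct column, so its working set scales with the number of distinct values rather than with the number of groups; once that working set exceeds the dedup cache, misses show up even though the overall state looks small.)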

lmatz commented 1 year ago

The object total size is measured after compression. We should look at the KV size instead, which is larger than the memory of the compute node.
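(As a purely illustrative example: with a hypothetical 3:1 block compression ratio, every 1 GiB of object size corresponds to roughly 3 GiB of uncompressed KV data, so a state that looks smaller than the CN's memory on the object-size panel can still be far larger than the ~1 GiB data cache once decompressed.)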

The performance right now for a single-machine deployment is close to 100K rows/s. This is acceptable, since cache misses are inevitable and fetching from S3 induces much higher latency than EBS.

We can benchmark again when the file cache is enabled.