Is it because the memory is split between the CN and the compactor node? Does the CN take 13 GB and the compactor node 3 GB? cc @huangjw806
It seems we can further reduce the memory allocation for the compactor node; it takes much less than 3 GB.
But this suggests that if we want to colocate components in the future to further reduce cost and improve cost-performance efficiency, static memory allocation is not ideal.
> CN takes 13 GB and compactor node takes 3 GB?
CN takes 12 GB and compactor node takes 3 GB.
I see, so the total memory usage being << 16 GB is expected,
but the high data cache miss rate is still unexplained 🤔
I noticed the data cache size in Grafana shows 1.15 GiB, but I'm not sure where the value comes from (I noticed some calculation in storage_memory_config).
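For context, a minimal sketch of how such a value could be derived, assuming the data (block) cache is sized as a fixed fraction of the non-reserved memory. The function name, cache ratios, and reserved fraction below are illustrative assumptions, not RisingWave's actual storage_memory_config logic:

```rust
/// Hypothetical sketch: split a compute node's memory budget into cache
/// capacities. The 30%/10% cache ratios and the reserved fraction are
/// assumptions for illustration only.
fn storage_memory_split(total_bytes: u64, reserved_fraction: f64) -> (u64, u64) {
    let reserved = (total_bytes as f64 * reserved_fraction) as u64;
    let available = total_bytes - reserved;
    // Suppose ~30% of the available memory goes to the data (block) cache
    // and ~10% to the meta cache; the rest is left for executors.
    let data_cache = (available as f64 * 0.3) as u64;
    let meta_cache = (available as f64 * 0.1) as u64;
    (data_cache, meta_cache)
}

fn main() {
    const GIB: u64 = 1 << 30;
    let (data, meta) = storage_memory_split(12 * GIB, 0.20);
    println!("data cache: {:.2} GiB", data as f64 / GIB as f64);
    println!("meta cache: {:.2} GiB", meta as f64 / GIB as f64);
}
```

With ratios like these, a 12 GB CN would get a data cache of a few GiB, so a 1.15 GiB cache suggests the actual ratios (or the memory the config sees) are smaller than assumed here.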
@huangjw806 let's try one more time, with Compactor 1 GB and Compute Node 14 GB, and reserved memory = 10% instead of 20%.
14GB CN and reserved memory = 10%: https://buildkite.com/risingwave-test/nexmark-benchmark/builds/1398#01892a0d-1879-4c36-8d43-fc675590a367
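For reference, a quick back-of-the-envelope on what that change frees up (assuming reserved memory is simply that fraction of the CN total): 14 GB × 20% = 2.8 GB reserved, leaving ~11.2 GB usable, while 14 GB × 10% = 1.4 GB reserved, leaving ~12.6 GB usable, i.e. roughly 1.4 GB more headroom for caches and executors.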
If the object total size is the size of the current state, I cannot understand why such a small state (smaller than the memory usage of the compute node) can induce a 30%+ cache miss rate.
Why is the distinct agg cache miss rate so high?
The object total size is measured after compression.
We should look at the KV size, which is larger than the memory of the compute node.
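To make that concrete with assumed numbers (the compression ratio here is purely illustrative): if the compressed object total size were, say, 6 GB and the compression ratio were ~3x, the uncompressed KV size would be around 18 GB, well above the CN's memory, which would be enough to explain a 30%+ cache miss rate.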
The performance right now for a single-machine deployment is close to 100K rows/s. This is acceptable, since it is inevitable to encounter cache misses and fetch things from S3 (which induces much higher latency than EBS).
We can benchmark again when the file cache is enabled.
Metabase: http://risingwave-perf-test-dashboard-metabase.us-west-2.elasticbeanstalk.com/question/3272-nexmark-q16-blackhole-medium-1cn-affinity-avg-source-output-rows-per-second-rows-s-history-thtb-179?start_date=2023-05-19
Grafana Dashboard: https://grafana.test.risingwave-cloud.xyz/d/EpkBw5W4k/risingwave-dev-dashboard?orgId=1&var-datasource=Prometheus:%20test-useast1-eks-a&from=1688576918000&to=1688578721000&var-namespace=nexmark-bs-15-105-affinity-daily-20230705
This is a 16 GB machine.
Points 1 and 2 seem to contradict each other.
Link #7271: RW is worse than Flink (e.g., roughly 1/2 the throughput) for this query.