cyliu0 closed this issue 9 months ago
Do we have a memory dump for this case, like the ones from our longevity test? The Grafana metrics look normal to me: the node memory for all CNs is below the limit, compute = { limit = "13Gi", request = "13Gi" }. cc @lmatz
Yes. We have memory profiling enabled. You can check the same S3 bucket for the memory dump of this pipeline: https://buildkite.com/risingwave-test/ch-benchmark-pg-cdc/builds/143#018cb36c-0c27-48e2-b22f-94055d8cdd83
The bucket for the OOM incident above seems to be gone, so we may need to run it again to generate a dump.
@cyliu0 may need to test it again; linking #13060.
The reason might be that the parallelism is too high when we use the medium-arm-3cn-all-affinity testbed, which puts all the CNs on a single 32C64G node. https://buildkite.com/risingwave-test/ch-benchmark-pg-cdc/builds/219
q9 passes with the medium-3cn testbed, which uses 3 separate 8C16G nodes for the CNs. https://buildkite.com/risingwave-test/ch-benchmark-pg-cdc/builds/220
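A rough back-of-the-envelope sketch of why parallelism could matter here. It assumes, without confirmation from this thread, that each CN sets its parallelism to the number of CPU cores it can see, that a CN on the shared 32C64G node sees all 32 cores, and that the same 13Gi limit applies in both testbeds. The helper name is hypothetical and the numbers are arithmetic, not measurements.

```python
# Rough sketch, not a measurement: memory available per unit of streaming
# parallelism on one compute node (CN).
# ASSUMPTION: each CN's parallelism equals the number of CPU cores it can see.

MEMORY_LIMIT_GIB = 13  # from compute = { limit = "13Gi", request = "13Gi" }


def memory_per_parallel_unit_gib(memory_limit_gib: float, visible_cores: int) -> float:
    """Hypothetical helper: split a CN's memory limit evenly across its parallelism."""
    return memory_limit_gib / visible_cores


# medium-arm-3cn-all-affinity: all CNs on one 32C64G node
# (ASSUMPTION: each CN sees all 32 cores)
shared_node = memory_per_parallel_unit_gib(MEMORY_LIMIT_GIB, 32)

# medium-3cn: each CN on its own 8C16G node
separate_nodes = memory_per_parallel_unit_gib(MEMORY_LIMIT_GIB, 8)

print(f"shared 32C node:   {shared_node:.2f} GiB per parallel unit")
print(f"separate 8C nodes: {separate_nodes:.2f} GiB per parallel unit")
print(f"ratio: {separate_nodes / shared_node:.1f}x")
```

If those assumptions hold, each parallel unit on the shared node gets about a quarter of the memory it would get on a dedicated 8C node, which would be consistent with q9 passing on medium-3cn but hitting OOM on the all-affinity testbed.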
This is likely a duplicate of https://github.com/risingwavelabs/risingwave/issues/13060. Closing this one.
Describe the bug
ch-benchmark q9 causes a compute node OOM when there are 3 compute nodes, but not when there is only one compute node. Each compute node has the same memory size in both cases.
https://grafana.test.risingwave-cloud.xyz/d/EpkBw5W4k/risingwave-dev-dashboard?orgId=1&var-datasource=ebec273b-0774-4ccd-90a9-c2a22144d623&var-namespace=ch-pg-cdc-cy&var-instance=benchmark-risingwave&var-pod=All&var-component=All&var-table=All&from=1703731200000&to=now&refresh=10s
The first part of the dashboard shows the run with 3 compute nodes, and the second part shows the run with only 1 compute node. Memory consumption is much higher when there are 3 compute nodes.
You can use the following configuration to reproduce the OOM: env.override.toml
Error message/log
No response
To Reproduce
No response
Expected behavior
No response
How did you deploy RisingWave?
No response
The version of RisingWave
nightly-20231227
Additional context
No response