Closed by fuyufjh 1 year ago
The 2023-03-22 longevity test (longnxkbkf-20230322-170646) failed in exactly the same way.
The `LIMIT 1` batch query caused the first crash (this issue).
@liurenjie1024 PTAL and feel free to assign to others.
After checking recent failures, this is caused by the batch query, so let's close it for now.
Recurred in today's longevity test.
https://buildkite.com/risingwave-test/longevity-kubebench/builds/274
Every time the batch query `SELECT * FROM nexmark_q14 LIMIT 1` failed, the restart count increased by 1.
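For reference, a minimal sketch of how this check could be reproduced outside the longevity harness: issue the same batch query in a loop and watch for failures while the compute node's restart count is monitored separately. The connection details (localhost:4566, user `root`, database `dev`) and the use of `tokio-postgres` are assumptions on my side, not part of the test setup:

```rust
// Hypothetical repro loop: run the failing batch query repeatedly and log
// errors, so restarts of the compute node can be correlated with failures.
// Requires the `tokio` and `tokio-postgres` crates.
use tokio_postgres::NoTls;

#[tokio::main]
async fn main() -> Result<(), tokio_postgres::Error> {
    let (client, connection) =
        tokio_postgres::connect("host=localhost port=4566 user=root dbname=dev", NoTls).await?;
    // Drive the connection in the background.
    tokio::spawn(async move {
        if let Err(e) = connection.await {
            eprintln!("connection error: {e}");
        }
    });

    for i in 0.. {
        match client.query("SELECT * FROM nexmark_q14 LIMIT 1", &[]).await {
            Ok(rows) => println!("attempt {i}: ok, {} row(s)", rows.len()),
            Err(e) => eprintln!("attempt {i}: query failed: {e}"),
        }
        tokio::time::sleep(std::time::Duration::from_secs(60)).await;
    }
    Ok(())
}
```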
I think the correlation between the crashes and the batch query failures is quite clear:
2023-04-26 12:22:18 CST Failed going for retry 0 out of 3
2023-04-26 12:52:06 CST Failed going for retry 0 out of 3
2023-04-26 13:21:53 CST Failed going for retry 0 out of 3
2023-04-26 13:51:55 CST Failed going for retry 0 out of 3
2023-04-26 14:22:15 CST Failed going for retry 0 out of 3
2023-04-26 14:50:01 CST Failed going for retry 0 out of 3
2023-04-26 15:01:17 CST Failed going for retry 0 out of 3
The meta cache size keeps growing:
While we only allocated ~300 MB to the meta cache (https://rqa-logs.s3.ap-southeast-1.amazonaws.com/longevity/274_logs.txt):
2023-04-26T07:14:13.554425Z INFO risingwave_compute::server: > total_memory: 13.00 GiB
2023-04-26T07:14:13.554428Z INFO risingwave_compute::server: > storage_memory: 3.12 GiB
2023-04-26T07:14:13.554431Z INFO risingwave_compute::server: > block_cache_capacity: 958.00 MiB
2023-04-26T07:14:13.554435Z INFO risingwave_compute::server: > meta_cache_capacity: 319.00 MiB
2023-04-26T07:14:13.554437Z INFO risingwave_compute::server: > shared_buffer_capacity: 1.56 GiB
2023-04-26T07:14:13.554441Z INFO risingwave_compute::server: > file_cache_total_buffer_capacity: 319.00 MiB
2023-04-26T07:14:13.554443Z INFO risingwave_compute::server: > compute_memory: 7.28 GiB
2023-04-26T07:14:13.554445Z INFO risingwave_compute::server: > reserved_memory: 2.60 GiB
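For context, the logged figures are consistent with a simple proportional split of the 13 GiB total; the sketch below just reproduces that arithmetic. The fractions (20% reserved, a 30%/70% storage-vs-compute split, and the per-cache shares) are inferred from this particular log, not read from the source code:

```rust
// Back-of-the-envelope check of the memory split seen in the compute-node log.
// All ratios here are assumptions inferred from the logged values above.
fn main() {
    let total_gib = 13.0_f64;

    let reserved = total_gib * 0.20;        // 2.60 GiB reserved
    let usable = total_gib - reserved;      // 10.40 GiB left to split
    let storage = usable * 0.30;            // 3.12 GiB storage memory
    let compute = usable * 0.70;            // 7.28 GiB compute memory

    // Storage memory is further divided among the caches.
    let block_cache = storage * 0.30;       // ~0.94 GiB (logged: 958 MiB)
    let meta_cache = storage * 0.10;        // ~0.31 GiB (logged: 319 MiB)
    let shared_buffer = storage * 0.50;     // 1.56 GiB
    let file_cache_buffer = storage * 0.10; // ~0.31 GiB

    println!("reserved={reserved:.2} storage={storage:.2} compute={compute:.2}");
    println!("block={block_cache:.2} meta={meta_cache:.2} shared={shared_buffer:.2} file={file_cache_buffer:.2}");
}
```

The point is that the meta cache is budgeted at roughly 10% of storage memory (~319 MiB), yet the metric shows it growing well past that.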
cc @hzxa21
meta_cache_capacity
It is inevitable because they are all in use.
Maybe we should block some operations before allocating memory?
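To make the "block before allocating" idea concrete, here is a minimal sketch of one way to gate batch-query memory behind a shared quota, so a task waits instead of pushing the process over its budget. This is only an illustration of the technique under my own assumptions; the names (`MemoryQuota`, `reserve`) are hypothetical and not RisingWave's actual API:

```rust
// Hypothetical memory quota: batch tasks must reserve bytes before allocating,
// and wait (asynchronously) until enough budget is released by other tasks.
use std::sync::Arc;
use tokio::sync::{OwnedSemaphorePermit, Semaphore};

const BYTES_PER_PERMIT: usize = 1 << 20; // 1 MiB granularity

#[derive(Clone)]
struct MemoryQuota {
    permits: Arc<Semaphore>,
}

impl MemoryQuota {
    fn new(limit_bytes: usize) -> Self {
        Self {
            permits: Arc::new(Semaphore::new(limit_bytes / BYTES_PER_PERMIT)),
        }
    }

    /// Wait until `bytes` can be reserved; the returned guard releases them on drop.
    async fn reserve(&self, bytes: usize) -> OwnedSemaphorePermit {
        let n = (bytes + BYTES_PER_PERMIT - 1) / BYTES_PER_PERMIT;
        self.permits
            .clone()
            .acquire_many_owned(n as u32)
            .await
            .expect("quota semaphore closed")
    }
}

#[tokio::main]
async fn main() {
    let quota = MemoryQuota::new(512 * 1024 * 1024); // e.g. 512 MiB batch budget

    // A batch operator reserves memory before building, say, a hash table.
    let _guard = quota.reserve(64 * 1024 * 1024).await;
    // ... allocate and run; the reservation is released when `_guard` drops.
}
```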
@Little-Wallace has already found a solution. We can wait for his PR.
Any update? cc @soundOfDestiny
FYI, #9517 is merged.
Let's keep this open for a while; we still haven't enabled the `LIMIT 1` query in the longevity test to verify the fix.
Describe the bug
In terms of timing, this issue seems to be related to the final results checking stage, which runs a batch query over the result MV:
You may see this from the attached BuildKite log. Before 11:22, everything worked well; then the result check began, and we got several restarts.
I suspect the batch query caused a dramatic memory spike. Any ideas? By the way, why is "batch query's memory usage" always empty?
Slack thread: https://risingwave-labs.slack.com/archives/C0423G2NUF8/p1679395706998939
To Reproduce
No response
Expected behavior
No response
Additional context
No response