fuyufjh commented 1 year ago

Describe the bug

In terms of timing, this issue seems to be related to the final results checking stage, which runs a batch query over the result MV:

Running command SELECT * FROM nexmark_q0 LIMIT 1

You may see this from the attached BuildKite log. Before 11:22, everything worked well; then the result check began, and we got several restarts.

I suspect the batch query caused a dramatic memory spike. Any ideas? By the way, why is the “batch query’s memory usage” always empty?

Slack thread: https://risingwave-labs.slack.com/archives/C0423G2NUF8/p1679395706998939

To Reproduce

No response

Expected behavior

No response

Additional context

No response

fuyufjh commented 1 year ago

2023-03-22's longevity test (longnxkbkf-20230322-170646) failed exactly the same way.

Problem #1: The `limit 1` batch query caused the first crash (this issue)
Problem #2: Then it went into a crash loop (#8693)

fuyufjh commented 1 year ago

@liurenjie1024 PTAL and feel free to assign to others.

liurenjie1024 commented 1 year ago

After checking recent failures, it's caused by batch query, so let's close it first.

fuyufjh commented 1 year ago

Recurred in today's longevity test.

https://buildkite.com/risingwave-test/longevity-kubebench/builds/274

Every time the batch query (Running command SELECT * FROM nexmark_q14 LIMIT 1) failed, the restart count increased by 1.

fuyufjh commented 1 year ago

I think the relation between crash and batch query failure is quite clear:

2023-04-26 12:22:18 CST Failed going for retry 0 out of 3
2023-04-26 12:52:06 CST Failed going for retry 0 out of 3
2023-04-26 13:21:53 CST Failed going for retry 0 out of 3
2023-04-26 13:51:55 CST Failed going for retry 0 out of 3
2023-04-26 14:22:15 CST Failed going for retry 0 out of 3
2023-04-26 14:50:01 CST Failed going for retry 0 out of 3
2023-04-26 15:01:17 CST Failed going for retry 0 out of 3
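To make the timing pattern easier to see, here is a small sketch that computes the gaps between the failure timestamps listed above. The timestamps are copied from the log lines; the timezone label is dropped because only the differences matter. The helper name `gaps_minutes` is made up for illustration.

```python
from datetime import datetime

# Failure timestamps copied from the log lines above (all in the same timezone).
FAILURES = [
    "2023-04-26 12:22:18",
    "2023-04-26 12:52:06",
    "2023-04-26 13:21:53",
    "2023-04-26 13:51:55",
    "2023-04-26 14:22:15",
    "2023-04-26 14:50:01",
    "2023-04-26 15:01:17",
]


def gaps_minutes(stamps):
    """Return the gap in minutes between each pair of consecutive timestamps."""
    times = [datetime.strptime(s, "%Y-%m-%d %H:%M:%S") for s in stamps]
    return [(b - a).total_seconds() / 60 for a, b in zip(times, times[1:])]


if __name__ == "__main__":
    for stamp, gap in zip(FAILURES[1:], gaps_minutes(FAILURES)):
        print(f"{stamp}: {gap:.1f} min after the previous failure")
```

The first five gaps come out close to 30 minutes, which would be consistent with a periodic result check, although the check interval itself is not stated in this thread.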

liurenjie1024 commented 1 year ago

The meta cache size keeps growing:

https://g-2927a1b4d9.grafana-workspace.us-east-1.amazonaws.com/d/EpkBw5W4k/risingwave-test-dashboard?orgId=1&var-namespace=longnxkbkf-20230425-140801&from=1682482800000&to=1682483040000&editPanel=92

While we only allocated 300M bytes to meta cache: https://rqa-logs.s3.ap-southeast-1.amazonaws.com/longevity/274_logs.txt

2023-04-26T07:14:13.554425Z INFO risingwave_compute::server: > total_memory: 13.00 GiB
2023-04-26T07:14:13.554428Z INFO risingwave_compute::server: > storage_memory: 3.12 GiB
2023-04-26T07:14:13.554431Z INFO risingwave_compute::server: > block_cache_capacity: 958.00 MiB
2023-04-26T07:14:13.554435Z INFO risingwave_compute::server: > meta_cache_capacity: 319.00 MiB
2023-04-26T07:14:13.554437Z INFO risingwave_compute::server: > shared_buffer_capacity: 1.56 GiB
2023-04-26T07:14:13.554441Z INFO risingwave_compute::server: > file_cache_total_buffer_capacity: 319.00 MiB
2023-04-26T07:14:13.554443Z INFO risingwave_compute::server: > compute_memory: 7.28 GiB
2023-04-26T07:14:13.554445Z INFO risingwave_compute::server: > reserved_memory: 2.60 GiB
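As a sanity check on the numbers in the startup log above, the sketch below adds up the logged values. It only shows that the four storage components sum to roughly `storage_memory`, and that storage, compute, and reserved memory sum to `total_memory`. That the budget is actually derived this way is an assumption read off the arithmetic, not something confirmed in this thread. The function names are made up for illustration.

```python
# Values copied from the compute node startup log above.
GIB = 1024  # MiB per GiB

# Storage components, in MiB.
STORAGE_PARTS_MIB = {
    "block_cache_capacity": 958.00,
    "meta_cache_capacity": 319.00,
    "shared_buffer_capacity": 1.56 * GIB,
    "file_cache_total_buffer_capacity": 319.00,
}

# Top-level budget, in GiB.
TOP_LEVEL_GIB = {
    "storage_memory": 3.12,
    "compute_memory": 7.28,
    "reserved_memory": 2.60,
}

TOTAL_MEMORY_GIB = 13.00


def storage_total_gib():
    """Sum of the four storage components, converted to GiB."""
    return sum(STORAGE_PARTS_MIB.values()) / GIB


def node_total_gib():
    """Sum of storage, compute, and reserved memory in GiB."""
    return sum(TOP_LEVEL_GIB.values())


def meta_cache_share():
    """Fraction of total node memory configured for the meta cache."""
    return STORAGE_PARTS_MIB["meta_cache_capacity"] / (TOTAL_MEMORY_GIB * GIB)


if __name__ == "__main__":
    print(f"storage components: {storage_total_gib():.2f} GiB")
    print(f"top-level sum:      {node_total_gib():.2f} GiB")
    print(f"meta cache share:   {meta_cache_share():.1%}")
```

By this arithmetic the meta cache is configured at a little over 2% of node memory, so a meta cache that keeps growing past its capacity eats into memory budgeted for something else.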

cc @hzxa21

soundOfDestiny commented 1 year ago

> meta_cache_capacity

It is inevitable because they are all in use.

liurenjie1024 commented 1 year ago

> It is inevitable because they are all in use.

Maybe we should block some operations before allocating memory?
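To illustrate what "block before allocating" could look like, here is a minimal sketch of a byte quota whose `acquire` waits until enough capacity has been released. The class `MemoryQuota` and its methods are hypothetical; this is not RisingWave's implementation and says nothing about what the eventual fix does.

```python
import threading
from typing import Optional


class MemoryQuota:
    """Hypothetical byte quota: acquire() blocks until enough bytes are free."""

    def __init__(self, capacity: int) -> None:
        self._capacity = capacity
        self._used = 0
        self._cond = threading.Condition()

    def acquire(self, nbytes: int, timeout: Optional[float] = None) -> bool:
        """Reserve nbytes, waiting up to `timeout` seconds; return False on timeout."""
        if nbytes > self._capacity:
            raise ValueError("request exceeds total quota")
        with self._cond:
            ok = self._cond.wait_for(
                lambda: self._used + nbytes <= self._capacity, timeout
            )
            if ok:
                self._used += nbytes
            return ok

    def release(self, nbytes: int) -> None:
        """Return nbytes to the quota and wake any waiting callers."""
        with self._cond:
            self._used -= nbytes
            self._cond.notify_all()

    @property
    def used(self) -> int:
        with self._cond:
            return self._used
```

With a quota like this, a reader that needs more metadata than the cache can currently hold would wait (or time out) instead of pushing memory usage past the configured capacity.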

soundOfDestiny commented 1 year ago

> > It is inevitable because they are all in use.
>
> Maybe we should block some operations before allocating memory?

@Little-Wallace has already found a solution. We can wait for his PR.

liurenjie1024 commented 1 year ago

Any update? cc @soundOfDestiny

soundOfDestiny commented 1 year ago

> Any update? cc @soundOfDestiny

#9517

soundOfDestiny commented 1 year ago

FYI, #9517 is merged.

liurenjie1024 commented 1 year ago

Let's keep this open for a while. We haven't yet re-enabled `limit 1` in the longevity test to verify the fix.

liurenjie1024 commented 1 year ago

Verified in https://buildkite.com/risingwave-test/longevity-kubebench/builds/350.

risingwavelabs / risingwave

bug: `limit 1` batch query caused OOM #8721
