Open · bymzy opened 5 months ago
I am using a window function with OVER (PARTITION BY ...). Any suggestions?
It can be bottlenecked on CPU or memory resources.
Is the CPU fully utilized, and what is the memory usage?
Are there many cache misses? How about the remote I/O rate and its bandwidth?
The CPU and memory are not fully utilized. Should I adjust the parallelism or something? @lmatz
Is it possible to generate a snapshot of all the dashboards on Grafana?
By default, the "streaming_parallelism" is set to the total number of CPUs. Wonder if you changed this session variable before?
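For reference, here is a minimal sketch of checking and overriding this session variable (the value 8 is only a placeholder; the new value is assumed to apply to streaming jobs created after it is set):

```sql
-- Inspect the current setting; by default it corresponds to using all available CPU cores.
SHOW streaming_parallelism;

-- Override it for this session; 8 is a placeholder value, and it is assumed to
-- affect only streaming jobs (e.g. materialized views) created after this point.
SET streaming_parallelism = 8;
```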
```sql
LAG(md.data) OVER (PARTITION BY md.meterId ORDER BY md.datatime) AS prev_power,
LAG(md.data, 2) OVER (PARTITION BY md.meterId ORDER BY md.datatime) AS prev_power_2,
LAG(md.data, 3) OVER (PARTITION BY md.meterId ORDER BY md.datatime) AS prev_power_3,
```
Well, I found something about the window. Does this mean that the window size is unbounded? @lmatz
If it is, then that may be the key point.
I believe this is the default:

> The default framing option is RANGE UNBOUNDED PRECEDING, which is the same as RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW. With ORDER BY, this sets the frame to be all rows from the partition start up through the current row's last ORDER BY peer.
https://www.postgresql.org/docs/current/sql-expressions.html#SYNTAX-WINDOW-FUNCTIONS
cc @stdrc, wonder if RW's semantics of over window functions follow Postgres completely
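To make that default concrete, here is a small illustrative sketch (the table and column names `t`, `k`, `ts`, `v` are made up, not from this thread): with `ORDER BY` but no explicit frame, an aggregate used as a window function behaves as if the default frame were spelled out.

```sql
-- The two expressions below are equivalent under the default framing described above.
SELECT
    SUM(v) OVER (PARTITION BY k ORDER BY ts) AS running_total_default,
    SUM(v) OVER (PARTITION BY k ORDER BY ts
                 RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS running_total_explicit
FROM t;
```

Note that `lag`/`lead` take a fixed row offset relative to the current row and are not affected by the frame, which is relevant to the answer below.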
> Does this mean that the window size is unbounded?
No. `lag`s with constant `offset`s are not "unbounded".
However, by default we do cache all entries of a partition when the partition key is cached. So basically there are two levels of cache here: first, `partition key -> partition` (which contains the partition entries cache); second, inside the partition entries cache, `order key + pk -> row`. By default, the second level contains all entries of the cached partitions.
Two things worth trying here:

1. Change the over window cache policy:
   `set rw_streaming_over_window_cache_policy = recent;` -- or `recent_first_n` or `recent_last_n`, depending on your workload pattern.
2. Create a base MV with `ORDER BY <partition_key>, <order_key>`, and create the desired MV on top of that, so that the rows fed into the OverWindow operator are ordered and it can achieve better performance (a sketch follows below).
Hopefully these methods can help reduce the jaggies.
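As a rough illustration of the second suggestion, here is a minimal sketch assuming a source or table named `meter_data` with the columns used in the query above (`md_sorted` and `power_lag` are made-up names):

```sql
-- Base MV ordered by the partition key, then the order key,
-- so that downstream rows arrive already sorted.
CREATE MATERIALIZED VIEW md_sorted AS
SELECT meterId, datatime, data
FROM meter_data
ORDER BY meterId, datatime;

-- The desired over-window MV, built on top of the ordered base MV.
CREATE MATERIALIZED VIEW power_lag AS
SELECT
    meterId,
    datatime,
    LAG(data)    OVER (PARTITION BY meterId ORDER BY datatime) AS prev_power,
    LAG(data, 2) OVER (PARTITION BY meterId ORDER BY datatime) AS prev_power_2,
    LAG(data, 3) OVER (PARTITION BY meterId ORDER BY datatime) AS prev_power_3
FROM md_sorted;
```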
rw-main risingwave_storage::hummock::event_handler::hummock_event_handler: cannot acquire lock for all read version pending_count=1 total_count=192
I found many log lines like this. What does this mean? Also, I found that MinIO gets quite busy, but RisingWave does not. Is this an I/O problem?
@stdrc
Not quite familiar with storage components. @Li0k @hzxa21, does this imply something?
No. This log is expected and is not relevant to the zero source throughput issue. I guess this log was added because we want to measure how frequently the contention happens in our implementation. Should we make this a debug log? cc @wenym1
Can you also share the "Barrier Sync Latency" and "Barrier Inflight Latency" panels? Also, can you follow the instructions here and share the await tree dump?
I suspect the reason why the source throughput drops to 0 is that there is backpressure somewhere. When there is severe backpressure in RisingWave, the source will stop pulling data from upstream.
Sure. Backpressure is quite high, I guess. @hzxa21
@hzxa21 Here is the await tree dump:
And fragment 22 looks like this:
More information: I found that the executor cache memory repeatedly goes up and down, and every time it goes down, it seems the executor writes data to MinIO.
Does this mean that the executor is waiting for something, or maybe for some timer? In my case, is the executor (OverWindow) waiting for input data? But there are so many messages not yet consumed in Kafka, which is weird. @hzxa21
Have you tried to set `rw_streaming_over_window_cache_policy = recent_last_n`?
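In case it helps, here is a minimal sketch of applying that setting (assuming, like other session variables, it only affects streaming jobs created afterwards in the same session):

```sql
-- Switch the over window cache policy for this session.
SET rw_streaming_over_window_cache_policy = recent_last_n;

-- Verify the current value (assumed to be readable like other session variables).
SHOW rw_streaming_over_window_cache_policy;

-- Then (re)create the over-window materialized view in this session so that
-- the OverWindow operator is built with the new cache policy.
```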