Open beckernick opened 1 year ago
This is a duplicate of https://github.com/rapidsai/cudf/issues/12551 but this issue summarizes the problem well and has useful context / code snippets for validation. I believe the person who posted that diagnosed the reason for the slowness correctly:
> I think it calculates the whole window every time rather than just replacing the first and last elements of the moving window of 400000 elements, causing this delay whereas pandas does not do it that way.
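To make the quoted diagnosis concrete, here is a minimal pure-Python sketch of the two strategies for a rolling sum. The function names are hypothetical and this is not cuDF's or pandas' actual code; it only illustrates why recomputing each window costs O(n * window) while updating a running total costs O(n).

```python
def rolling_sum_naive(values, window):
    """Recompute every full window from scratch: O(n * window) work."""
    out = []
    for end in range(window, len(values) + 1):
        out.append(sum(values[end - window:end]))
    return out


def rolling_sum_incremental(values, window):
    """Keep a running total: add the entering element and
    subtract the leaving one. O(n) work overall."""
    if window > len(values):
        return []
    total = sum(values[:window])
    out = [total]
    for i in range(window, len(values)):
        total += values[i] - values[i - window]
        out.append(total)
    return out


data = [1, 2, 3, 4, 5]
print(rolling_sum_naive(data, 3))
print(rolling_sum_incremental(data, 3))
```

The incremental version is inherently sequential, so it does not map directly onto a GPU, and it only works as written for aggregations that can be "undone" (sum, count, mean), not for min or max.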
We would need to investigate some kind of cooperative groups or consider the use of thrust/cub if appropriate (and not already used). I'm not sure how much of that we can combine with the JIT approaches to custom rolling aggregations.
There was a more recent issue that also covered this problem. I wrote up some thoughts there (#15119).
tl;dr: as @bdice says, we need much smarter algorithms for large windows.
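As one illustration of what a "smarter" algorithm could look like for sums, a fixed-size rolling sum can be written as the difference of two prefix sums. Scans are a standard parallel primitive, so this shape is friendlier to a GPU than a sequential running total. This is a sketch under stated assumptions, not what cuDF does or plans to do; `rolling_sum_prefix` is a hypothetical name, and the approach does not carry over to min/max or arbitrary JIT-compiled aggregations.

```python
from itertools import accumulate


def rolling_sum_prefix(values, window):
    """Rolling sum over full windows via prefix sums.

    prefix[i] holds the sum of values[:i], so the sum of
    values[end - window:end] is prefix[end] - prefix[end - window].
    """
    prefix = [0, *accumulate(values)]
    return [
        prefix[end] - prefix[end - window]
        for end in range(window, len(values) + 1)
    ]


print(rolling_sum_prefix([1, 2, 3, 4, 5], 3))
```

With floating-point data, subtracting large prefix sums can lose precision, so a real implementation would need to account for that.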
Rolling window aggregations are slow with large window sizes. I believe this is known behavior to many contributors, but a user ran into it yesterday and I couldn't find an existing issue or reference documentation in an initial search.
This is more likely to occur with time-oriented window sizes (e.g., "60min" or "9h") on high-frequency data rather than fixed integer length windows (as these windows are less likely to be as long).
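A quick back-of-the-envelope sketch of why time-based windows get so long on high-frequency data. The 1 ms sampling interval is an assumed figure for illustration only, and `rows_per_window` is a hypothetical helper, not a cuDF or pandas function.

```python
from datetime import timedelta


def rows_per_window(window, sample_interval):
    """Approximate number of rows a time-based window covers,
    assuming evenly spaced samples."""
    return window // sample_interval


# Assumption: one row per millisecond.
interval = timedelta(milliseconds=1)
print(rows_per_window(timedelta(minutes=60), interval))
print(rows_per_window(timedelta(hours=9), interval))
```

If every output row recomputes its whole window, the total work scales with the number of rows multiplied by this window length.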
The examples below illustrate this behavior. Observed GPU utilization is at 100% throughout the operations.
Filing an issue to document this performance behavior. If it is already captured elsewhere, I will close this as a duplicate.