rapidsai / cudf

cuDF - GPU DataFrame Library
https://docs.rapids.ai/api/cudf/stable/
Apache License 2.0

[FEA] Faster rolling window aggregations with large window sizes #12774

Open beckernick opened 1 year ago

beckernick commented 1 year ago

Rolling window aggregations are slow with large window sizes. I believe this behavior is known to many contributors, but a user ran into it yesterday and I couldn't find an issue or reference documentation in an initial search.

This is more likely to occur with time-based window sizes (e.g., "60min" or "9h") on high-frequency data than with fixed integer-length windows, since time-based windows can span far more rows.
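For context, a quick back-of-the-envelope calculation (my own arithmetic, not part of the original report) shows how many rows each time-based window spans at the 1-second frequency used in the benchmark below:

```python
# At 1-second frequency, a time-based window covers one row per second,
# so the windows benchmarked below translate to these row counts:
window_rows = {"1min": 60, "60min": 3600, "3h": 3 * 3600,
               "12h": 12 * 3600, "1d": 24 * 3600}
for w, rows in window_rows.items():
    print(f"{w:>5} -> {rows:>6} rows")
```

A "1d" window at this frequency covers 86,400 rows, orders of magnitude larger than a typical fixed integer window.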

The examples below illustrate this behavior. Observed GPU utilization is 100% throughout the operations.

import cudf

df = cudf.datasets.timeseries(
    start='2000-01-01',
    end='2000-06-30',
    freq='1s',
)
pdf = df.to_pandas()
print(df.shape)
print(df.head())
(15638401, 4)
                       id     name         x         y
timestamp                                             
2000-01-01 00:00:00  1069    Edith -0.208702  0.451685
2000-01-01 00:00:01  1053   Yvonne -0.383893 -0.846287
2000-01-01 00:00:02   986    Sarah -0.718822 -0.980082
2000-01-01 00:00:03   999  Norbert -0.547608 -0.291836
2000-01-01 00:00:04   999   George -0.534662  0.300049
windows = ["1min", "60min", "3h", "12h", "1d"]

print("cuDF Rolling Windows")
for w in windows:
    %time out = df.rolling(w).x.max()

print("\n"*3)
print("Pandas Rolling Windows")
for w in windows:
    %time out = pdf.rolling(w).x.max()
cuDF Rolling Windows
CPU times: user 8.25 ms, sys: 20.2 ms, total: 28.5 ms
Wall time: 49.2 ms
CPU times: user 400 ms, sys: 8 ms, total: 408 ms
Wall time: 437 ms
CPU times: user 1.08 s, sys: 150 µs, total: 1.08 s
Wall time: 1.09 s
CPU times: user 4.3 s, sys: 7.07 ms, total: 4.31 s
Wall time: 4.36 s
CPU times: user 8.69 s, sys: 11.4 ms, total: 8.7 s
Wall time: 8.76 s

Pandas Rolling Windows
CPU times: user 516 ms, sys: 44.1 ms, total: 560 ms
Wall time: 558 ms
CPU times: user 511 ms, sys: 40 ms, total: 551 ms
Wall time: 550 ms
CPU times: user 522 ms, sys: 28.1 ms, total: 550 ms
Wall time: 549 ms
CPU times: user 494 ms, sys: 48 ms, total: 542 ms
Wall time: 541 ms
CPU times: user 499 ms, sys: 44 ms, total: 544 ms
Wall time: 542 ms
windows = [10, 100, 1000, 10000, 100000]

print("cuDF Rolling Windows")
for w in windows:
    %time out = df.rolling(w).x.max()

print("\n"*2)
print("Pandas Rolling Windows")
for w in windows:
    %time out = pdf.rolling(w).x.max()
cuDF Rolling Windows
CPU times: user 4.14 ms, sys: 7 µs, total: 4.15 ms
Wall time: 3.01 ms
CPU times: user 7.92 ms, sys: 4 ms, total: 11.9 ms
Wall time: 11.7 ms
CPU times: user 91.6 ms, sys: 3.8 ms, total: 95.4 ms
Wall time: 104 ms
CPU times: user 726 ms, sys: 58 µs, total: 726 ms
Wall time: 726 ms
CPU times: user 6.76 s, sys: 3.61 ms, total: 6.77 s
Wall time: 6.79 s

Pandas Rolling Windows
CPU times: user 430 ms, sys: 80.1 ms, total: 510 ms
Wall time: 519 ms
CPU times: user 425 ms, sys: 76.1 ms, total: 502 ms
Wall time: 500 ms
CPU times: user 435 ms, sys: 64.2 ms, total: 499 ms
Wall time: 498 ms
CPU times: user 416 ms, sys: 79.7 ms, total: 495 ms
Wall time: 494 ms
CPU times: user 420 ms, sys: 72.1 ms, total: 492 ms
Wall time: 491 ms
conda list | grep cudf
cudf                      23.02.00        cuda_11_py38_g5ad4a85b9d_0    rapidsai
cudf_kafka                23.02.00        py38_g5ad4a85b9d_0    rapidsai
dask-cudf                 23.02.00        cuda_11_py38_g5ad4a85b9d_0    rapidsai
libcudf                   23.02.00        cuda11_g5ad4a85b9d_0    rapidsai
libcudf_kafka             23.02.00          g5ad4a85b9d_0    rapidsai

Filing an issue to document this performance behavior. If it's already captured elsewhere, I'll close this as a duplicate.

bdice commented 1 year ago

This is a duplicate of https://github.com/rapidsai/cudf/issues/12551, but this issue summarizes the problem well and has useful context and code snippets for validation. I believe the person who posted that issue diagnosed the cause of the slowness correctly:

I think it calculates the whole window every time rather than just replacing the first and last elements of the moving window of 400000 elements, causing this delay whereas pandas does not do it that way.

We would need to investigate some kind of cooperative-groups approach, or consider using Thrust/CUB if appropriate (and not already used). I'm not sure how much of that could be combined with the JIT approach to custom rolling aggregations.
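To illustrate the quoted diagnosis: if each output element rescans its whole window, the work is O(n·w), which is exactly the linear-in-window-size scaling visible in the timings above. On the CPU, the standard way to avoid the rescan for a trailing max is a monotonic deque, which does O(n) total work regardless of window size. A minimal sketch (a textbook technique for contrast, not cuDF's or pandas' actual implementation; partial windows at the start return the max of the elements seen so far):

```python
from collections import deque

def sliding_max(values, window):
    """O(n) trailing-window maximum via a monotonic deque.

    The deque holds indices whose values are strictly decreasing, so the
    front is always the max of the current window. Each index is pushed
    and popped at most once, instead of rescanning all `window` elements
    per output position.
    """
    dq = deque()
    out = []
    for i, v in enumerate(values):
        while dq and values[dq[-1]] <= v:  # drop elements dominated by v
            dq.pop()
        dq.append(i)
        if dq[0] <= i - window:            # front slid out of the window
            dq.popleft()
        out.append(values[dq[0]])
    return out

print(sliding_max([3, 1, 4, 1, 5, 9, 2, 6], 3))
# [3, 3, 4, 4, 5, 9, 9, 9]
```

The catch is that this formulation is inherently sequential, which is why a GPU implementation needs a different decomposition rather than a direct port.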

wence- commented 3 months ago

There was a more recent issue that also covered this problem. I wrote up some thoughts there (#15119).

tl;dr: as @bdice says, we need much smarter algorithms for large windows.
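One window-size-independent formulation that does map well to parallel hardware (offered here as a hedged sketch of the kind of "smarter algorithm" meant, not a description of any planned cuDF implementation) is the van Herk / Gil-Werman decomposition: split the array into blocks of length w, compute per-block prefix and suffix maxima, and note that every length-w window spans at most two blocks, so its max is the max of one suffix value and one prefix value. Total work is O(n) independent of w, and the scans are per-block and embarrassingly parallel. A NumPy sketch for full trailing windows:

```python
import numpy as np

def rolling_max_vhgw(values, w):
    """Trailing rolling max via the van Herk / Gil-Werman decomposition.

    prefix[i] = max of values from the start of i's block through i;
    suffix[i] = max of values from i through the end of i's block.
    A window [i-w+1, i] crosses at most one block boundary, so its max
    is max(suffix[i-w+1], prefix[i]). Only full windows are returned.
    """
    n = len(values)
    pad = (-n) % w                       # pad so length is a multiple of w
    a = np.concatenate([values, np.full(pad, -np.inf)])
    blocks = a.reshape(-1, w)
    prefix = np.maximum.accumulate(blocks, axis=1).ravel()
    suffix = np.maximum.accumulate(blocks[:, ::-1], axis=1)[:, ::-1].ravel()
    ends = np.arange(w - 1, n)           # indices where a full window ends
    return np.maximum(suffix[ends - w + 1], prefix[ends])

print(rolling_max_vhgw(np.array([3.0, 1, 4, 1, 5, 9, 2, 6]), 3))
# [4. 4. 5. 9. 9. 9.]
```

The same decomposition extends to any associative aggregation with an identity (min, sum, etc.), which is what makes it a plausible fit for a GPU rolling kernel.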