Open shwina opened 8 months ago
For 1, here's pandas implementation for finding the window bounds (defined in terms of start/end indices instead of the end - start difference (?)) https://github.com/pandas-dev/pandas/blob/4ed67ac9ef3d9fde6fb8441bc9ea33c0d786649e/pandas/_libs/window/indexers.pyx#L107. pandas uses a sliding window algorithm for this case
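For reference, the sliding-window idea behind that pandas code can be sketched in a few lines of host-side Python. This is a simplified illustration only (the function name is made up; pandas' real `indexers.pyx` also handles window closed-ness, descending order, and `min_periods`): because the start pointer only ever moves forward, the whole pass is O(n) amortized.

```python
import numpy as np

def variable_window_bounds(values, offset):
    """For each row i, start[i] is the first j with values[j] >= values[i] - offset.

    Assumes `values` (e.g. integer timestamps) is sorted ascending.
    The start pointer never moves backwards, so the loop is O(n) overall.
    """
    n = len(values)
    start = np.empty(n, dtype=np.int64)
    end = np.arange(1, n + 1, dtype=np.int64)  # each window ends at the current row (inclusive)
    s = 0
    for i in range(n):
        while values[s] < values[i] - offset:
            s += 1
        start[i] = s
    return start, end

t = np.array([0, 1, 2, 10, 11, 20])
start, end = variable_window_bounds(t, 2)
# window sizes are end - start: [1, 2, 3, 1, 2, 1]
```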
For 2, I suppose you could define the non-grouped case as the 1 large group so that `grouped_range_rolling_window` could be used

> For 2, I suppose you could define the non-grouped case as the 1 large group so that `grouped_range_rolling_window` could be used
I did experiment with this and it is still somewhat slow. Perhaps libcudf is serializing within groups.
This is not due to the computation of the rolling window sizes, but rather that the libcudf rolling window computation is slow, I think. Here's an example to show that:
```python
import numpy as np

import cudf
import cudf._lib as libcudf

dt = cudf.date_range("2001-01-01", "2002-01-01", freq="1s")
df = cudf.DataFrame({"x": np.random.rand(len(dt))}, index=dt)

max_window_size = 86400
# window sizes ramp up to max_window_size, then hold constant
pre_window = cudf.core.column.as_column(
    np.concatenate(
        [
            np.arange(1, max_window_size, dtype="int32"),
            np.full(len(df) - (max_window_size - 1), max_window_size, dtype="int32"),
        ]
    )
)
follow_window = cudf.core.column.full(len(df), 0, dtype="int32")

source_column = df["x"]._column
window = None
min_periods = 1
center = False
op = "sum"
agg_params = {}

result = libcudf.rolling.rolling(
    source_column, pre_window, follow_window, window, min_periods, center, op, agg_params
)
```
The call to `rolling` takes 10 seconds for me. In this example, the runtime is linear in the size of the windows (change `max_window_size` to see).
I think that scaling kind of makes sense: irrespective of the window size, one produces the same number of output rows, but each window is O(window_size) large, so the window-by-window approach implemented here scales in the same way.
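To make the scaling concrete, here is a sketch of that window-by-window evaluation (hypothetical helper, NumPy on host; window `[i - backward_size[i], i + forward_size[i]]` inclusive). Each output row re-reads its whole window, so total work is O(sum of window sizes), i.e. roughly O(n * window_size):

```python
import numpy as np

def rolling_sum_naive(x, backward_size, forward_size):
    # each output row re-reads its entire window
    out = np.empty(len(x))
    for i in range(len(x)):
        out[i] = x[max(0, i - backward_size[i]) : i + forward_size[i] + 1].sum()
    return out

x = np.array([1.0, 2.0, 3.0, 4.0])
bwd = np.minimum(np.arange(4), 1)  # one preceding row, clipped at the start
fwd = np.zeros(4, dtype=int)
rolling_sum_naive(x, bwd, fwd)  # array([1., 3., 5., 7.])
```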
I was wondering if there's some kind of fourier-space approach that one might use, but the potential for non-equispaced samples complicates things (there are non-uniform FFT methods but they are non-exact). And my brain is not sufficiently in gear.
In any case, it feels like this should be able to run faster than it does, and I wonder if it can do so by a combination of change in parallelisation strategy and/or clever algorithmic changes.
cc @mythrocks
> I was wondering if there's some kind of fourier-space approach that one might use, but the potential for non-equispaced samples complicates things (there are non-uniform FFT methods but they are non-exact). And my brain is not sufficiently in gear.
No FFTs needed I think, this should be solvable in $\mathcal{O}(n)$ time for an $n$-row column via a summed-area table approach (AKA, in 1D, a prefix scan) for rolling operations whose aggregation op has an inverse.
This would be a two-pass algorithm I think; let's take `sum` as the example op.

Pass 1: compute `scan(+, column) -> scan_column`

Pass 2: for each row `i`, the result is `scan_column[i + forward_size[i]] - scan_column[i - backward_size[i]]`
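A host-side NumPy sketch of the two-pass idea (illustrative conventions only: here `backward_size[i]` counts rows strictly before row `i`, and the scan is exclusive with a leading zero so row 0 needs no special-casing):

```python
import numpy as np

def rolling_sum_via_scan(x, backward_size, forward_size):
    """Rolling sum over the inclusive window [i - backward_size[i], i + forward_size[i]].

    Pass 1 is a single prefix scan; Pass 2 is one gather + subtract per row,
    so total work is O(n) regardless of window size.
    """
    # exclusive prefix sums, padded so that scan[j] == sum(x[:j])
    scan = np.concatenate(([0.0], np.cumsum(x)))
    i = np.arange(len(x))
    return scan[i + forward_size + 1] - scan[i - backward_size]

x = np.array([1.0, 2.0, 3.0, 4.0])
bwd = np.minimum(np.arange(4), 1)  # one preceding row, clipped at the start
fwd = np.zeros(4, dtype=int)
rolling_sum_via_scan(x, bwd, fwd)  # array([1., 3., 5., 7.])
```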
For things like variance and covariance, one needs to use some suitable adaptation of Welford's online approach. Some relevant recent papers:
Edit: one would have to worry (more) about overflow than with the naive approach.
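For illustration, the scalar form of Welford's recurrence for mean and variance (the windowed/parallel adaptations build on this same update):

```python
def welford_variance(xs):
    """One-pass mean and population variance via Welford's recurrence.

    Numerically stabler than the naive sum-of-squares formula because it
    never forms a large intermediate sum of x**2.
    """
    n, mean, m2 = 0, 0.0, 0.0
    for x in xs:
        n += 1
        delta = x - mean
        mean += delta / n
        m2 += delta * (x - mean)  # accumulates sum of squared deviations
    return mean, m2 / n

welford_variance([1.0, 2.0, 3.0, 4.0])  # (2.5, 1.25)
```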
There are almost certainly two inefficiencies at play here, then: computing the window sizes given an offset is slower than we'd like, and the rolling window aggregation implementation is slower than we'd like.
> computing the window sizes given an offset is slower than we'd like
I didn't really manage to measure that part as a noticeable problem, but maybe I was doing something different.
In your benchmark you're constructing the inputs to the libcudf `rolling` function by hand. But going through the public API takes you down a code path that uses a numba kernel to do that.
> In your benchmark you're constructing the inputs to the libcudf `rolling` function by hand. But going through the public API takes you down a code path that uses a numba kernel to do that.
Ah sorry, yes, now I see it. I was tricked by the lack of synchronisation in the numba kernel launch.
Yes, that kernel has exactly the same problem the rolling window kernel does. Each row linearly searches backwards in the column until the difference between the preceding entry and the current one is larger than the requested offset.
I think you can do this by doing a reverse prefix scan of the differences between the entries in the to-be-windowed column, and then ... (brain out of gear again)
cc @harrism as local scan expert.
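As a middle ground between the per-row linear search and a full scan-based approach, each row's window start can be found with an independent binary search (O(log w) per row instead of O(w)). A host-side NumPy sketch, assuming the order-by values are sorted ascending; `np.searchsorted` plays the role a vectorised `thrust::lower_bound` would on device, and the helper name is made up:

```python
import numpy as np

def backward_window_sizes(times, offset):
    # first index whose value is >= times[i] - offset, found by binary search
    start = np.searchsorted(times, times - offset, side="left")
    # window i spans rows [start[i], i], so its size is i - start[i] + 1
    return np.arange(len(times)) - start + 1

t = np.array([0, 1, 2, 10, 11, 20])
backward_window_sizes(t, 2)  # array([1, 2, 3, 1, 2, 1])
```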
Just to note one further thing while it is in my thoughts, the (potential) downside to doing a full-column scan to implement this is, in addition to overflow, numerical roundoff if using floating point types[^1].
[^1]: The Chmielowiec paper linked above provides bounds on the number of bits required to compute mean and variance if data are represented in fixed point. If the data are normally distributed with small(ish) variance, these bounds are not too bad, but if they are heavy-tailed they overflow 64 bits.
The general term of art here is range queries. If the binary operator induces a group structure on the data then these can be done as suggested above (via prefix scans). If it only induces a semigroup structure (no inverse), for example rolling-min, then one needs to build more sophisticated data structures (A.C. Yao, Space-time tradeoff for answering range queries (1982), https://doi.org/10.1145/800070.802185), but queries can be answered in at worst $\Theta(cn)$ space and $\mathcal{O}(\alpha_c(n))$ time, where $\alpha_c$ is the inverse Ackermann function (so effectively constant time for any feasible $n$).
Since range minimum queries pop up a lot in geospatial analysis, I wonder if the cuspatial team implemented them.
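For what it's worth, the standard compromise short of the Yao construction is a sparse table: $\mathcal{O}(n \log n)$ precompute, then any range-min is the min of two overlapping power-of-two blocks (idempotency of min makes the overlap harmless), giving $\mathcal{O}(1)$ per query. A host-side sketch with made-up helper names:

```python
import numpy as np

def build_sparse_table(x):
    # table[k][i] = min of x[i : i + 2**k]
    table = [np.asarray(x, dtype=float)]
    k = 1
    while (1 << k) <= len(x):
        prev, half = table[-1], 1 << (k - 1)
        # a block of length 2**k is the min of its two halves of length 2**(k-1)
        table.append(np.minimum(prev[:-half], prev[half:]))
        k += 1
    return table

def range_min(table, lo, hi):
    """Min of x[lo:hi] (hi exclusive) in O(1): two overlapping blocks cover the range."""
    k = (hi - lo).bit_length() - 1
    return min(table[k][lo], table[k][hi - (1 << k)])

x = [3, 1, 4, 1, 5, 9, 2, 6]
table = build_sparse_table(x)
range_min(table, 2, 7)  # min of [4, 1, 5, 9, 2] -> 1.0
```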
> Note that running the code through a profiler will show execution time being spent in the next CUDA kernel (`column.full`) - but that's a red herring I think, because there's no synchronization after the numba kernel call.
I don't think this should be true. I think Nsight systems can distinguish the execution time of kernels even without synchronisation. If you time manually in host code then you need to synchronize to time accurately.
Oh sorry, I meant a host profiler (in this case the Python profiler `cProfile`).
Regarding the numba kernel to find the windows, a first step is usually to move this to C++. When I search the codebase for this it only finds the definition, no calls. However the non-numba version is called, here: https://github.com/rapidsai/cudf/blob/d158ccdbe651952bd649cb0f17c41467c5209824/python/cudf/cudf/core/window/rolling.py#L483
So is this numba kernel actually being used?
Next question: is the `arr` data passed to it random, or does it happen to be ordered? If the latter, then this could be replaced by a call to `thrust::lower_bound` in C++ (with fancy iterators).
> Regarding the numba kernel to find the windows, a first step is usually to move this to C++. When I search the codebase for this it only finds the definition, no calls. However the non-numba version is called, here:
>
> So is this numba kernel actually being used?
>
> Next question: is the `arr` data passed to it random, or does it happen to be ordered? If the latter, then this could be replaced by a call to `thrust::lower_bound` in C++ (with fancy iterators).
The non-numba version calls the numba version: https://github.com/rapidsai/cudf/blob/6f6e521257dce5732eea7b6b9d56243f8b0a69cc/python/cudf/cudf/utils/cudautils.py#L32
I am not sure if the rolling window API allows non-sorted arrays but I would have thought not, so `arr` is probably (someone else to confirm) required to be sorted in ascending order.
Apologies for being late to this party.
> The general term of art here is range queries ...
>
> ... I am not sure if the rolling window API allows non-sorted arrays but I would have thought not, so `arr` is probably (someone else to confirm) required to be sorted in ascending order.
Please pardon my inexperience with the Pandas side of rolling window. (I've done a little bit of work on the Apache Spark end of this problem.)
If `arr` refers to the order-by column in range queries, then you're partly right: the column needs to be sorted. But the ordering may be ascending or descending.
Looking at `grouped_range_rolling_window` might illuminate the matter: the function takes the `cudf::order` corresponding to the column. The window ranges are calculated depending on the direction of ordering.
With large windows, the `.rolling()` function in cuDF can be pathologically slow.

**Why is it slow?**

Of the 10s of execution time above, about 8s is spent in computing the window sizes, which is done in a hand-rolled numba CUDA kernel: https://github.com/rapidsai/cudf/blob/6f6e521257dce5732eea7b6b9d56243f8b0a69cc/python/cudf/cudf/utils/cudautils.py#L17. Note that running the code through a profiler will show execution time being spent in the next CUDA kernel (`column.full`) - but that's a red herring I think, because there's no synchronization after the numba kernel call.

**What can we do about it?**

I see a couple of options here:

- Make it `libcudf`'s responsibility to compute the window sizes. I believe they already do window size computation in the context of grouped rolling window aggregations: see `grouped_range_rolling_window()`.