rapidsai / cudf

cuDF - GPU DataFrame Library
https://docs.rapids.ai/api/cudf/stable/
Apache License 2.0

[FEA] Series and DataFrame mean absolute deviation #3676

Open beckernick opened 4 years ago

beckernick commented 4 years ago

For API compatibility and to support exploratory analysis, we should support Series and DataFrame mean absolute deviation. See the pandas documentation for mean absolute deviation (Series.mad and DataFrame.mad) for more information.

As a stopgap, this can be implemented in Python as Series.mad, with DataFrame.mad then leveraging _apply_support_method to apply it to each column. In the benchmark below it is not 10x faster than pandas (23.2 ms vs. 32 ms wall time on 1 million rows), but it gets the job done.

import cudf
import numpy as np

def mad(self):
    # Mean absolute deviation: the mean of |x - mean(x)|
    n = len(self)
    m = self.mean()
    mad = ((self - m).abs() / n).sum()
    return mad

# 1 million rows
s = cudf.Series(np.random.normal(10, 5, 1_000_000))
ps = s.to_pandas()

# pandas (CPU) first, then the cuDF (GPU) version
%time mp = ps.mad()
%time mg = mad(s)
print(mp)
print(mg)

CPU times: user 31.9 ms, sys: 0 ns, total: 31.9 ms
Wall time: 32 ms
CPU times: user 0 ns, sys: 7.24 ms, total: 7.24 ms
Wall time: 23.2 ms
3.990811998439671
3.990811998439673

beckernick commented 2 years ago

Based on the following small benchmark, this can probably be done fast enough in Python.

import cudf
import numpy as np

def mad(self):
    # Mean absolute deviation: the mean of |x - mean(x)|
    n = len(self)
    m = self.mean()
    mad = ((self - m).abs() / n).sum()
    return mad

# Compare pandas (CPU) and cuDF (GPU) at 1 million and 10 million rows
nrows = (1e6, 10e6)
for n in nrows:
    s = cudf.Series(np.random.normal(10, 5, int(n)), dtype="float32")
    ps = s.to_pandas()
    print(f"{int(n):,} rows:")
    %time mp = ps.mad()
    %time mg = mad(s)
    print()

1,000,000 rows:
CPU times: user 22.9 ms, sys: 0 ns, total: 22.9 ms
Wall time: 22.5 ms
CPU times: user 782 µs, sys: 3.24 ms, total: 4.02 ms
Wall time: 3.89 ms

10,000,000 rows:
CPU times: user 213 ms, sys: 100 ms, total: 313 ms
Wall time: 313 ms
CPU times: user 1.23 ms, sys: 4.25 ms, total: 5.49 ms
Wall time: 5.24 ms
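
The first comment proposes building DataFrame.mad on top of a Series-level mad. A minimal sketch of that column-wise idea follows, using pandas as a CPU stand-in so it runs without a GPU. The names series_mad and dataframe_mad are hypothetical and are not part of cuDF or pandas, and the dictionary comprehension only approximates what an internal helper such as _apply_support_method would do. Unlike the mad function above, which divides by len(self), this sketch averages over non-null values only, which is an assumption about the desired null handling.

```python
import numpy as np
import pandas as pd


def series_mad(s):
    """Mean absolute deviation of one column: mean(|x - mean(x)|).

    Hypothetical helper, not a cuDF or pandas API.
    """
    # .mean() skips nulls by default, so both the centre and the final
    # average are computed over the non-null values only.
    return (s - s.mean()).abs().mean()


def dataframe_mad(df):
    """Apply series_mad to every numeric column and collect the results."""
    numeric = df.select_dtypes(include="number")
    return pd.Series(
        {name: series_mad(numeric[name]) for name in numeric.columns}
    )


df = pd.DataFrame(
    {
        "a": [1.0, 2.0, 3.0, 4.0],
        "b": [10.0, np.nan, 30.0, 50.0],  # one null to show it is skipped
        "label": ["w", "x", "y", "z"],    # non-numeric, ignored
    }
)
result = dataframe_mad(df)
print(result)
```

Because series_mad relies only on subtraction, abs() and mean(), the same expression should in principle work on a cudf.Series as well, though that is not verified here.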