beckernick opened this issue 4 years ago
Based on the following small benchmark, this can probably be done fast enough in Python.
```python
import cudf
import numpy as np

def mad(self):
    # Mean absolute deviation: mean of |x - mean(x)|
    n = len(self)
    m = self.mean()
    mad = ((self - m).abs() / n).sum()
    return mad

nrows = (1e6, 10e6)
for n in nrows:
    s = cudf.Series(np.random.normal(10, 5, int(n)), dtype="float32")
    ps = s.to_pandas()
    print(f"{int(n):,} rows:")
    %time mp = ps.mad()  # pandas (CPU)
    %time mg = mad(s)    # cudf (GPU)
    print()
```
```
1,000,000 rows:
CPU times: user 22.9 ms, sys: 0 ns, total: 22.9 ms
Wall time: 22.5 ms
CPU times: user 782 µs, sys: 3.24 ms, total: 4.02 ms
Wall time: 3.89 ms

10,000,000 rows:
CPU times: user 213 ms, sys: 100 ms, total: 313 ms
Wall time: 313 ms
CPU times: user 1.23 ms, sys: 4.25 ms, total: 5.49 ms
Wall time: 5.24 ms
```
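For reference, the same quantity can be written more compactly as the mean of absolute deviations, and checked against pandas on the host. This is just an illustrative sketch using the public cudf/pandas APIs (it assumes a pandas version where `Series.mad` still exists):

```python
import cudf
import numpy as np

s = cudf.Series(np.random.normal(10, 5, 1_000_000), dtype="float32")

# Equivalent, more idiomatic formulation: mean of |x - mean(x)|
mad_compact = (s - s.mean()).abs().mean()

# Cross-check against pandas' Series.mad on the host (pandas < 2.0);
# a loose tolerance accounts for float32 summation-order differences
assert np.isclose(mad_compact, s.to_pandas().mad(), rtol=1e-3)
```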
For API compatibility and to support exploratory analysis, we should support mean absolute deviation on `Series` and `DataFrame`. See the pandas mean absolute deviation documentation for more information.
As a stopgap, this can be implemented in Python as `Series.mad`, and then leverage `_apply_support_method` for `DataFrame.mad`. It's not 10x faster, but it gets the job done well.
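A minimal sketch of what that stopgap could look like. The monkey-patching here is only to illustrate where the methods would live, and the `_apply_support_method` call signature is an assumption, not the actual internal API:

```python
import cudf

def mad(self, skipna=True):
    """Mean absolute deviation of the values."""
    col = self.dropna() if skipna else self
    return (col - col.mean()).abs().mean()

# Illustrative wiring only; in cudf proper this would be a method
# defined on the Series class itself
cudf.Series.mad = mad

def df_mad(self, skipna=True):
    # Dispatch the named reduction across columns via the existing
    # _apply_support_method pattern (signature assumed here)
    return self._apply_support_method("mad", skipna=skipna)

cudf.DataFrame.mad = df_mad
```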