[FEA] Series and DataFrame mean absolute deviation

rapidsai / cudf

cuDF - GPU DataFrame Library

Apache License 2.0


For API compatibility and to support exploratory analysis, cuDF should support mean absolute deviation on Series and DataFrame. See the pandas documentation on mean absolute deviation (Series.mad and DataFrame.mad) for more information.

As a stopgap, this can be implemented in Python as Series.mad, with DataFrame.mad then leveraging _apply_support_method. It's not 10x faster than pandas (about 23 ms vs. 32 ms wall time on 1 million rows in the run below), but it gets the job done.

import cudf
import numpy as np

def mad(self):
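    # Mean absolute deviation: mean of |x - mean(x)|.
    # Note: len(self) counts null rows too, so this assumes no nulls.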
    n = len(self)
    m = self.mean()
    mad = ((self - m).abs() / n).sum()
    return mad

# 1 million rows
s = cudf.Series(np.random.normal(10,5,1_000_000))
ps = s.to_pandas()

%time mp = ps.mad()
%time mg = mad(s)
print(mp)
print(mg)
CPU times: user 31.9 ms, sys: 0 ns, total: 31.9 ms
Wall time: 32 ms
CPU times: user 0 ns, sys: 7.24 ms, total: 7.24 ms
Wall time: 23.2 ms
3.990811998439671
3.990811998439673
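
The snippet above covers a single Series. A minimal sketch of how the same reduction might be applied column by column for a DataFrame is given here. It does not use _apply_support_method (an internal cuDF helper whose signature is not shown in this issue); the names series_mad and frame_mad are hypothetical, and pandas stands in for cudf so the sketch runs on a CPU. It relies only on methods that both libraries expose (mean, abs, columns), so it may work on cudf objects as well, but that is an assumption rather than something verified here. Taking the outer mean instead of dividing by len(s) keeps nulls out of the denominator.

```python
import pandas as pd  # assumption: pandas stands in for cudf so this runs on CPU


def series_mad(s):
    """Mean absolute deviation of a Series: mean(|x - mean(x)|).

    Using .mean() for the outer reduction (rather than dividing by len(s))
    means null entries are skipped in both the numerator and the denominator.
    """
    return (s - s.mean()).abs().mean()


def frame_mad(df):
    """Column-wise mean absolute deviation, returned as a plain dict."""
    return {name: series_mad(df[name]) for name in df.columns}


# Small hand-checkable example; column "b" contains one null.
df = pd.DataFrame({"a": [1.0, 2.0, 3.0, 4.0], "b": [2.0, 2.0, None, 8.0]})
result = frame_mad(df)
print(result)
```

For column a the mean is 2.5 and the absolute deviations average to 1.0; for column b the null is skipped, the mean is 4, and the deviations (2, 2, 4) average to 8/3.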

import cudf
import numpy as np

def mad(self):
    # mad formula
    n = len(self)
    m = self.mean()
    mad = ((self - m).abs() / n).sum()
    return mad

nrows = (1e6, 10e6)

for n in nrows:
    s = cudf.Series(np.random.normal(10, 5, int(n)), dtype="float32")
    ps = s.to_pandas()
    print(f"{int(n):,} rows:")
    %time mp = ps.mad()
    %time mg = mad(s)
    print()

1,000,000 rows:
CPU times: user 22.9 ms, sys: 0 ns, total: 22.9 ms
Wall time: 22.5 ms
CPU times: user 782 µs, sys: 3.24 ms, total: 4.02 ms
Wall time: 3.89 ms

10,000,000 rows:
CPU times: user 213 ms, sys: 100 ms, total: 313 ms
Wall time: 313 ms
CPU times: user 1.23 ms, sys: 4.25 ms, total: 5.49 ms
Wall time: 5.24 ms
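
The timings above use the %time IPython magic, which is only available in a notebook or IPython session. A rough sketch of an equivalent harness for a plain Python script follows. The helper names time_call and mad_numpy are hypothetical, and NumPy is used as a CPU stand-in so the sketch runs without a GPU; substituting the cudf-based mad function and a cudf.Series should be possible, but that is an assumption not tested here. Taking the best of several repeats reduces the influence of one-time warm-up costs on the reported number.

```python
import time

import numpy as np


def mad_numpy(values):
    """CPU reference: mean absolute deviation, mean(|x - mean(x)|)."""
    values = np.asarray(values, dtype="float64")
    return np.abs(values - values.mean()).mean()


def time_call(fn, *args, repeats=3):
    """Return (best wall time in seconds, last result) over `repeats` calls."""
    best = float("inf")
    result = None
    for _ in range(repeats):
        start = time.perf_counter()
        result = fn(*args)
        best = min(best, time.perf_counter() - start)
    return best, result


# Same distribution as the benchmark above: normal(mean=10, std=5), 1 million rows.
rng = np.random.default_rng(0)
data = rng.normal(10, 5, 1_000_000)

elapsed, value = time_call(mad_numpy, data)
print(f"{len(data):,} rows: {elapsed * 1e3:.2f} ms, mad={value:.4f}")
```

For a normal distribution with standard deviation 5, the mean absolute deviation is 5 * sqrt(2 / pi), roughly 3.99, which matches the values printed in the first benchmark.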


[FEA] Series and DataFrame mean absolute deviation #3676