nubank / fklearn

fklearn: Functional Machine Learning
Apache License 2.0

Vectorize apply_replacement #207

Open matheusfacure opened 2 years ago

matheusfacure commented 2 years ago

Status

IN DEVELOPMENT

Todo list

Background context

Pandas .apply is incredibly slow to run. Since this function is used in multiple learners, speeding it up should yield tremendous gains in performance. As a side note, we should never introduce new calls to .apply. They are a major source of headaches.
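To make the cost concrete, here is a minimal sketch comparing a per-element .apply lookup with a bulk Series.map lookup. It is not the fklearn implementation: the helper names with_apply and with_map, the toy data, and the -1 default are all assumptions chosen for illustration, and the sketch ignores missing input values.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
s = pd.Series(rng.integers(0, 100, size=100_000))
lookup = {i: i + 1 for i in range(90)}  # values 90..99 are "unseen"


def with_apply(series, mapping, default):
    # Hypothetical helper: one Python-level function call per element.
    return series.apply(lambda x: mapping.get(x, default))


def with_map(series, mapping, default):
    # Hypothetical helper: pandas does the lookup in bulk; misses come
    # back as NaN, which we then fill with the default value.
    return series.map(mapping).fillna(default).astype(series.dtype)


slow = with_apply(s, lookup, -1)
fast = with_map(s, lookup, -1)

# Both approaches should produce the same result on this data.
pd.testing.assert_series_equal(slow, fast)
```

In a notebook, timing both helpers with %timeit on your own data is the simplest way to check how large the difference is in practice.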

Description of the changes proposed in the pull request

Vectorize apply_replacements

jmoralez commented 2 years ago

WDYT about using map on the dicts instead?

import numpy as np
import pandas as pd

nrows = 1_000_000
ncols = 20
n_unique_vals = 100
df = pd.DataFrame(
    np.random.randint(0, n_unique_vals, (nrows, ncols)),
    columns=[f'x{i}' for i in range(ncols)],
    dtype=str
)
cols_to_replace = [f'x{i}' for i in range(0, 20, 2)]
vec = {col: {str(i): str(i + 1) for i in range(n_unique_vals - 10)}  # last 10 values were not seen
       for col in cols_to_replace}  
# define one of the replacement columns as float
df['x0'] = df['x0'].astype('float')
vec['x0'] = {float(k): float(v) for k, v in vec['x0'].items()}
replace_unseen = -1

def apply_replacements(df, columns, vec, replace_unseen):
    def column_categorizer(col: str):
        return np.select(
            # the original had an and here so I guess it should be &
            [df[col].isna() & (df[col].dtype == "float"), ~df[col].isin(vec[col].keys())],
            [np.nan, replace_unseen],
            df[col].replace(vec[col])
        )
    return df.assign(**{col: column_categorizer(col) for col in columns})
%time res1 = apply_replacements(df, cols_to_replace, vec, replace_unseen)
# Wall time: 1min 22s

# proposal
def apply_replacements2(df, columns, vec, replace_unseen):
    def column_categorizer(col: str):
        replaced = df[col].map(vec[col])
        unseen = df[col].notnull() & replaced.isnull()
        replaced[unseen] = replace_unseen
        return replaced
    return df.assign(**{col: column_categorizer(col) for col in columns})
%time res2 = apply_replacements2(df, cols_to_replace, vec, replace_unseen)
# Wall time: 3.93 s

pd.testing.assert_frame_equal(res1, res2)

matheusfacure commented 2 years ago

WOW! What is this magic? How does map work?

jmoralez commented 2 years ago

The main difference is that replace only changes the values you provide in the dict and leaves everything else untouched, whereas map looks up every value and sets it to null when there isn't a match. I think that is helpful for us here, because the values that didn't match are very easy to find: they are the entries that were non-null before the map and null after it.
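A small sketch of that difference on a toy Series (the data and variable names are made up for illustration):

```python
import pandas as pd

s = pd.Series(["a", "b", "z", None])
mapping = {"a": "A", "b": "B"}

# replace: only keys present in the dict change; "z" is left as-is.
replaced = s.replace(mapping)

# map: every value is looked up; "z" has no match and becomes null.
mapped = s.map(mapping)

# Unseen values are those that were present before but null after map.
# The original missing entry is excluded because it was already null.
unseen = s.notnull() & mapped.isnull()
```

Here only the "z" entry is flagged as unseen, so it can be overwritten with a sentinel such as replace_unseen while genuinely missing inputs stay missing.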

codecov-commenter commented 2 years ago

Codecov Report

Merging #207 (8c5c9b0) into master (3cd7bec) will decrease coverage by 0.39%. The diff coverage is 93.20%.

@@            Coverage Diff             @@
##           master     #207      +/-   ##
==========================================
- Coverage   94.69%   94.29%   -0.40%     
==========================================
  Files          25       34       +9     
  Lines        1507     2050     +543     
  Branches      203      269      +66     
==========================================
+ Hits         1427     1933     +506     
- Misses         48       80      +32     
- Partials       32       37       +5     
Impacted Files                                       Coverage Δ
src/fklearn/causal/validation/cate.py                0.00% <0.00%> (ø)
src/fklearn/data/datasets.py                         100.00% <ø> (ø)
src/fklearn/tuning/parameter_tuners.py               79.48% <ø> (ø)
src/fklearn/tuning/selectors.py                      90.47% <ø> (ø)
src/fklearn/validation/validator.py                  88.88% <71.42%> (-5.40%) :arrow_down:
src/fklearn/preprocessing/splitting.py               95.00% <92.59%> (-0.84%) :arrow_down:
src/fklearn/training/calibration.py                  96.36% <94.73%> (-3.64%) :arrow_down:
src/fklearn/causal/cate_learning/meta_learners.py    94.93% <94.93%> (ø)
src/fklearn/training/transformation.py               93.97% <95.34%> (+0.04%) :arrow_up:
src/fklearn/validation/evaluators.py                 93.95% <96.29%> (+4.32%) :arrow_up:
... and 18 more
