Open · matheusfacure opened 2 years ago
WDYT about using map on the dicts instead?
import numpy as np
import pandas as pd

nrows = 1_000_000
ncols = 20
n_unique_vals = 100

df = pd.DataFrame(
    np.random.randint(0, n_unique_vals, (nrows, ncols)),
    columns=[f'x{i}' for i in range(ncols)],
    dtype=str
)

cols_to_replace = [f'x{i}' for i in range(0, 20, 2)]
vec = {col: {str(i): str(i + 1) for i in range(n_unique_vals - 10)}  # last 10 values were not seen
       for col in cols_to_replace}

# define one of the replacement columns as float
df['x0'] = df['x0'].astype('float')
vec['x0'] = {float(k): float(v) for k, v in vec['x0'].items()}
replace_unseen = -1
def apply_replacements(df, columns, vec, replace_unseen):
    def column_categorizer(col: str):
        return np.select(
            # the original had an `and` here, so I guess it should be `&`
            # (see the illustration after this snippet)
            [df[col].isna() & (df[col].dtype == "float"), ~df[col].isin(vec[col].keys())],
            [np.nan, replace_unseen],
            df[col].replace(vec[col])
        )
    return df.assign(**{col: column_categorizer(col) for col in columns})
%time res1 = apply_replacements(df, cols_to_replace, vec, replace_unseen)
# Wall time: 1min 22s
# proposal
def apply_replacements2(df, columns, vec, replace_unseen):
    def column_categorizer(col: str):
        replaced = df[col].map(vec[col])
        unseen = df[col].notnull() & replaced.isnull()
        replaced[unseen] = replace_unseen
        return replaced
    return df.assign(**{col: column_categorizer(col) for col in columns})
%time res2 = apply_replacements2(df, cols_to_replace, vec, replace_unseen)
# Wall time: 3.93 s
pd.testing.assert_frame_equal(res1, res2)
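As an aside on the `and` to `&` comment in apply_replacements: Python's `and` tries to collapse a whole Series into a single boolean, which raises an error, so elementwise logic on Series needs `&`. A minimal illustration with made-up values, not taken from the benchmark:

import pandas as pd

s = pd.Series([1, 2, 3])
# (s > 1) and (s < 3)  # raises ValueError: the truth value of a Series is ambiguous
(s > 1) & (s < 3)      # elementwise AND -> [False, True, False]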
WOW! What is this magic? How does map work?
The main difference is that replace only changes the values you provide in the dict, whereas map tries to map every value and sets the result to null wherever there isn't a match. In this case that's helpful for us, because it makes the values that didn't match very easy to find.
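A minimal sketch of that difference (toy values, not from the benchmark above):

import pandas as pd

s = pd.Series(['a', 'b', 'c'])
mapping = {'a': 'x'}

s.replace(mapping)  # ['x', 'b', 'c']: unmatched values pass through unchanged
s.map(mapping)      # ['x', NaN, NaN]: unmatched values become null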
Merging #207 (8c5c9b0) into master (3cd7bec) will decrease coverage by 0.39%. The diff coverage is 93.20%.
@@            Coverage Diff             @@
##           master     #207      +/-   ##
==========================================
- Coverage   94.69%   94.29%   -0.40%
==========================================
  Files          25       34       +9
  Lines        1507     2050     +543
  Branches      203      269      +66
==========================================
+ Hits         1427     1933     +506
- Misses         48       80      +32
- Partials       32       37       +5
Impacted Files | Coverage Δ |
---|---|
src/fklearn/causal/validation/cate.py | 0.00% <0.00%> (ø) |
src/fklearn/data/datasets.py | 100.00% <ø> (ø) |
src/fklearn/tuning/parameter_tuners.py | 79.48% <ø> (ø) |
src/fklearn/tuning/selectors.py | 90.47% <ø> (ø) |
src/fklearn/validation/validator.py | 88.88% <71.42%> (-5.40%) :arrow_down: |
src/fklearn/preprocessing/splitting.py | 95.00% <92.59%> (-0.84%) :arrow_down: |
src/fklearn/training/calibration.py | 96.36% <94.73%> (-3.64%) :arrow_down: |
src/fklearn/causal/cate_learning/meta_learners.py | 94.93% <94.93%> (ø) |
src/fklearn/training/transformation.py | 93.97% <95.34%> (+0.04%) :arrow_up: |
src/fklearn/validation/evaluators.py | 93.95% <96.29%> (+4.32%) :arrow_up: |
... and 18 more |
Status
IN DEVELOPMENT
Todo list
Background context
Pandas .apply is incredibly slow to run. Since this function is used in multiple learners, speeding it up should yield tremendous gains in performance. As a side note, we should never introduce new calls to .apply; they are a major source of headaches.
Description of the changes proposed in the pull request
Vectorize apply_replacements
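To make the motivation concrete, here is a minimal sketch of the row-wise pattern being removed versus the vectorized one being introduced (the column name and mapping are hypothetical, not from this PR):

import numpy as np
import pandas as pd

df = pd.DataFrame({'x': np.random.randint(0, 100, 1_000_000).astype(str)})
mapping = {str(i): str(i + 1) for i in range(90)}

# row-wise: one Python-level function call per element
slow = df['x'].apply(lambda v: mapping.get(v, -1))

# vectorized: map everything in a single pass, then fill unseen values
replaced = df['x'].map(mapping)
fast = replaced.mask(df['x'].notnull() & replaced.isnull(), -1)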