snorkel-team / snorkel

A system for quickly generating training data with weak supervision
https://snorkel.org
Apache License 2.0

Support optimized PandasTFApplier #1557

Closed dchichkov closed 4 years ago

dchichkov commented 4 years ago

Issue description

PandasTFApplier is slow to the point of being unusable: applying the transformation function below to a 100k-row Pandas DataFrame takes about 56 seconds.

It also doesn't allow adding new fields...

Code example/repro steps

import numpy as np
import pandas as pd
from snorkel.augmentation import ApplyOnePolicy, PandasTFApplier, transformation_function

@transformation_function()
def add_area(x):
    x.area = (x.top - x.bottom) * (x.right - x.left)
    return x

df = pd.DataFrame(np.random.randn(100000, 4), columns=['left', 'top', 'right', 'bottom'])

tf_applier = PandasTFApplier([add_area], ApplyOnePolicy(n_per_original=1, keep_original=False))
tf_applier.apply(df)
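
A possible contributing factor to the "new fields" problem, assuming each row reaches the transformation function as a pandas Series: attribute assignment with a name that is not already a label (as in `x.area = ...`) sets a plain Python attribute and does not add an element to the Series, whereas item assignment does. The sketch below uses only pandas and a hand-made row; it does not exercise Snorkel itself, and the variable names are illustrative.

```python
import pandas as pd

# A single row, shaped like one record of the DataFrame in the repro.
row = pd.Series({"left": 0.0, "top": 3.0, "right": 2.0, "bottom": 1.0})

# Attribute assignment: "area" is not an existing label, so pandas stores it
# as an ordinary Python attribute and the Series data is left unchanged.
by_attribute = row.copy()
by_attribute.area = (by_attribute.top - by_attribute.bottom) * (
    by_attribute.right - by_attribute.left
)

# Item assignment: a new label "area" is appended to the Series.
by_item = row.copy()
by_item["area"] = (by_item["top"] - by_item["bottom"]) * (
    by_item["right"] - by_item["left"]
)

print("area" in by_attribute.index)  # False
print("area" in by_item.index)       # True
```

Whether switching to `x["area"] = ...` is enough for the new column to survive `PandasTFApplier.apply` is not verified here.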

Expected behavior

Processing 100k rows should take about 1 second rather than 56 seconds, i.e. roughly 10x slower than plain Pandas instead of 560x slower.

%time df['area'] = [(row.top - row.bottom) * (row.right - row.left) for row in df.itertuples()]
df
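
For comparison, the same column can also be computed without a Python-level loop. The sketch below checks that a vectorized expression gives the same values as the row-wise `itertuples` expression above, on a randomly generated frame of the same shape. The fixed seed and variable names are illustrative assumptions, and no timings are claimed.

```python
import numpy as np
import pandas as pd

# Same shape as the frame in the repro; the seed is only for reproducibility.
rng = np.random.default_rng(0)
df = pd.DataFrame(
    rng.standard_normal((100_000, 4)),
    columns=["left", "top", "right", "bottom"],
)

# Row-wise baseline, as in the issue.
area_rowwise = [
    (row.top - row.bottom) * (row.right - row.left) for row in df.itertuples()
]

# Vectorized alternative: arithmetic on whole columns at once.
df["area"] = (df["top"] - df["bottom"]) * (df["right"] - df["left"])

# Both approaches should agree element by element.
print(np.allclose(df["area"], area_rowwise))
```

This only works when the transformation can be written as column arithmetic; an arbitrary per-row function still needs a row-wise path.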

vincentschen commented 4 years ago

@dchichkov — thanks for raising this! You're right, our current implementation is a bit slow, likely because we're supporting a more general case where transformations may be 1:many operations (as opposed to simply 1:1).
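
To make the 1:many versus 1:1 distinction concrete, here is a minimal sketch of the two shapes. It is not Snorkel's implementation: `apply_one_to_many`, `apply_one_to_one`, and the dict-based `add_area` are hypothetical helpers written for illustration, and the sketch shows only the structural difference, not the actual source of the overhead.

```python
import pandas as pd


def add_area(row: dict) -> dict:
    """Toy 1:1 transformation: return a new record with an extra 'area' field."""
    out = dict(row)
    out["area"] = (row["top"] - row["bottom"]) * (row["right"] - row["left"])
    return out


def apply_one_to_many(df: pd.DataFrame, tfs, n_per_original: int) -> pd.DataFrame:
    """General case: each input row can yield several output rows."""
    records = []
    for row in df.to_dict("records"):
        for _ in range(n_per_original):
            transformed = row
            for tf in tfs:  # apply the whole sequence of transformations
                transformed = tf(transformed)
            records.append(transformed)
    return pd.DataFrame(records)


def apply_one_to_one(df: pd.DataFrame, tf) -> pd.DataFrame:
    """Special case: exactly one output row per input row, so the index is kept."""
    records = [tf(row) for row in df.to_dict("records")]
    return pd.DataFrame(records, index=df.index)


boxes = pd.DataFrame(
    {"left": [0.0, 1.0], "top": [3.0, 5.0], "right": [2.0, 4.0], "bottom": [1.0, 2.0]}
)
many = apply_one_to_many(boxes, [add_area], n_per_original=2)
one = apply_one_to_one(boxes, add_area)
print(len(many), len(one))  # 4 2
```

In the 1:1 case the output can be aligned with the input index directly, which is what would make a dedicated fast path possible.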

We could definitely make some improvements here, so I'm flagging this as an issue. Feel free to open a PR and contribute a fix yourself as well!

dchichkov commented 4 years ago

Thanks! I also see a very similar issue with slicing functions. A slicing function like:

@slicing_function()
def real_object(x):
    """Returns whether the object is a real object, not a reflection, shadow or depiction"""
    return x.Reflection == 'false' and x.Shadow == 'false' and x.Depiction == 'false'

takes 50 milliseconds to apply with pandas and 5 seconds (100 times slower) with Snorkel.
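
For reference, the predicate above can be expressed in plain pandas either row by row or as a vectorized boolean mask. The sketch below assumes, as the slicing function does, that the three columns hold the strings 'true' and 'false'. The three-row frame is made up for illustration and says nothing about the timings quoted above.

```python
import pandas as pd

# Hypothetical sample data with string-valued flags, as in the slicing function.
df = pd.DataFrame(
    {
        "Reflection": ["false", "true", "false"],
        "Shadow": ["false", "false", "false"],
        "Depiction": ["false", "false", "true"],
    }
)

# Row-wise: the same predicate as real_object, applied to each row.
rowwise = df.apply(
    lambda x: x.Reflection == "false"
    and x.Shadow == "false"
    and x.Depiction == "false",
    axis=1,
)

# Vectorized: compare whole columns and combine the masks with &.
mask = (
    (df["Reflection"] == "false")
    & (df["Shadow"] == "false")
    & (df["Depiction"] == "false")
)

print(mask.tolist())  # [True, False, False]
```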

github-actions[bot] commented 4 years ago

This issue is stale because it has been open 90 days with no activity. Remove stale label or comment or this will be closed in 7 days.