snorkel-team / snorkel

A system for quickly generating training data with weak supervision
https://snorkel.org
Apache License 2.0
5.81k stars, 857 forks

PandasParallelLFApplier does not preserve the order of the rows #1524

Closed hcvazquez closed 4 years ago

hcvazquez commented 4 years ago

Issue description

I'm using PandasParallelLFApplier to apply labeling functions to a pandas dataframe with 5000 rows.

Code example/repro steps

Using PandasParallelLFApplier

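from snorkel.labeling.apply.dask import PandasParallelLFApplier

# Apply the LFs to the unlabeled training data
# (lfs is the list of labeling functions, df the input DataFrame)
applier = PandasParallelLFApplier(lfs)
topic_labeling = applier.apply(df[:5000])
topic_labeling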
output: array([
       [-1, -1, -1, -1],
       [-1, -1, -1, -1],
       [-1, -1, -1, -1],
       ...,
       [-1, -1, -1, -1],
       [-1, -1, -1, -1],
       [-1, -1, -1,  1]])

Same code using PandasLFApplier

output: array([
       [-1, -1, -1, -1],
       [-1,  1, -1, -1],
       [-1, -1, -1, -1],
       ...,
       [-1, -1, -1, -1],
       [-1, -1, -1, -1],
       [-1, -1, -1,  1]])

The second row differs between the two outputs.

Expected behavior

I would expect the same result from both appliers. Labeling coverage and overlaps are the same for both, so the difference has to be in the order of the rows.
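One way to check the "same rows, different order" hypothesis directly is to compare the two label matrices as multisets of rows. The sketch below is only an illustration: the helper name `same_rows_any_order` and the small matrices are made up for this example and are not part of Snorkel. For NumPy arrays such as the applier output, passing `L.tolist()` should work the same way.

```python
from collections import Counter


def same_rows_any_order(a, b):
    """Return True if a and b contain the same rows, ignoring row order."""
    # Each row becomes a hashable tuple; Counter keeps track of duplicates.
    return Counter(map(tuple, a)) == Counter(map(tuple, b))


# Toy label matrices (hypothetical data, 4 rows x 4 labeling functions).
L_parallel = [
    [-1, -1, -1, -1],
    [-1, -1, -1, -1],
    [-1, -1, -1, 1],
    [-1, 1, -1, -1],
]
L_sequential = [
    [-1, -1, -1, -1],
    [-1, 1, -1, -1],
    [-1, -1, -1, -1],
    [-1, -1, -1, 1],
]

# The matrices differ position by position...
differ_in_place = L_parallel != L_sequential
# ...but hold exactly the same rows once order is ignored.
same_multiset = same_rows_any_order(L_parallel, L_sequential)
```

If `differ_in_place` and `same_multiset` are both true, the outputs are consistent with a pure reordering of rows, which would also explain identical coverage and overlap statistics.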

System info

henryre commented 4 years ago

Hi @hcvazquez, great question. This is due to index sorting, and it isn't reflected well in the docs right now (but it's on our list to update). This was discussed on the Spectrum thread here: https://spectrum.chat/snorkel/help/how-to-use-the-pandasparallelapplier~cf50f563-28e6-418c-93a3-337384566c13
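To illustrate the index-sorting effect without running Snorkel, here is a minimal sketch. It assumes the parallel applier returns rows in ascending index order; `apply_in_index_order` is a hypothetical stand-in that mimics that behavior and is not Snorkel's implementation, and the labeling function and data are invented for the example. Under that assumption, resetting the index before applying should keep the output aligned with the original row order.

```python
import numpy as np
import pandas as pd


def lf_contains_spam(row):
    # Hypothetical labeling function: 1 if "spam" appears, else -1 (abstain).
    return 1 if "spam" in row.text else -1


def apply_in_row_order(df):
    # Stand-in for a sequential applier: labels rows in their current order.
    return np.array([[lf_contains_spam(row)] for row in df.itertuples()])


def apply_in_index_order(df):
    # Stand-in for an applier that sorts rows by index before labeling
    # (an assumption for illustration, not Snorkel's actual code).
    return apply_in_row_order(df.sort_index())


# A DataFrame whose index is not sorted, e.g. after shuffling or sampling.
df = pd.DataFrame({"text": ["ham", "spam here", "eggs"]}, index=[2, 0, 1])

L_seq = apply_in_row_order(df)    # rows in positional order
L_par = apply_in_index_order(df)  # rows in index order: 0, 1, 2
outputs_differ = not np.array_equal(L_seq, L_par)

# Workaround sketch: give the frame a fresh 0..n-1 index first, so that
# sorting by index leaves the positional order unchanged.
df_reset = df.reset_index(drop=True)
L_par_reset = apply_in_index_order(df_reset)
outputs_match_after_reset = np.array_equal(L_seq, L_par_reset)
```

Calling `df.sort_index()` before applying either applier is an alternative, as long as any gold labels or downstream data are reordered the same way.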

henryre commented 4 years ago

Closing for now, feel free to re-open!