snorkel-team / snorkel

A system for quickly generating training data with weak supervision
https://snorkel.org
Apache License 2.0
5.79k stars 859 forks source link

Restore original index with PandasParallelLFApplier #1589

Open henryre opened 4 years ago

henryre commented 4 years ago

Describe the solution you'd like

Using PandasParallelLFApplier includes an index sort on the original DataFrame, which can result in unexpected row order if the index is not sorted when passed in. This should be documented, or the original index order should be restored.

Discussion: https://spectrum.chat/snorkel/help/how-to-use-the-pandasparallelapplier~cf50f563-28e6-418c-93a3-337384566c13

Additional context

Related issues: #1587 #1581 #1524

bendeaton commented 4 years ago

+1 to this. We just spent several cycles tracking down this issue in a model that uses Snorkel.

github-actions[bot] commented 4 years ago

This issue is stale because it has been open 90 days with no activity. Remove stale label or comment or this will be closed in 7 days.

BenjaminFraser commented 3 years ago

Not sure on the exact solution you were looking for on this, but one possibility would be to add an additional default keyword arg to PandasParallelLFApplier on whether to index sort or not.

For example, at line 97 within snorkel/snorkel/labeling/apply/dask.py, dd.from_pandas() is called:

df = dd.from_pandas(df, npartitions=n_parallel)

If you want the index to remain unsorted, and prevent the problem highlighted in this issue, we can simply call dd.from_pandas() like so:

df = dd.from_pandas(df, npartitions=n_parallel, sort=False)

We might not always want this to be false however, since dask makes this by default for performance purposes and for obtaining meaningful divisions (Ref: https://github.com/dask/dask/issues/1428).

Therefore, it could be worth letting users call PandasParallelLFApplier with sort=True / False, as required. Perhaps it could be made clear in the documentation that this sorting occurs by default, and if users do not want this to happen, they should provide sort=False.

Just a suggestion anyway!