snorkel-team / snorkel

A system for quickly generating training data with weak supervision
https://snorkel.org
Apache License 2.0
5.81k stars 857 forks

Different results and accuracy down to 10% with PandasParallelLFApplier vs PandasLFApplier in Snorkel 0.9.5 #1587

Closed durgeshiitj closed 4 years ago

durgeshiitj commented 4 years ago

Issue description

I ran Snorkel (v0.9.5) on a dataset using PandasParallelLFApplier and, to my surprise, got 10% accuracy where I was expecting 90%. I then tried PandasLFApplier to cross-check and got 90% accuracy. When I compared the label matrices from the two appliers, they were not equal.

Before this I was using 0.9.3 and never faced this problem. To cross-check, I ran the same dataset on a different system with version 0.9.3, using both PandasParallelLFApplier and PandasLFApplier, and found that in 0.9.3 both yield the same label matrix, the same accuracy, and the same LFAnalysis.

Expected behavior

Both LFAppliers should yield similar results.
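For anyone reproducing this, one way to check the expectation is to compare the two label matrices directly and also test whether they contain the same rows in a different order. The following is a minimal sketch using only NumPy; the function name `compare_label_matrices` and the toy matrices are made up for illustration and are not part of Snorkel.

```python
import numpy as np


def compare_label_matrices(L_a, L_b):
    """Summarize how two 2-D label matrices (rows = data points, cols = LFs) differ."""
    L_a, L_b = np.asarray(L_a), np.asarray(L_b)
    if L_a.shape != L_b.shape:
        return {"same_shape": False, "identical": False,
                "same_rows_any_order": False, "n_differing_rows": None}

    # Rows where at least one labeling function output differs.
    differing = np.any(L_a != L_b, axis=1)

    # Sort the rows of each matrix lexicographically so row order no longer matters.
    sorted_a = L_a[np.lexsort(L_a.T[::-1])]
    sorted_b = L_b[np.lexsort(L_b.T[::-1])]

    return {
        "same_shape": True,
        "identical": not bool(differing.any()),
        "same_rows_any_order": bool(np.array_equal(sorted_a, sorted_b)),
        "n_differing_rows": int(differing.sum()),
    }


# Toy data: L_par holds the same rows as L_seq, but in a different order.
L_seq = np.array([[0, -1], [1, 1], [-1, 0]])
L_par = L_seq[[2, 0, 1]]
report = compare_label_matrices(L_seq, L_par)
print(report)
```

If `identical` is false but `same_rows_any_order` is true, the appliers agree on the labels and differ only in row order, which would also explain a drop in accuracy when the gold labels are still in the original order.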

Screenshots

I'm attaching screenshots for your reference.

V 0.9.5 Analysis:

PandasLFApplier: nonp095

PandasParallelLFApplier: paralle095

Label-Matrix Comparison: npequals095

V 0.9.3 Analysis:

PandasLFApplier: pandasLfApp

PandasParallelLFApplier: parallel

Label-Matrix Comparison: noeqals093

System info

Additional context

Please look into this asap.

henryre commented 4 years ago

Hi @durgeshiitj, apologies for the delayed response here! This is likely due to using an unsorted index with PandasParallelLFApplier. I've opened up #1589, but in the meantime you can either use the standard PandasLFApplier or sort your index before using PandasParallelLFApplier, so that the rows of L come out in the expected order.
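A rough sketch of the sort-the-index workaround, assuming the cause really is the unsorted index as suggested above. The helper names `with_sorted_index` and `apply_lfs_both_ways` are hypothetical; the Snorkel imports are done lazily inside the second function so the first helper runs with pandas alone.

```python
import pandas as pd


def with_sorted_index(df):
    """Return df as-is if its index is already sorted, otherwise a sorted copy."""
    if df.index.is_monotonic_increasing:
        return df
    return df.sort_index()


def apply_lfs_both_ways(df, lfs, n_parallel=2):
    """Apply the same LFs with both appliers to the same sorted DataFrame."""
    # Lazy imports: only needed when this function is actually called.
    from snorkel.labeling import PandasLFApplier
    from snorkel.labeling.apply.dask import PandasParallelLFApplier

    df_sorted = with_sorted_index(df)
    L_seq = PandasLFApplier(lfs=lfs).apply(df_sorted)
    L_par = PandasParallelLFApplier(lfs=lfs).apply(df_sorted, n_parallel=n_parallel)
    return df_sorted, L_seq, L_par


# Toy DataFrame with an unsorted index (e.g. after shuffling or filtering).
df = pd.DataFrame({"text": ["c", "a", "b"]}, index=[2, 0, 1])
df_sorted = with_sorted_index(df)
print(list(df_sorted.index))
print(list(df_sorted["text"]))
```

Any gold labels used to compute accuracy should be taken from `df_sorted` (or reordered the same way) so that they stay aligned with the rows of L.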

durgeshiitj commented 4 years ago

Hi Henry, thanks for following up. I also tried debugging on my end. I found that on the system where Snorkel 0.9.5 is installed the Dask version was 2.14.2, while on the system where 0.9.3 is installed the Dask version was 2.5.2. So I downgraded Dask to 2.5.2 alongside Snorkel 0.9.5, and to my surprise PandasParallelLFApplier worked normally there. Please check that as well, because the requirements specify Dask <3, so 2.14 should not have caused any issue.

henryre commented 4 years ago

Hi @durgeshiitj, thanks for reporting and we'll look into version compatibility on our side!

durgeshiitj commented 4 years ago

I haven't received any update on this issue.

github-actions[bot] commented 4 years ago

This issue is stale because it has been open 90 days with no activity. Remove stale label or comment or this will be closed in 7 days.