snorkel-team / snorkel

A system for quickly generating training data with weak supervision
https://snorkel.org
Apache License 2.0

labelmodel.fit on a superset of data changes predictions of subset #1581

Closed srimugunthan closed 4 years ago

srimugunthan commented 4 years ago

Issue description

We have a dataset in which each record has either one label or multiple labels. To verify the label model's predictions, we filtered the original data down to the records with only one label. Running labelmodel.fit on this single-labelled data gave an accuracy of more than 90%.

But when we ran labelmodel.fit on the whole dataset, the accuracy on the single-labelled data points dropped drastically to 30%.
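The comparison being described can be sketched with plain numpy: measure accuracy on the same fixed single-label rows under two different sets of predictions. The arrays below (`y_true`, `single_mask`, the two `preds_*`) are made-up stand-ins, not Snorkel output.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical gold labels, plus a boolean mask marking the
# single-labelled subset of the full dataset.
y_true = rng.integers(0, 2, size=100)
single_mask = rng.random(100) < 0.6

# Hypothetical predictions from two fits: one on the subset only,
# one on the full (superset) data.
preds_subset_fit = y_true.copy()                     # pretend the subset fit is perfect
preds_full_fit = y_true ^ (rng.random(100) < 0.5)    # degraded after the superset fit

def subset_accuracy(preds):
    # Accuracy restricted to the single-labelled rows only.
    return (preds[single_mask] == y_true[single_mask]).mean()

print(subset_accuracy(preds_subset_fit))  # high
print(subset_accuracy(preds_full_fit))    # much lower on the same rows
```

The issue's claim is that the second number should not collapse merely because extra (multi-label) rows were present during fitting.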

Code example/repro steps

I was able to reproduce the bug with a generated label matrix: https://github.com/srimugunthan/snorkeldebugging/blob/master/snorkeldebug.ipynb. Although the accuracy drop on the generated data is not as drastic, it illustrates the scenario.

Expected behavior

The subset of data with single labels should have the same accuracy regardless of whether the model is fit on the subset alone or on the whole dataset.

System info

Used Snorkel 0.9.3 on Linux.

srimugunthan commented 4 years ago

Hi, in the original example, in which the drop was from 90% to 30%, I found an issue in the code: it happens only when I use PandasParallelLFApplier to get the label matrix. With PandasLFApplier it is fine.

I checked the matrices generated from PandasLFApplier and PandasParallelLFApplier, and they were different. Below is the code from the notebook which I used to check.

df_full = pd.concat([df_single, df_multilabel])
df_full.index.is_unique  # True

lm1 = applier.apply(df=df_full)
lm2 = applier_regex.apply(df=df_full, n_parallel=8)

np.array_equal(lm1, lm2)  # False

Is there anything I am missing?

ajratner commented 4 years ago

Hi @srimugunthan, thanks for surfacing this! At the moment, the master branch version of Snorkel is not configured to support multi-label, though we've certainly applied Snorkel in that setting (e.g. https://www.snorkel.org/blog/superglue / the multi-task formulation). So I'm not surprised there are some issues here. Perhaps, since Snorkel's label model is expecting a single label, it's just taking e.g. the last one per data point, but this order is getting shuffled when applied in parallel?
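The order-shuffling hypothesis can be illustrated with plain pandas. This is a rough sketch of a chunked applier, not Snorkel's actual implementation: split the frame, apply per chunk, then reassemble by index. When the original index is unique but unsorted (as after the `pd.concat` above), reassembly changes the row order relative to a serial pass.

```python
import numpy as np
import pandas as pd

# Hypothetical frame whose index is unique but unsorted, e.g. the
# result of pd.concat([df_single, df_multilabel]) in this thread.
df = pd.DataFrame({"x": [10, 20, 30, 40]}, index=[3, 1, 2, 0])

def apply_serial(frame):
    # Stand-in for a serial applier: label-matrix rows follow the
    # frame's row order.
    return frame["x"].to_numpy().reshape(-1, 1)

def apply_chunked(frame, n_chunks=2):
    # Sketch of a parallel applier: split into chunks, apply per
    # chunk, then reassemble by index -- which reorders rows
    # whenever the original index was not sorted.
    bounds = np.array_split(np.arange(len(frame)), n_chunks)
    parts = [
        pd.DataFrame(apply_serial(frame.iloc[b]), index=frame.index[b])
        for b in bounds
    ]
    return pd.concat(parts).sort_index().to_numpy()

lm1 = apply_serial(df)
lm2 = apply_chunked(df)
print(np.array_equal(lm1, lm2))  # False: rows came back in a different order

# Sorting the index first removes the discrepancy.
df_sorted = df.sort_index()
print(np.array_equal(apply_serial(df_sorted), apply_chunked(df_sorted)))  # True
```

This matches the workaround suggested later in the thread: sort the index before applying in parallel.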

Either way, we'll look into this to make sure it's not an issue with PandasParallelLFApplier. If, as I suspect, it's just an issue with multi-label support, we'll put it on the roadmap!

srimugunthan commented 4 years ago

@ajratner @henryre 1) I ran the spam classification example code with PandasParallelLFApplier and the plain PandasLFApplier: https://github.com/srimugunthan/snorkeldebugging/blob/master/spamClassify.ipynb. I do see that the label matrices are different, although the summary metrics are the same.

2) Isn't the multi-task formulation for hierarchical labelling? For the multi-label (same level, many labels) case, we used the approach suggested in this article: https://towardsdatascience.com/using-snorkel-for-multi-label-annotation-cc2aa217986a. We look at the label model's prediction probability values and pick additional labels whose probabilities are close to that of the maximum-probability class. Let me know if this approach can be followed.
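The heuristic described above can be sketched in a few lines of numpy: keep every class whose probability is within some tolerance of the per-row maximum. The `tol` knob and the example probabilities are made up for illustration; the input is assumed to have the shape of a `LabelModel.predict_proba` output (one row of class probabilities per data point).

```python
import numpy as np

def multilabels_from_probs(probs, tol=0.1):
    # For each row, keep every class whose probability is within
    # `tol` of that row's maximum (sketch of the article's
    # heuristic; `tol` is a made-up knob, not a Snorkel parameter).
    max_p = probs.max(axis=1, keepdims=True)
    return [np.flatnonzero(row >= m - tol) for row, m in zip(probs, max_p)]

# Hypothetical predicted probabilities for 3 data points, 3 classes.
probs = np.array([
    [0.48, 0.47, 0.05],   # two classes nearly tied -> multi-label
    [0.90, 0.05, 0.05],   # one clear winner -> single label
    [0.34, 0.33, 0.33],   # near three-way tie -> all three
])
print([list(ls) for ls in multilabels_from_probs(probs)])
# [[0, 1], [0], [0, 1, 2]]
```

Note this post-processes a single-label model's probabilities; it does not change the fact that the label model itself is fit under a single-label assumption.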

3) In the original example notebook I shared (https://github.com/srimugunthan/snorkeldebugging/blob/master/snorkeldebug.ipynb), I see the single-label accuracy shrink by 4 to 6% when multi-label data is added. This is not much, and I am not sure it qualifies as an issue. But you can reproduce it from the notebook and let us know your comments.

henryre commented 4 years ago

Hi @srimugunthan, sorry for the delayed reply! In response to the PandasParallelLFApplier issue, I've opened https://github.com/snorkel-team/snorkel/issues/1589. In the meantime, you can either use the standard PandasLFApplier or sort the index of the original DataFrame before using the PandasParallelLFApplier so that the index matches.
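The suggested workaround is a one-line pandas change before the parallel apply. A minimal sketch, using made-up frames that mirror the thread's setup (the commented `applier.apply` call stands in for the real Snorkel call):

```python
import pandas as pd

# Hypothetical frames: concatenating two subsets yields an index
# that is unique but not sorted.
df_single = pd.DataFrame({"text": ["a", "b"]}, index=[0, 2])
df_multilabel = pd.DataFrame({"text": ["c", "d"]}, index=[1, 3])
df_full = pd.concat([df_single, df_multilabel])

assert df_full.index.is_unique
print(df_full.index.is_monotonic_increasing)  # False -> may trip the parallel applier

# Workaround: sort the index before applying in parallel, e.g.
#   L = applier.apply(df=df_full.sort_index(), n_parallel=8)
df_full = df_full.sort_index()
print(df_full.index.is_monotonic_increasing)  # True
```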
