Closed: srimugunthan closed this issue 4 years ago
Hi, in the original example (where accuracy dropped from 90% to 30%), I found an issue in the code. It happens only when I use PandasParallelLFApplier to get the label matrix; with PandasLFApplier it is fine.
I checked the matrices generated by PandasLFApplier and PandasParallelLFApplier and they were different. Below is the code from the notebook that I used to check.
```python
df_full = pd.concat([df_single, df_multilabel])
df_full.index.is_unique   # True

lm1 = applier.apply(df=df_full)
lm2 = applier_regex.apply(df=df_full, n_parallel=8)

np.array_equal(lm1, lm2)  # False
```
Is there anything I am missing?
Hi @srimugunthan, thanks for surfacing this! At the moment, the master branch version of Snorkel is not configured to support multi-label, though we've certainly applied Snorkel to such settings (e.g. https://www.snorkel.org/blog/superglue via a multi-task formulation). So I'm not surprised there are some issues here. Perhaps, since Snorkel's label model expects a single label, it's just taking e.g. the last one per data point, and this order gets shuffled when the LFs are applied in parallel?
Either way, we'll look into this to make sure it's not an issue with PandasParallelLFApplier. If, as I suspect, it's just an issue with multi-label support, we'll put it on the roadmap!
@ajratner @henryre
1) I have checked in the spam classification example code with both PandasParallelLFApplier and the plain PandasLFApplier: https://github.com/srimugunthan/snorkeldebugging/blob/master/spamClassify.ipynb I do see that the label matrices are different, although the summary metrics are the same.
2) Isn't the multi-task formulation for hierarchical labelling? For the multi-label (same-level, many-labels) case, we used the approach suggested in this article: https://towardsdatascience.com/using-snorkel-for-multi-label-annotation-cc2aa217986a We look at the label model's prediction probability values and pick additional labels whose probabilities are close to that of the maximum-probability class (see the sketch after these points). Let me know if this approach can be followed.
3) In the original example notebook I shared (https://github.com/srimugunthan/snorkeldebugging/blob/master/snorkeldebug.ipynb) I see the single-label accuracy shrink by 4 to 6% when multi-label data is added. This is not much and I am not sure if it qualifies as an issue, but you can reproduce it from the notebook and let us know your comments.
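A rough sketch of the heuristic in point 2), not code from the article or from Snorkel itself; the helper name and the `margin` tolerance are made up for illustration, and only a fitted LabelModel plus a label matrix L are assumed:

```python
import numpy as np

# Hypothetical helper: keep every class whose predicted probability is
# within `margin` of the most probable class for that data point.
def multilabel_predictions(label_model, L, margin=0.05):
    probs = label_model.predict_proba(L)  # shape (n_points, n_classes)
    labels = []
    for row in probs:
        top = row.max()
        labels.append(np.where(row >= top - margin)[0].tolist())
    return labels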
Hi @srimugunthan, sorry for the delayed reply! In response to the PandasParallelLFApplier issue, I've opened up https://github.com/snorkel-team/snorkel/issues/1589. In the meantime, you can either use the standard PandasLFApplier or sort the index of the original DataFrame before using the PandasParallelLFApplier so that the index matches.
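A minimal sketch of that workaround, reusing the `df_full`, `applier`, and `applier_regex` names from the snippet earlier in the thread:

```python
import numpy as np

# Suggested workaround: sort the DataFrame index before applying the LFs so
# the row order of the two label matrices lines up.
df_sorted = df_full.sort_index()

lm1 = applier.apply(df=df_sorted)                      # PandasLFApplier
lm2 = applier_regex.apply(df=df_sorted, n_parallel=8)  # PandasParallelLFApplier

print(np.array_equal(lm1, lm2))  # expected to be True once the index is sorted
```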
This issue is stale because it has been open 90 days with no activity. Remove stale label or comment or this will be closed in 7 days.
Issue description
We have a dataset in which each record has either one label or multiple labels. To verify the label model's predictions, we filtered the original data down to the records with only one label. Fitting the label model (labelmodel.fit) on this single-labelled subset gave an accuracy of more than 90%.
But when we fit the label model on the whole data, the accuracy on the same single-labelled data points dropped drastically to 30%.
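A hedged sketch of the comparison described above; `L_single`, `Y_single`, `L_full`, and `n_classes` are hypothetical placeholders, and the actual code is in the notebook linked under the repro steps:

```python
from snorkel.labeling import LabelModel

# Fit only on the single-labelled subset and score that subset.
lm_subset = LabelModel(cardinality=n_classes, verbose=False)
lm_subset.fit(L_single, seed=123)
acc_subset = lm_subset.score(L=L_single, Y=Y_single, metrics=["accuracy"])["accuracy"]

# Fit on the full data (single- plus multi-labelled points), then score the same subset.
lm_full = LabelModel(cardinality=n_classes, verbose=False)
lm_full.fit(L_full, seed=123)
acc_full = lm_full.score(L=L_single, Y=Y_single, metrics=["accuracy"])["accuracy"]

# Reported behaviour: acc_subset was above 90% while acc_full dropped to around 30%.
```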
Code example/repro steps
I was able to reproduce the bug with a generated label matrix: https://github.com/srimugunthan/snorkeldebugging/blob/master/snorkeldebug.ipynb Although the accuracy drop on the generated data is not as drastic, it illustrates the scenario.
Expected behavior
The subset of data with single labels should have roughly the same accuracy whether the label model is fit on that subset alone or on the full dataset.
System info
Snorkel 0.9.3 on Linux.