snorkel-team / snorkel

A system for quickly generating training data with weak supervision
https://snorkel.org
Apache License 2.0

Conflict vs Overlaps, Correct & Incorrect #1546

Closed: durgeshiitj closed this issue 4 years ago

durgeshiitj commented 4 years ago

I wrote 10 labeling functions (regex-based), and when I run them on the training set, the Overlaps and Conflicts scores are always similar (row-wise, with respect to the respective labels). I don't quite understand what overlaps and conflicts really mean. What would make these two values differ? Also, how are Correct and Incorrect calculated? In my dataset there were originally 1000 data points for one label, but the Correct and Incorrect counts were 900 and 50, which don't add up to 1000. Could I get an answer on this?

paroma commented 4 years ago

Thanks for the question! As described in this tutorial, a labeling function can either assign a label to a data point or abstain. That is likely why only 950 (900 + 50) of the 1000 data points in your dataset received a label. Correct and Incorrect are calculated only over the data points that received a label from the labeling function and that have a ground-truth label associated with them.
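To make the counting concrete, here is a minimal sketch in plain Python. It is not Snorkel's implementation: the helper `correct_incorrect` and the toy vote list are hypothetical, sized to match the numbers in the question, and the only Snorkel convention assumed is that `-1` means abstain.

```python
ABSTAIN = -1  # assumed: Snorkel's convention for "no label"


def correct_incorrect(lf_votes, gold):
    """Count correct, incorrect, and abstained votes for one labeling function.

    Hypothetical helper for illustration; abstentions count toward neither
    Correct nor Incorrect.
    """
    correct = incorrect = abstained = 0
    for vote, truth in zip(lf_votes, gold):
        if vote == ABSTAIN:
            abstained += 1
        elif vote == truth:
            correct += 1
        else:
            incorrect += 1
    return correct, incorrect, abstained


# Toy data (made up): 1000 points whose true label is 1.
gold = [1] * 1000
lf_votes = [1] * 900 + [0] * 50 + [ABSTAIN] * 50

correct, incorrect, abstained = correct_incorrect(lf_votes, gold)
print(correct, incorrect, abstained)  # 900 50 50
```

In this toy case the 50 "missing" points are simply the ones where the labeling function abstained.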

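Overlaps refers to data points that a labeling function labels and that at least one other labeling function also labels, while Conflicts refers to data points where at least one other labeling function assigns a different label. Every conflict is therefore also an overlap, and the two values come out the same whenever the labeling functions that overlap always disagree.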
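The difference between the two columns can be sketched on a tiny label matrix. This is a plain-Python illustration of the definitions above, not Snorkel's own code; the function `overlaps_and_conflicts` and the matrix `L` are hypothetical, and both values are expressed here as fractions of all data points.

```python
ABSTAIN = -1  # assumed: Snorkel's convention for "no label"


def overlaps_and_conflicts(L):
    """Per-LF overlap and conflict fractions.

    L is a list of rows (one per data point); each row holds one vote per
    labeling function. Hypothetical helper for illustration only.
    """
    n_points = len(L)
    n_lfs = len(L[0])
    overlaps = [0] * n_lfs
    conflicts = [0] * n_lfs
    for row in L:
        for j, vote in enumerate(row):
            if vote == ABSTAIN:
                continue
            # Non-abstain votes from the other labeling functions on this point
            others = [v for k, v in enumerate(row) if k != j and v != ABSTAIN]
            if others:
                overlaps[j] += 1
            if any(v != vote for v in others):
                conflicts[j] += 1
    return [o / n_points for o in overlaps], [c / n_points for c in conflicts]


# Toy label matrix (made up): 4 data points, 3 labeling functions.
L = [
    [1, 1, ABSTAIN],        # LF0 and LF1 agree: overlap, no conflict
    [1, 0, ABSTAIN],        # LF0 and LF1 disagree: overlap and conflict
    [ABSTAIN, 0, 0],        # LF1 and LF2 agree: overlap, no conflict
    [1, ABSTAIN, ABSTAIN],  # only LF0 votes: neither
]

overlaps, conflicts = overlaps_and_conflicts(L)
print(overlaps)   # [0.5, 0.75, 0.25]
print(conflicts)  # [0.25, 0.25, 0.0]
```

Overlaps exceeds Conflicts only on rows where labeling functions vote together and agree. If regex-based labeling functions for different labels rarely agree on a shared data point, the two columns will look nearly identical. In Snorkel itself these statistics are reported by `LFAnalysis(L=..., lfs=...).lf_summary(Y=...)`.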