snorkel-team / snorkel

A system for quickly generating training data with weak supervision
https://snorkel.org
Apache License 2.0

Question: Why use the discriminative model at all? #1059

Closed mhigginslp closed 5 years ago

mhigginslp commented 5 years ago

If the generative model produces probabilistic labels, I don't understand why we would feed those into a discriminative model ... the whole purpose was to give a prediction given an input, and we have that after the denoising LFs step.

Why not just use the probabilistic labels themselves?

bhancock8 commented 5 years ago

That's a good question, and perhaps not intuitive at first, but it's a big deal. If you check out the original Snorkel paper (https://arxiv.org/abs/1711.10160) or the associated blog post, we explain some of the reasons why (and we see the same trend in all our applications). One of them is that training the discriminative model allows us to generalize better. Consider, for example, using 20 labeling functions to label some of your data. Those rules may only label 60% of your data; on the other 40%, you have no votes. But those 40% of examples will likely have a lot of features in common with examples in the 60% that you do have noisy labels for. So you label whatever portion you can, and use those labels to learn weights over a much larger, richer feature set (e.g., learned representations in a deep learning model), which can then make predictions on any example, regardless of whether or not it's covered by one or more labeling functions.
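For concreteness, here's a minimal sketch of that pipeline using the current `snorkel.labeling` API (v0.9+, which differs from the 0.7-era API this issue was originally filed against). The LFs, the toy `text` column, and the bag-of-words end model are hypothetical placeholders, not anything from the paper:

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

from snorkel.labeling import labeling_function, PandasLFApplier, filter_unlabeled_dataframe
from snorkel.labeling.model import LabelModel
from snorkel.utils import probs_to_preds

ABSTAIN, NEG, POS = -1, 0, 1

# Two noisy heuristics; each only covers the examples it fires on.
@labeling_function()
def lf_contains_great(x):
    return POS if "great" in x.text.lower() else ABSTAIN

@labeling_function()
def lf_contains_awful(x):
    return NEG if "awful" in x.text.lower() else ABSTAIN

lfs = [lf_contains_great, lf_contains_awful]

# Tiny placeholder DataFrame just to keep the sketch self-contained.
df_train = pd.DataFrame({"text": [
    "great product, works great",
    "awful experience, awful support",
    "pretty great overall",
    "awful packaging",
    "arrived on time",        # no LF covers this one...
    "exactly as described",   # ...or this one
]})

# 1. Apply the LFs -> label matrix L (one row per example, one column per LF).
L_train = PandasLFApplier(lfs=lfs).apply(df_train)

# 2. Fit the generative label model and get probabilistic labels.
label_model = LabelModel(cardinality=2, verbose=True)
label_model.fit(L_train, n_epochs=500, seed=123)
probs_train = label_model.predict_proba(L_train)

# 3. Drop the examples no LF covered...
df_filtered, probs_filtered = filter_unlabeled_dataframe(
    X=df_train, y=probs_train, L=L_train
)

# 4. ...and train a discriminative end model over a richer feature set.
vectorizer = CountVectorizer(ngram_range=(1, 2))
X_train = vectorizer.fit_transform(df_filtered.text)
end_model = LogisticRegression(C=1e3)
end_model.fit(X_train, probs_to_preds(probs_filtered))
```

At prediction time the end model only needs `vectorizer.transform(...)` on the raw text, so it can score any example, including the ones no LF ever voted on.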

ajratner commented 5 years ago

Hi @mhigginslp - just building on top of @bhancock8 's answer, and tagging this as a great FAQ question!

There are three main advantages that you might find from using the predicted labels of the generative model as training labels for another model---what we call the end discriminative model---rather than just using the generative model as your final classifier; for details on the points below, see the Snorkel VLDB 2018 paper that @bhancock8 linked above:

(1) Generalization: As @bhancock8 detailed, the labeling functions often have incomplete coverage, and the end model can generalize beyond them, for example because (a) it learns to put weight on co-occurring features, and/or (b) it has some baked-in semantic knowledge (e.g. pre-trained word embeddings). In our experiments in the VLDB paper, we saw a 43% increase in recall on average from using the discriminative model!

(2) Scaling with Unlabeled Data: Another related advantage is that the end discriminative model will improve significantly with more unlabeled data, letting you take advantage of an often abundant resource. We show this empirically, and also provide the theoretical underpinnings in the NeurIPS 2016 paper.

(3) Cross-Modal Settings: Finally, another major reason is that you might want to train a discriminative model over a different---or even entirely disjoint---set of features than the generative model uses. One example from the paper is a radiology setting, where we write LFs over text reports, but use the resulting labels to train an image model over the associated X-ray images, with the goal of classifying images at test time. Here, we clearly can't use the generative model at all! (A small sketch of this setup follows below.)
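To make the cross-modal point concrete, here's a hedged sketch (not code from the paper): the probabilistic labels come from a label model fit on LF outputs over the text reports (as in the first sketch above), and a standard torchvision image model is trained against them with a soft cross-entropy loss. The names `images`, `probs_train`, and `train_cross_modal` are hypothetical.

```python
import torch
import torch.nn.functional as F
from torchvision.models import resnet18

def train_cross_modal(images, probs_train, n_epochs=5, lr=1e-4):
    """Train an image classifier on probabilistic labels derived from text LFs.

    images:      (n, 3, 224, 224) float tensor of the paired X-ray images.
    probs_train: (n, 2) probabilistic labels from the label model fit on
                 LF outputs over the *text reports*.
    """
    model = resnet18(num_classes=2)
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    targets = torch.as_tensor(probs_train, dtype=torch.float32)
    for _ in range(n_epochs):
        opt.zero_grad()
        logits = model(images)
        # Soft cross-entropy against the probabilistic labels: the image
        # model never sees the text or the LFs, only their denoised votes.
        loss = -(targets * F.log_softmax(logits, dim=1)).sum(dim=1).mean()
        loss.backward()
        opt.step()
    return model  # used at test time on images alone
```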

Of course, this is ultimately an empirical question for your particular setting. If you want to just use the generative model---for example, because you are able to write very high-coverage LFs---definitely feel free to! And thanks for the great question!