snorkel-team / snorkel

A system for quickly generating training data with weak supervision
https://snorkel.org

[Question] Discriminative Model Loss Function #1116

Closed · lazysjb closed this issue 5 years ago

lazysjb commented 5 years ago

Thanks a lot for this project! I'm currently thinking about using Snorkel to label documents for a sentiment analysis project I'm working on, and I have a basic question, as below.

My question is about training the discriminative classifier: I see that there is a custom loss function defined (I'm looking at the TF version): https://github.com/HazyResearch/snorkel/blob/master/snorkel/learning/tensorflow/noise_aware_model.py#L60. I'm wondering whether there is any difference between using this custom loss function and just using a Keras-defined categorical_crossentropy (with the last layer's activation being softmax). I ask because I would like to use my own discriminative classifier after generating the noise-aware labels with Snorkel.
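
To be concrete, here is roughly what I have in mind (a minimal sketch; the model, shapes, and data here are all made up for illustration):

```python
import numpy as np
from tensorflow import keras

# Made-up shapes: 1000 documents as 300-d feature vectors, 3 sentiment classes.
X_train = np.random.randn(1000, 300).astype("float32")
# Stand-in for the probabilistic labels from the generative model
# (i.e. the output of gen_model.marginals): each row is a distribution.
Y_marginals = np.random.dirichlet(np.ones(3), size=1000).astype("float32")

model = keras.Sequential([
    keras.layers.Dense(64, activation="relu", input_shape=(300,)),
    keras.layers.Dense(3, activation="softmax"),
])
# The question: does this off-the-shelf loss, fed the soft marginals as
# targets, differ from Snorkel's custom noise-aware loss?
model.compile(optimizer="adam", loss="categorical_crossentropy")
model.fit(X_train, Y_marginals, epochs=5, batch_size=32)
```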

Thank you!

chaturv3di commented 5 years ago

There's a common misconception that Snorkel is just an efficient labelling tool. The fact is that unless you use noise-aware training, generating noisy labels is useless. The standard loss functions assume a single gold label for every candidate; Snorkel trains with a vector of noisy labels for every candidate.

lazysjb commented 5 years ago

@chaturv3di thank you for the reply, but could you elaborate on this please? (Sorry if this is a very basic question.) I understand that the noisy label (the output of the generative model via genmodel.marginals) is used in training, but I'm still having a hard time understanding the difference between the 'standard loss function' and 'noise-aware training' when training the discriminative classifier in Snorkel. My understanding was that if we have the loss function $CE = -\sum_{i \in C} y_i \log(\hat{y}_i)$ (where $\hat{y}_i$ is the output probability of the softmax layer of the discriminative classifier), then instead of the standard case where $y_i \in \{0, 1\}$, in Snorkel we can just feed continuous values for $y_i$ ($y_i$ would correspond to the marginals output by Snorkel's generative model). Looking at your comment it seems that my understanding is wrong, so it would be very helpful if you could explain it further.

ajratner commented 5 years ago

Hi @lazysjb, excited to hear that you're using Snorkel! See the papers for more detail on the precise definition, but the noise-aware loss function is just the expected (weighted) loss with respect to the probabilistic (e.g. confidence-weighted) training labels.
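
Concretely (my paraphrase of the definition in the paper, not the exact notation there): for a single example $x$ with probabilistic label vector $\tilde{y}$ over classes $C$, the noise-aware loss is the expectation $\mathbb{E}_{y \sim \tilde{y}}[L(h_\theta(x), y)] = \sum_{c \in C} \tilde{y}_c \, L(h_\theta(x), c)$, which for cross-entropy reduces to $-\sum_{c \in C} \tilde{y}_c \log(\hat{y}_c)$.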

However, @chaturv3di, generating labels with Snorkel is most certainly not useless without this loss. The core idea is that labeling functions will have differing accuracies, coverages, and correlations, leading to overlaps and disagreements in their output labels. Snorkel addresses this by learning the accuracies and correlations of the labeling functions, and using these to re-weight and combine their labels, thus resolving conflicts in the labels.

So, while you would not be getting the full benefit of the probabilistic training labels, you could ignore this aspect (e.g. hard-threshold the labels) and still have Snorkel work just fine!

Hope this helps! Happy to clarify more later tonight.

ajratner commented 5 years ago

@chaturv3di I think the confusion here might be that @lazysjb is talking about the output of the generative label model (i.e. once the vector of LF output labels has been combined into a single probabilistic label by Snorkel), not the outputs of the LFs going into the generative model. Either way, happy to chat more about this later!

lazysjb commented 5 years ago

Thank you so much @ajratner, this is really helpful! After taking a look at section 2.3 of https://arxiv.org/pdf/1711.10160.pdf, my understanding of "expected (weighted) loss with respect to the probabilistic (e.g. confidence weighted) training labels" is, for example: if the generative label model outputs [0.1, 0.3, 0.6] for a particular data point (in a 3-class setting), then the noise-aware loss would be

$0.1 \cdot CE(\hat{y}, [1, 0, 0]) + 0.3 \cdot CE(\hat{y}, [0, 1, 0]) + 0.6 \cdot CE(\hat{y}, [0, 0, 1])$

(where $\hat{y}$ is the discriminative model's output probability; please correct me if I'm wrong). Also, would you mind clarifying what you mean by "you could ignore this aspect (e.g. hard threshold)"? Really appreciate it!
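
To sanity-check my reading, here's a quick NumPy computation with the numbers above (my own sketch, nothing Snorkel-specific):

```python
import numpy as np

y_hat = np.array([0.2, 0.3, 0.5])      # discriminative model's softmax output
marginals = np.array([0.1, 0.3, 0.6])  # generative label model's output

def ce(y_true, y_pred):
    """Categorical cross-entropy for a single example."""
    return -np.sum(y_true * np.log(y_pred))

# Expected loss over the one-hot labels, weighted by the marginals...
expected_loss = sum(p * ce(np.eye(3)[c], y_hat) for c, p in enumerate(marginals))

# ...is identical to cross-entropy computed directly against the soft labels.
soft_label_loss = ce(marginals, y_hat)

print(expected_loss, soft_label_loss)  # both ~0.938
assert np.isclose(expected_loss, soft_label_loss)
```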

chaturv3di commented 5 years ago

I stand corrected. I read @lazysjb's question incorrectly, but I'm glad that I still chimed in because this discussion has turned out to be informative. Thanks @ajratner and @lazysjb.

If I may venture a guess regarding the use of "hard threshold", I think Alex means that you could convert probabilistic labels into "standard labels" using some heuristics and then use the standard cross-entropy loss. For example, if the generative model associates a probabilistic label [0.1, 0.3, 0.6] with a record R, you could:

  1. Select the class with the highest probability and create a single training point (R, [0, 0, 1]), or
  2. Assuming a threshold of 0.25, choose all the classes with probabilities above the threshold and create multiple training points (R, [0, 1, 0]) and (R, [0, 0, 1]), or
  3. Some other heuristic.

Does this make sense?
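
For concreteness, a rough sketch of heuristics 1 and 2 in plain NumPy (the 0.25 threshold is just the example value from above):

```python
import numpy as np

marginals = np.array([0.1, 0.3, 0.6])  # probabilistic label for record R
n_classes = len(marginals)

# Heuristic 1: argmax -> a single one-hot "standard" label.
hard_label = np.eye(n_classes)[np.argmax(marginals)]
print(hard_label)  # [0. 0. 1.]

# Heuristic 2: keep every class above a threshold, one training point each.
threshold = 0.25
multi_labels = [np.eye(n_classes)[c] for c in np.where(marginals > threshold)[0]]
print(multi_labels)  # [array([0., 1., 0.]), array([0., 0., 1.])]
```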

lazysjb commented 5 years ago

@chaturv3di I see, that makes sense to me. Thank you very much!

ajratner commented 5 years ago

@chaturv3di thanks for the clarification above, and agreed, it's turned into a useful conversation! :) And @lazysjb, your understanding above LGTM!

lazysjb commented 5 years ago

@ajratner I wanted to reopen this as I am a bit confused after looking at the noise-aware loss functions. In the PyTorch version, the implementation matches my understanding of what we discussed in the thread above. (https://github.com/HazyResearch/snorkel/blob/master/snorkel/learning/pytorch/noise_aware_model.py#L21)

However, in the TF version, the implementation seems to differ in that it doesn't compute the 'weighted' training loss (please correct me if I'm wrong). (https://github.com/HazyResearch/snorkel/blob/master/snorkel/learning/tensorflow/noise_aware_model.py#L60)

If my interpretation is correct, is there a particular reason for the difference between the two?

I'm asking because I would like to use my own discriminative classifier (in Keras/TF) with a custom loss function.

Thank you!

ajratner commented 5 years ago

Hey @lazysjb, it ends up being the same thing, just different implementations! I believe there's an explanation here: https://hazyresearch.github.io/snorkel/blog/dp_with_tf_blog_post.html. Hope this helps! Closing for now, but feel free to re-open.
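
In sketch form (not the exact Snorkel code, just the idea), the two formulations compute the same quantity:

```python
import numpy as np

logits = np.array([1.0, 2.0, 0.5])     # discriminative model logits
marginals = np.array([0.1, 0.3, 0.6])  # probabilistic training label

softmax = np.exp(logits) / np.exp(logits).sum()

# PyTorch-style: explicit expected cross-entropy over the one-hot labels.
pytorch_style = -sum(p * np.log(softmax[c]) for c, p in enumerate(marginals))

# TF-style: softmax cross-entropy with the soft marginals as targets, which is
# what tf.nn.softmax_cross_entropy_with_logits computes when labels are soft.
tf_style = -np.sum(marginals * np.log(softmax))

assert np.isclose(pytorch_style, tf_style)  # same loss, different formulations
```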