Closed: lazysjb closed this issue 5 years ago
There's a common misconception that Snorkel is just an efficient labeling tool. The fact is that unless you use noise-aware training, generating noisy labels is useless. The standard loss functions assume a single, golden label for every candidate; Snorkel trains with a vector of noisy labels for every candidate.
@chaturv3di thank you for the reply, but could you elaborate on this please? (Sorry if this is a very basic question.) I understand that the noisy label (the output of the generative model via `genmodel.marginals`) is used in training, but I'm still having a hard time understanding the difference between the 'standard loss function' and 'noise-aware training' in Snorkel's discriminative classifier. My understanding was that with a cross-entropy loss $CE = -\sum_{i \in C} y_i \log(\hat{y}_i)$ (where $\hat{y}_i$ is the output probability of the discriminative classifier's softmax layer), instead of the standard case where $y_i \in \{0, 1\}$, in Snorkel we can just feed continuous values for $y_i$ (corresponding to the marginals output by Snorkel's generative model). Looking at your comment it seems my understanding is wrong, so it would be very helpful if you could explain it further.
Hi @lazysjb, excited to hear that you're using Snorkel! See the papers for the precise definition, but the noise-aware loss function is just the expected (weighted) loss with respect to the probabilistic (i.e. confidence-weighted) training labels.
However, @chaturv3di, generating labels with Snorkel is most certainly not useless without this loss. The core idea is that labeling functions will have differing accuracies, coverages, and correlations, leading to overlaps and disagreements in their output labels. Snorkel addresses this by learning the accuracies and correlations of the labeling functions, and using them to re-weight and combine their labels, thus resolving conflicts in the labels.
So, while you would not be getting the full benefit of the probabilistic training labels, you could ignore this aspect (e.g. apply a hard threshold) and still have Snorkel work just fine!
Hope this helps; happy to clarify more later tonight!
@chaturv3di I think the confusion here might be that @lazysjb is talking about the output of the generative label model (i.e. once the vector of LF output labels has been combined into a single label by Snorkel), not the output of the LFs going into the generative model? Either way, happy to chat more about this later!
Thank you very much @ajratner, this is really helpful!
So after taking a look at Section 2.3 of https://arxiv.org/pdf/1711.10160.pdf, my understanding of "expected (weighted) loss with respect to the probabilistic (e.g. confidence-weighted) training labels" is, for example: if the generative label model outputs [0.1, 0.3, 0.6] for a particular data point (in a 3-class setting), then the noise-aware loss would be
0.1 * CE(y_hat, y=[1, 0, 0]) + 0.3 * CE(y_hat, y=[0, 1, 0]) + 0.6 * CE(y_hat, y=[0, 0, 1])
(where y_hat is the discriminative model's output probability; please correct me if I'm wrong, though).
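That weighted sum can be checked with a quick NumPy sketch (the marginals and predicted probabilities below are made-up numbers for illustration, not output from any Snorkel model):

```python
import numpy as np

def ce(y_hat, y):
    """Standard cross-entropy against a one-hot label y."""
    return -np.sum(y * np.log(y_hat))

marginals = np.array([0.1, 0.3, 0.6])  # generative label model output (assumed)
y_hat = np.array([0.2, 0.2, 0.6])      # discriminative model softmax output (assumed)

# Noise-aware loss: expected cross-entropy over the one-hot labels,
# weighted by the probabilistic label.
one_hots = np.eye(3)
noise_aware = sum(p * ce(y_hat, e) for p, e in zip(marginals, one_hots))
print(round(noise_aware, 4))  # 0.9503
```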
Also, would you mind clarifying what you mean by "you could ignore this aspect (e.g. hard threshold)", please?
Really appreciate it!
I stand corrected. I read @lazysjb's question incorrectly, but I'm glad that I still chimed in because this discussion has turned out to be informative. Thanks @ajratner and @lazysjb.
If I may venture a guess regarding the use of "hard threshold", I think Alex means that you could convert probabilistic labels into "standard labels" using some heuristic and then use the standard cross-entropy loss. For example, if the generative model associates the probabilistic label [0.1, 0.3, 0.6] with a record R, you could:
- pick the class with the highest probability and create the single training point (R, [0, 0, 1]), or
- pick a threshold, e.g. 0.25, choose all the classes with probabilities above the threshold, and create multiple training points (R, [0, 1, 0]) and (R, [0, 0, 1]).

Does this make sense?
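A small sketch of those two heuristics, using a hypothetical `hard_labels` helper (not part of Snorkel's API):

```python
import numpy as np

def hard_labels(marginals, threshold=None):
    """Convert a probabilistic label into one-hot 'standard' labels.

    With no threshold, pick the argmax class; with a threshold, emit one
    one-hot label per class whose probability exceeds it (so one record
    can yield multiple training points).
    """
    n = len(marginals)
    if threshold is None:
        return [np.eye(n)[int(np.argmax(marginals))]]
    return [np.eye(n)[k] for k, p in enumerate(marginals) if p > threshold]

labels = hard_labels([0.1, 0.3, 0.6])                 # one point labeled [0, 0, 1]
multi = hard_labels([0.1, 0.3, 0.6], threshold=0.25)  # points labeled [0, 1, 0] and [0, 0, 1]
```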
@chaturv3di I see, that makes sense to me, thank you very much!
@chaturv3di thanks for the clarification above, and agreed, it's turned into a useful conversation! :) And @lazysjb, your understanding above LGTM!
@ajratner I wanted to reopen this as I am a bit confused after looking at the noise-aware loss functions. In the PyTorch version, the implementation seems to match my understanding from the discussion above: https://github.com/HazyResearch/snorkel/blob/master/snorkel/learning/pytorch/noise_aware_model.py#L21
However, in the TF version, the implementation seems to be different in that it doesn't compute the weighted training loss (please correct me if I'm wrong): https://github.com/HazyResearch/snorkel/blob/master/snorkel/learning/tensorflow/noise_aware_model.py#L60
If my interpretation is correct, is there a particular reason for the difference between the two?
The reason I am asking is that I would like to use my own discriminative classifier (in Keras/TF) with a custom loss function.
Thank you!
Hey @lazysjb it ends up being the same thing, just different implementations! I believe there's an explanation here: https://hazyresearch.github.io/snorkel/blog/dp_with_tf_blog_post.html. Hope this helps, closing for now but feel free to re-open!
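The equivalence Alex mentions can also be checked numerically: by linearity of expectation, the explicit expected loss over one-hot labels (the PyTorch-style weighting) equals the cross-entropy computed directly against the soft probabilistic target, which, as I understand the linked post, is effectively what the TF version computes. A NumPy sketch with made-up numbers:

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax."""
    e = np.exp(z - z.max())
    return e / e.sum()

logits = np.array([2.0, 0.5, -1.0])    # assumed discriminative-model logits
marginals = np.array([0.1, 0.3, 0.6])  # assumed probabilistic training label
y_hat = softmax(logits)

# PyTorch-style: explicit expected loss over the one-hot labels.
expected_loss = sum(p * -np.log(y_hat[k]) for k, p in enumerate(marginals))

# TF-style: cross-entropy directly against the soft target.
soft_target_loss = -np.sum(marginals * np.log(y_hat))

print(np.isclose(expected_loss, soft_target_loss))  # True
```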
Thanks a lot for this project! I'm currently thinking about using Snorkel to label documents for a sentiment analysis project I'm working on, and I have a basic question.
When training a discriminative classifier, I see that there is a custom loss function defined (I'm looking at the TF version): https://github.com/HazyResearch/snorkel/blob/master/snorkel/learning/tensorflow/noise_aware_model.py#L60
I'm just wondering if there is any difference between using this custom loss function and just using a Keras-defined `categorical_crossentropy` (with the last layer's activation being `softmax`). The reason I ask is that I would like to use my own discriminative classifier after generating the noise-aware labels using Snorkel. Thank you!