snorkel-team / snorkel

A system for quickly generating training data with weak supervision
https://snorkel.org
Apache License 2.0

Evaluating GenerativeModel performance #953

Closed littlewine closed 5 years ago

littlewine commented 6 years ago

I am having trouble choosing an appropriate way to evaluate the GenerativeModel performance. I am using a Grid/RandomSearch to choose the best hyperparameters, which currently picks the best model by F1 score. However, I am getting a lot of probabilistic labels very close to 0 or 1 (along with an F1 score of ~60-64% on the validation/development set). I still haven't figured out whether this is due to the hyperparameters, or due to the high coverage/accuracy or the similarity of my LFs.

[image gm1: distribution of the generative model's probabilistic labels]

From my point of view, the problem with the label distribution above is that its confidence of being correct is too high. In other words, if I feed those labels into my discriminative model they will falsely be perceived as gold labels, which will lead to bad performance.

I can manage to get different distributions by changing the GM hyperparameters, without the F1 score changing dramatically (at least not in all cases). My point being, I would probably prefer an F1 score that is ~5-10% lower together with a more (sort of) uniform distribution of labels.

[images gm4, gm3, gm2, gm5: alternative label distributions under different hyperparameter settings]
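As a rough illustration of how these marginal distributions might be compared visually, here is a minimal sketch (assuming the old Snorkel 0.7-style `GenerativeModel.marginals()` API; `gen_models` and `L_dev` are illustrative names, not from the thread):

```python
import matplotlib.pyplot as plt

# Illustrative names: gen_models maps a hyperparameter-setting label to a trained
# GenerativeModel, and L_dev is the dev-set label matrix.
fig, axes = plt.subplots(1, len(gen_models), figsize=(4 * len(gen_models), 3), sharey=True)
for ax, (name, gm) in zip(axes, gen_models.items()):
    marginals = gm.marginals(L_dev)        # probabilistic labels P(y=1) per candidate
    ax.hist(marginals, bins=20, range=(0, 1))
    ax.set_title(name)
    ax.set_xlabel("P(y=1)")
plt.tight_layout()
plt.show()
```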

However, it is neither clear nor straightforward how to pick the most suitable distribution, and trying them all out, each followed by another grid search for the discriminative model, feels like overkill.

This got me thinking that the F1 score (alone) might not be a good metric for deciding which GM hyperparameters to choose. For this reason, I tried using the logistic loss instead of the F1 score, but picking the best parameters was still not straightforward. What is your opinion on that? Do you believe the log loss, or maybe a squared log loss, would be a more appropriate metric, or could you propose something else instead? Any comments regarding the label distributions posted above? Are there any particular parameters you would suggest lowering/raising? And does any of the distributions above seem more reasonable to use for some reason?
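To make the alternative concrete, here is a hedged sketch of scoring each hyperparameter setting with both F1 (on thresholded marginals) and log loss, a proper scoring rule that penalizes overconfident mistakes; `gm`, `L_dev`, and `y_dev` are assumed names (with `y_dev` as 0/1 gold labels), and `marginals()` again assumes the 0.7-style API:

```python
from sklearn.metrics import f1_score, log_loss

def score_generative_model(gm, L_dev, y_dev, threshold=0.5):
    """Score a trained generative model on the dev set with both a hard-label F1
    and a calibration-sensitive log loss (hypothetical helper; y_dev is 0/1)."""
    marginals = gm.marginals(L_dev)                  # probabilistic labels in [0, 1]
    y_pred = (marginals > threshold).astype(int)     # hard predictions for F1
    return {
        "f1": f1_score(y_dev, y_pred),
        "log_loss": log_loss(y_dev, marginals, labels=[0, 1]),
    }
```

Picking by log loss rather than F1 would favor hyperparameter settings whose marginals are well calibrated, not merely correct after thresholding.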

chrismre commented 6 years ago

Great points and question! I'll leave this to the experts, but two quick comments:

ajratner commented 6 years ago

Hi @littlewine

This is a great question and line of thinking - thanks for posting!

First of all: if possible, I'm curious about basic stats of your labeling functions - how many, what degree of overlap/conflict, and do you have some idea of how noisy you expect them to be on average? In general, if you have a bunch of pretty good labeling functions, you'd expect the majority of their labels to be correct, making the original distribution you showed potentially reasonable. Then, as Chris mentioned, the goal of the discriminative model is to generalize beyond the points they label.
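For anyone trying to pull those stats together, a minimal numpy sketch (not the Snorkel API; it assumes a dense label matrix `L` of shape (n_candidates, n_LFs) with values in {-1, 0, +1}, where 0 means abstain):

```python
import numpy as np

def lf_summary(L):
    """Per-LF coverage, overlap, and conflict rates for a dense label matrix L of
    shape (n_candidates, n_LFs) with values in {-1, 0, +1} (0 = abstain)."""
    labeled = L != 0
    coverage = labeled.mean(axis=0)                    # fraction of candidates each LF labels
    multi = labeled.sum(axis=1, keepdims=True) > 1     # rows labeled by at least 2 LFs
    overlaps = (labeled & multi).mean(axis=0)          # LF labels and is not alone on the row
    disagree = (L == 1).any(axis=1, keepdims=True) & (L == -1).any(axis=1, keepdims=True)
    conflicts = (labeled & disagree).mean(axis=0)      # LF labels and some other LF disagrees
    return coverage, overlaps, conflicts
```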

Second: are you reporting (a) the generative model or (b) the discriminative model scores in your description above? It seems like we'd want to see how the different label distributions you explore affect (b); we wouldn't expect the distribution shape to affect (a) much.

Agree overall that this is a really cool area to look into further! Let us know what you find!

littlewine commented 6 years ago

Thank you for your replies.

What I am doing is experimenting with using ML classifiers (pretrained on a smaller gold set) as LFs. To some extent, it is like trying to bootstrap more (probabilistic) training data for the LSTM, or to build a model ensemble with data programming. This smaller gold training set consists of about 6.5K (balanced) candidates (originally 13K candidates with a 1:4 class imbalance). The unlabelled set I am trying to use for denoising & training the discriminative model consists of 80K candidates. I should note that I have a concern about this unlabelled set: a) the class imbalance could be larger there due to irrelevant documents, or b) my ML classifiers might not generalize well to those external documents. The following histogram corresponds to an unweighted average of the predictions of the classifiers I am using (which suggests a 1:14 class imbalance and which I realize is problematic; I'll come back to that later):

[image: unweighted voting marginals]
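A sketch of how that unweighted-average histogram and the implied class balance could be produced; `probs` is an assumed (n_candidates, n_classifiers) array holding each classifier LF's predicted P(y=1) on the unlabelled candidates:

```python
import matplotlib.pyplot as plt

# probs: (n_candidates, n_classifiers) array of P(y=1) from each classifier LF (assumed name)
avg_vote = probs.mean(axis=1)                      # unweighted average prediction per candidate
pos_rate = (avg_vote > 0.5).mean()                 # fraction leaning positive on average
print(f"implied class balance ~ 1:{(1 - pos_rate) / pos_rate:.0f}")

plt.hist(avg_vote, bins=20, range=(0, 1))
plt.xlabel("unweighted average P(y=1)")
plt.ylabel("# candidates")
plt.show()
```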

When building my classifiers, I am trying to make them capture different "views" of the data, while avoiding (to the extent that it's possible) using too many classifiers of one type (e.g. BOW) versus another (e.g. LSTM). At the moment, I am using 12 different classifiers (LFs) with F1 scores ranging from 50-60%. I have them vote on everything (100% coverage), and they agree with each other (pairwise) 79% of the time on average on the validation set (with a class imbalance again of about 1:4).
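For reference, that pairwise agreement figure can be computed directly from the label matrix; a small sketch assuming the same dense {-1, 0, +1} matrix `L` as above (with 100% coverage there are no abstains, so agreement is plain equality):

```python
import numpy as np
from itertools import combinations

def mean_pairwise_agreement(L):
    """Average pairwise agreement between the columns of L (n_candidates x n_LFs)."""
    pairs = combinations(range(L.shape[1]), 2)
    rates = [np.mean(L[:, i] == L[:, j]) for i, j in pairs]
    return float(np.mean(rates))
```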

Second: are you reporting (a) the generative model or (b) discriminative model scores in your description above?

@ajratner Yes, the scores I reported above are the scores of the generative model.

Seems like we'd want to see how the different label distributions you explore affect (b); we wouldn't expect the distribution shape to affect (a) much

I'm not sure I understood that part correctly: you do not expect the label distributions to affect the generative model or the discriminative model?

Also, let me give you some additional details regarding the experiments I've done so far. First off, if I train Snorkel's default bi-LSTM on the gold training set (0/1 labels), I get a 55% F1 score. Then, if I undersample (or set rebalance=True) and feed in the first distribution of probabilistic labels (which is really close to 0-1), I get back a ~30-35% F1 score. If I include the original gold training set (on which the ML classifiers used as LFs were initially trained) in the LSTM training set, I can increase the LSTM F1 score to 43.2% (which is still much lower than if I train only on the small gold set).
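A hedged sketch of the "include the gold set" variant described above, with illustrative names (`X_gold`/`y_gold` for the gold candidates and their labels, `X_unlabeled`/`train_marginals` for the unlabelled candidates and their generative-model marginals):

```python
import numpy as np

# Illustrative names, not from the thread: the gold 0/1 labels are treated as
# "certain" marginals and concatenated with the probabilistic labels, keeping the
# candidate lists in the same order.
X_combined = X_gold + X_unlabeled                       # gold candidates first, then unlabelled
y_combined = np.concatenate([
    y_gold.astype(float),                               # gold labels as hard 0/1 marginals
    train_marginals,                                    # probabilistic labels from the GM
])
# y_combined can then serve as the (probabilistic) training targets for the bi-LSTM.
```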

I am also not sure whether the size of the (unlabeled) training set is still appropriate, as after the undersampling/rebalancing it has been reduced to a size comparable to that of the gold training set.

Edit: I am also not sure to what extent the class imbalance and the high accuracy (which is further reinforced by the class imbalance) affect the results of the generative model. I quote from the documentation of GenerativeModel.learned_lf_stats():

        WARNING: This uses Gibbs sampling to estimate these values. This will
                 tend to mix poorly when there are many very accurate labeling
                 functions. In this case, this function will assume that the
                 classes are approximately balanced.

For this reason, I tried to "balance" the unlabeled dataset based on the average votes of the ML classifiers. However, I am not sure whether this was a good choice.
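For concreteness, one way that "balancing" step could look as a sketch only: it subsamples the majority-leaning side based on the average classifier vote, reusing the illustrative `avg_vote` from above, with `L_unlabeled` as the label matrix over the 80K candidates (whether this is a good idea is exactly the open question):

```python
import numpy as np

rng = np.random.default_rng(0)
pos_idx = np.where(avg_vote > 0.5)[0]            # candidates the classifiers lean positive on
neg_idx = np.where(avg_vote <= 0.5)[0]           # the (much larger) negative-leaning side

# Subsample the negative-leaning side down to the positive count so the
# generative model sees a roughly balanced unlabelled set.
neg_keep = rng.choice(neg_idx, size=len(pos_idx), replace=False)
balanced_idx = np.sort(np.concatenate([pos_idx, neg_keep]))
L_balanced = L_unlabeled[balanced_idx]
```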

ajratner commented 6 years ago

Hi @littlewine, this sounds very interesting! I'll think more about your overall experiment (which seems cool) later, but here are some quick answers/points right now:

bartgras commented 6 years ago

@littlewine I wonder how did your experiment go. Can you share some results?

littlewine commented 5 years ago

Hi @bartgras, I have submitted this work for publication (it's currently under review), so unfortunately I cannot share much with you right now. All I can say is that it looks promising, but there are certain drawbacks/imperfections still to be solved. Nonetheless, I will make sure to come back and share the paper here once it's accepted.

In the meantime, if you have any more questions or would like to discuss your use case, let me know!

bartgras commented 5 years ago

@littlewine No worries. Please share once published.

ajratner commented 5 years ago

@littlewine Definitely excited to see the work whenever you can share! Best of luck with the submission!

ajratner commented 5 years ago

Closing for now - but definitely keep us updated!!