snorkel-team / snorkel

A system for quickly generating training data with weak supervision
https://snorkel.org
Apache License 2.0
5.81k stars, 857 forks

Question: models that handle (or not) probabilistic labels #1574

Status: Closed (RicSpd closed 4 years ago)

RicSpd commented 4 years ago

Hi, after reading your papers and the first of your tutorials, I'm still not sure which models can handle probabilistic labels and which cannot. So far, I have made the following distinction:

Is this distinction correct or can it be integrated/improved?


I also have a second question on this topic. Suppose I need to discretize the probabilities returned by predict_proba(), for instance by assigning label 1 to every observation whose positive-class probability from the LabelModel is larger than a threshold t. Does it make sense to take a validation set with gold labels (distinct from the development and test sets), tune t to maximize accuracy or F1-score on that validation set, and then apply the tuned threshold to discretize the predicted probabilities of the unlabeled training set as well?


I hope I've presented my questions clearly; if not, I will edit them.

P.S. Great job with the Snorkel project, I find the applications very interesting and useful!

ajratner commented 4 years ago

Hi @RicSpd, thanks for the great questions!

Re: the first one: any supervised machine learning model that is trained in the standard way (to maximize the expected probability, i.e. the likelihood, of the training data) can be modified to accept probabilistic labels. Intuitively, it just re-weights how much each possible label counts in the training objective. As examples, we support logistic regression as a default model in the repo, and many others have been used in the Snorkel OSS community!
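To make the re-weighting idea concrete, here is a minimal sketch of a binary logistic regression trained directly on probabilistic labels. It is only an illustration in plain Python, not Snorkel's own implementation: the function names (`soft_cross_entropy`, `fit_logreg_soft`), the toy data, and the hyperparameters are all assumptions made for this example. The loss is the usual cross-entropy with the hard label replaced by the probability `p_soft = P(y = 1)`, so each of the two possible labels contributes in proportion to its probability.

```python
import math


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def soft_cross_entropy(p_soft, q_pred, eps=1e-12):
    """Expected negative log-likelihood when the label is 1 with probability p_soft."""
    q = min(max(q_pred, eps), 1.0 - eps)
    return -(p_soft * math.log(q) + (1.0 - p_soft) * math.log(1.0 - q))


def fit_logreg_soft(X, p_soft, lr=0.5, epochs=500):
    """Batch gradient descent on the mean soft cross-entropy (hypothetical helper)."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for x, p in zip(X, p_soft):
            q = sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)
            err = q - p  # derivative of the soft cross-entropy w.r.t. the logit
            for j in range(d):
                grad_w[j] += err * x[j]
            grad_b += err
        w = [wj - lr * g / n for wj, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


# Toy data (made up): one feature, with probabilistic labels such as
# those a label model might produce.
X = [[0.0], [1.0], [2.0], [3.0]]
p_soft = [0.1, 0.3, 0.7, 0.9]
w, b = fit_logreg_soft(X, p_soft)
```

With hard labels (`p_soft` equal to 0 or 1) this reduces to ordinary logistic regression, which is why the same change carries over to most models trained with a likelihood-based loss.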

Re: the second question: yes, you can definitely do that!
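A minimal sketch of the threshold-tuning procedure described in the question, assuming binary labels and a small gold-labeled validation set. Everything here is illustrative: the helper names (`f1_score_binary`, `discretize`, `tune_threshold`), the candidate grid, and the probabilities are made up for the example, and in practice the probabilities would come from the LabelModel's predict_proba().

```python
def f1_score_binary(y_true, y_pred):
    """F1 for the positive class (label 1); returns 0.0 if there are no true positives."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def discretize(probs, t):
    """Assign label 1 when the positive-class probability is larger than t."""
    return [1 if p > t else 0 for p in probs]


def tune_threshold(val_probs, val_gold, candidates=None):
    """Pick the candidate threshold with the best validation F1 (first one wins ties)."""
    if candidates is None:
        candidates = [i / 100 for i in range(5, 100, 5)]  # assumed grid: 0.05 .. 0.95
    best_t, best_f1 = 0.5, -1.0
    for t in candidates:
        f1 = f1_score_binary(val_gold, discretize(val_probs, t))
        if f1 > best_f1:
            best_t, best_f1 = t, f1
    return best_t, best_f1


# Made-up positive-class probabilities and gold labels for a validation set.
val_probs = [0.12, 0.33, 0.42, 0.58, 0.66, 0.91]
val_gold = [0, 0, 1, 1, 1, 1]
best_t, best_f1 = tune_threshold(val_probs, val_gold)

# Apply the tuned threshold to the (unlabeled) training-set probabilities.
train_probs = [0.2, 0.45, 0.8]
train_labels = discretize(train_probs, best_t)
```

Keeping the validation set separate from the test set, as the question proposes, means the test score is not biased by the choice of t.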

Thanks! Alex