snorkel-team / snorkel

A system for quickly generating training data with weak supervision
https://snorkel.org
Apache License 2.0

What prevents the machine learning model from overfitting to the labeling functions? #1705

Closed alimamdouh212 closed 2 years ago

alimamdouh212 commented 2 years ago

In other words, can any machine learning model generalize beyond the labeling functions? In the "data programming" paper it was assumed that the true class label is independent of the features given the outputs of the labeling functions. As I understand it, that means all of the predictive power in the data is encoded in the labeling functions. If so, why train a model at all, rather than just running the labeling functions?

humzaiqbal commented 2 years ago

Hi alimamdouh212,

There are multiple scenarios in which the end model does better. First, it can capture patterns that aren't expressed in any of the LFs (but that happen to correlate with the ones the LFs do capture). Also, LFs are sometimes not servable at test time, while the end model is trained on features that are. For example, you might train an image model on labels produced by text LFs applied to (image, text) pairs: at training time the text is available, but at test time you only see images, so the LFs can't vote at all. We touch on this and more in a blog post on using Snorkel to build AI applications: https://snorkel.ai/how-to-use-snorkel-to-build-ai-applications/
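To make that second scenario concrete, here is a minimal sketch of the workflow (the toy dataframe, its "caption" and "img_features" columns, and the two LFs are assumptions for illustration, not code from this repo): text-based LFs label the training set, the LabelModel aggregates their votes into probabilistic labels, and the end model is trained on image features only, so it can be served on images alone.

```python
# Minimal sketch: text LFs supervise an image model.
# Toy data and column names are assumptions for illustration.
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from snorkel.labeling import labeling_function, PandasLFApplier
from snorkel.labeling.model import LabelModel

ABSTAIN, NEG, POS = -1, 0, 1

# Toy (image, text) training pairs; at test time only image features exist.
df_train = pd.DataFrame({
    "caption": ["a dog in the park", "my cat sleeping", "a dog on the beach", "grumpy cat"],
    "img_features": [[0.9, 0.1], [0.2, 0.8], [0.7, 0.3], [0.1, 0.9]],
})

@labeling_function()
def lf_mentions_dog(x):
    # Text-based LF: not servable when only the image is available.
    return POS if "dog" in x.caption.lower() else ABSTAIN

@labeling_function()
def lf_mentions_cat(x):
    return NEG if "cat" in x.caption.lower() else ABSTAIN

# Apply the LFs and aggregate their noisy votes into probabilistic labels.
L_train = PandasLFApplier(lfs=[lf_mentions_dog, lf_mentions_cat]).apply(df=df_train)
label_model = LabelModel(cardinality=2, verbose=False)
label_model.fit(L_train=L_train, n_epochs=100, seed=123)
probs_train = label_model.predict_proba(L=L_train)

# The end model only ever sees image features, so it stays servable on
# images alone even though the supervision came from the captions.
X_img = np.array(df_train["img_features"].to_list())
end_model = LogisticRegression().fit(X_img, probs_train.argmax(axis=1))
```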

The assumption in the paper was made to prove bounds on the rate at which a particular algorithm (e.g., SGD) produces good estimates of the LF accuracy parameters. Generalization can happen with or without this assumption, though.
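For reference, paraphrasing the assumption as you stated it (with $Y$ the true label, $X$ the features, and $\lambda_1, \ldots, \lambda_m$ the labeling functions), it can be written as:

```latex
% Conditional independence of the label and the features given the LF outputs,
% as paraphrased from the question above:
P(Y \mid X, \lambda_1(X), \ldots, \lambda_m(X)) = P(Y \mid \lambda_1(X), \ldots, \lambda_m(X))
```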

Hope this helps!

alimamdouh212 commented 2 years ago

@humzaiqbal thanks for your response. But won't a discriminative model that simply mimics the LFs always outperform any other model in the eyes of the loss function derived from the LFs? Does Snorkel apply some kind of regularization to prevent this?

alimamdouh212 commented 2 years ago

What I am saying is that the noise isn't random; there is a pattern in the data that the noise follows.

bhancock8 commented 2 years ago

Good question, @alimamdouh212. I agree with all the points in Humza's response. To add some more detail around your follow-on question:

1) In general, we recommend not passing the LF votes through directly as additional features for the end model, so that it cannot simply assign high weights to those features.
2) Beyond that, as you suggest, nearly all modern discriminative ML models include regularization terms, which prevent the model from placing too much emphasis on a very small number of features (see the sketch below).
3) Because it is often advantageous to use what we call "non-servable" features in LFs (i.e., features that won't be available at test time, so they can be used to supervise but should not be used for model training), there is an extra forcing function: the model has to learn to depend on other features that collectively correlate strongly with the features used to shape the dataset, since it cannot learn weights for those features directly.
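A minimal sketch of points 1) and 2), reusing the hypothetical df_train and probs_train from the sketch earlier in this thread (the feature choice and regularization strength are assumptions, not a prescribed setup):

```python
# Minimal sketch of points 1) and 2); df_train and probs_train are the
# hypothetical objects from the sketch earlier in the thread.
import numpy as np
from sklearn.linear_model import LogisticRegression

# 1) The feature matrix is built from the raw data only; the LF vote matrix
#    L_train is deliberately NOT concatenated into X, so the end model
#    cannot simply copy the LF outputs.
X = np.array(df_train["img_features"].to_list())

# 2) Use an explicitly L2-regularized model (smaller C = stronger
#    regularization) so no small set of features -- including any that
#    happen to mimic an LF -- can dominate the learned weights.
end_model = LogisticRegression(penalty="l2", C=0.5)
end_model.fit(X, probs_train.argmax(axis=1))
```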

github-actions[bot] commented 2 years ago

This issue is stale because it has been open 90 days with no activity. Remove stale label or comment or this will be closed in 7 days.