snorkel-team / snorkel

A system for quickly generating training data with weak supervision
https://snorkel.org
Apache License 2.0

Documentation on LabelModel Hyperparameter Tuning #1664

Closed: hardianlawi closed this issue 2 years ago

hardianlawi commented 3 years ago

Is your feature request related to a problem? Please describe.

I understand that we can change hyperparameters such as lr, l2, and so forth, as listed here. However, I have been having trouble understanding the impact of these hyperparameters on the generated labels. In traditional supervised learning, we can increase l2 to avoid overfitting to the data, but the notion of overfitting is less clear here. I suppose it is similar to training a self-supervised representation, where the only way to evaluate everything in the end is through the downstream task. I saw this tutorial, which does not use the default hyperparameters but does not provide any rationale for the values it chooses.

Describe the solution you'd like

It would be great if there were documentation or a tutorial showing how the hyperparameters affect the labels generated by LabelModel. Any pointers or references would also be very much appreciated.

vkrishnamurthy11 commented 3 years ago

Thanks for letting us know, @hardianlawi. We will be sure to add more resources and documentation for this section.

hardianlawi commented 3 years ago

@vkrishnamurthy11 Do you know of any references that have tried to explain this before?

fredsala commented 3 years ago

Hi Hardian!

There’s no single reference to look up; however, you can gain intuition for the effects of these hyperparameters by comparing them to their counterparts in training a supervised model, and then relating the learned model to the generated labels.

To understand the comparison, the following is a useful way to think about what the label model does. The label model tries to learn parameters mu for a joint distribution P_mu over the labeling functions (and the latent label). The goal is for the learned P_mu to have agreement/disagreement rates (i.e., P(LF_i = LF_j) over all pairs i, j) close to the ones we observe from the labeling function outputs. The loss therefore involves these agreement/disagreement rates, and we optimize it with conventional optimizers (e.g., SGD, Adam).
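For intuition, here is a minimal numpy sketch (not Snorkel's internal implementation) of the empirical agreement rates the fit is trying to match. L_train is a hypothetical (n_points, n_LFs) matrix of labeling function outputs, with -1 meaning "abstain":

```python
import numpy as np

# Hypothetical LF output matrix: n_points x n_LFs, with -1 meaning "abstain".
L_train = np.array([
    [ 0,  1, -1],
    [ 1,  1,  1],
    [ 0,  0, -1],
    [ 1, -1,  1],
    [ 0,  0,  0],
])

n_points, n_lfs = L_train.shape
agreement = np.zeros((n_lfs, n_lfs))

# Empirical P(LF_i == LF_j), restricted to points where neither LF abstains.
for i in range(n_lfs):
    for j in range(n_lfs):
        both_voted = (L_train[:, i] != -1) & (L_train[:, j] != -1)
        if both_voted.any():
            agreement[i, j] = (L_train[both_voted, i] == L_train[both_voted, j]).mean()

print(agreement)  # the learned P_mu should roughly reproduce these observed rates
```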

We get the usual goals when setting these hyperparameters: lr needs to be high enough that mu converges reasonably quickly, but low enough that mu doesn’t bounce around wildly; the regularization parameter (l2) should be high enough that mu isn’t reproducing each noisy observed agreement/disagreement rate exactly, but low enough that we’re still fitting the data, and so on.
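As a concrete starting point, here is a minimal sketch of sweeping lr and l2 when fitting the LabelModel (using the snorkel.labeling.model.LabelModel API and the hypothetical L_train above; the specific values are illustrative, not recommendations):

```python
from snorkel.labeling.model import LabelModel

# L_train: (n_points, n_LFs) matrix of LF outputs, -1 for abstain (as above).
for lr in (0.01, 0.001):
    for l2 in (0.0, 0.1):
        label_model = LabelModel(cardinality=2, verbose=False)
        label_model.fit(L_train, n_epochs=500, lr=lr, l2=l2, seed=123)
        probs = label_model.predict_proba(L_train)
        # Without ground truth, inspect how the probabilistic labels shift across
        # settings; if a small labeled dev set exists, label_model.score(L_dev, Y_dev)
        # gives a more direct comparison.
        print(lr, l2, probs.mean(axis=0))
```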

Then the way to relate the learned mu to the generated labels is to think of the learned parameters as conditional accuracies that are used to weight the labeling function outputs. A good fit of mu (neither under- nor overfitting) means the accuracies are right, so the generated probabilistic labels correspond to the right “balance” of LFs. Underfitting usually results in the learned accuracies being roughly equal, so the generated labels are closer to equally-weighted LFs, i.e., majority vote; this under-weights good LFs. Overfitting, on the other hand, places too much weight on particular LFs, favoring them even when their true accuracies don’t warrant it.
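One way to see this in practice is to inspect the learned LF accuracies and compare the label model's predictions against a simple majority vote. A sketch, assuming the label_model and L_train from the snippets above and the LabelModel's get_weights() accessor:

```python
import numpy as np

# Learned conditional accuracies, one per LF (shape: n_LFs).
weights = label_model.get_weights()
print("Estimated LF accuracies:", weights)

# If the accuracies are all roughly equal, the probabilistic labels look like an
# equally-weighted (majority-vote) combination of the LFs; if one LF's accuracy
# is pushed near 1.0, its votes dominate the generated labels.
def majority_vote(L):
    # Per-row majority over non-abstaining LFs; rows with no votes get -1.
    votes = [np.bincount(row[row != -1], minlength=2) for row in L]
    return np.array([v.argmax() if v.sum() > 0 else -1 for v in votes])

mv_preds = majority_vote(L_train)
lm_preds = label_model.predict(L_train, tie_break_policy="abstain")
print("Fraction where label model agrees with majority vote:",
      (mv_preds == lm_preds).mean())
```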

github-actions[bot] commented 2 years ago

This issue is stale because it has been open 90 days with no activity. Remove stale label or comment or this will be closed in 7 days.