snorkel-team / snorkel

A system for quickly generating training data with weak supervision
https://snorkel.org
Apache License 2.0

Concept Question #1726

Closed e-hossam96 closed 10 months ago

e-hossam96 commented 1 year ago

Is it OK to use the labeled development set (the one used to write the labeling functions) as a validation set and train on the entire weakly labeled dataset? Or do I need to split the weakly labeled data into training and validation sets?

cmglaze commented 1 year ago

In general I would maintain a training/validation split as a best practice. You can overfit models to weakly labeled data, and a validation split adds little value for hyperparameter optimization if, e.g., those samples are also in your training data.
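
For concreteness, here's a minimal sketch of that kind of split. The arrays `X` and `probs` are placeholders standing in for your real features and the label model's probabilistic labels, not anything from this thread:

```python
# Minimal sketch: hold out a validation split from the weakly labeled data so
# no validation sample also appears in training.
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 16))          # placeholder features
probs = rng.dirichlet(np.ones(2), 1000)  # placeholder label-model probabilities

X_train, X_valid, probs_train, probs_valid = train_test_split(
    X, probs, test_size=0.2, random_state=0
)
# Train the end model on (X_train, probs_train); tune hyperparameters against
# (X_valid, probs_valid), and keep your gold set for final evaluation.
```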

e-hossam96 commented 1 year ago

So, if I'm getting this right, you suggest sticking to a training/validation split drawn from the weakly labeled data (generated by the label model). Sounds good. I'm just worried that it may not be theoretically sound.

My concern is that this data is still noisy. If I tune hyperparameters while measuring performance on noisy data, I'm likely to end up with poor performance on the gold data.

Aside from that, do you have any extra suggestions or guidelines for training the end model?

cmglaze commented 1 year ago

So on the valid split issue, there are several other basic paths you could take depending on how much gold data you have:

  1. Use some of the gold data for your valid split and no weakly labeled data at all.
  2. Do nested cross-validation (again with no weakly labeled data in the valid split) to maximize the amount of gold data you can use in your valid split while still getting a principled out-of-sample accuracy measure from your test split.
  3. Inspect how well calibrated the label model's confidence scores are using your gold data; if they look well calibrated, continue to use the weakly labeled data but compute your valid-split scores by weighting each sample by label model confidence (see the sketch after this list).

You could also try a combination of the above.
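
Here's a rough sketch of option 3. The arrays are toy placeholders (gold labels, label-model probabilities, end-model predictions), just to show the two steps: a calibration check against gold data, then a confidence-weighted validation score:

```python
# Sketch of option 3: check label-model calibration on gold data, then weight
# valid-split scores by label-model confidence.
import numpy as np
from sklearn.calibration import calibration_curve
from sklearn.metrics import accuracy_score

# Placeholder gold labels and the label model's P(y=1) on the gold set.
y_gold = np.array([0, 1, 1, 0, 1, 0, 1, 1])
probs_gold = np.array([0.2, 0.9, 0.7, 0.3, 0.8, 0.4, 0.6, 0.95])

# Reliability curve: if predicted confidence tracks empirical frequency,
# the label model is reasonably well calibrated.
frac_positive, mean_predicted = calibration_curve(y_gold, probs_gold, n_bins=4)

# Confidence-weighted validation accuracy: each weakly labeled valid sample
# counts in proportion to the label model's confidence in its label.
y_valid_weak = np.array([1, 0, 1, 1])   # label-model hard labels on valid split
y_valid_pred = np.array([1, 0, 0, 1])   # end-model predictions on valid split
confidence = np.array([0.95, 0.6, 0.8, 0.9])
weighted_acc = accuracy_score(y_valid_weak, y_valid_pred, sample_weight=confidence)
```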

cmglaze commented 1 year ago

As far as training your end model goes, I would try optimizing a "noise-aware" version of your objective, where you weight the objective for each possible target value by the associated probability assigned by the label model (this is in the published literature, and an example is here for cross-entropy in classification).
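
A minimal sketch of what that looks like for cross-entropy, assuming a PyTorch end model; `noise_aware_ce`, `logits`, and `soft_labels` are illustrative names, with the soft labels being the label model's class probabilities:

```python
# Noise-aware cross-entropy: weight the log-likelihood of each class by the
# label model's probability for that class, then average over the batch.
import torch
import torch.nn.functional as F

def noise_aware_ce(logits: torch.Tensor, soft_labels: torch.Tensor) -> torch.Tensor:
    log_probs = F.log_softmax(logits, dim=-1)
    return -(soft_labels * log_probs).sum(dim=-1).mean()

# Toy example: 4 samples, 2 classes.
logits = torch.randn(4, 2)                       # end-model outputs
soft_labels = torch.tensor([[0.9, 0.1],
                            [0.2, 0.8],
                            [0.6, 0.4],
                            [0.05, 0.95]])       # label-model probabilities
loss = noise_aware_ce(logits, soft_labels)
```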

In practice we've found that sometimes you're better off simply using thresholded label model predictions (taking only labels above some confidence and ignoring the full probabilities), so you could also treat that threshold as another hyperparameter to optimize.
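
For example, something like the following, where `probs` is a placeholder for the label model's probabilities and `threshold` is the hyperparameter you'd tune:

```python
# Keep only samples where the label model's top probability clears a
# confidence threshold, and train on their hard labels.
import numpy as np

probs = np.array([[0.55, 0.45],
                  [0.95, 0.05],
                  [0.10, 0.90],
                  [0.60, 0.40]])   # placeholder label-model probabilities
threshold = 0.8                    # tune alongside other hyperparameters

confident = probs.max(axis=1) >= threshold
kept_idx = np.where(confident)[0]            # indices of retained samples
hard_labels = probs.argmax(axis=1)[confident]  # thresholded hard labels
```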

Happy to unpack any of that.

e-hossam96 commented 1 year ago

Thank you so much. This is rich info.

I'm currently using a noise-aware loss approach, Active-Passive Loss. I will combine it with a threshold over the label model's confidence scores. Note that I found training with soft labels from the label model tends to mimic the label model's performance rather than generalize to the underlying problem (though I'm not sure about this insight).

Thanks for your help

github-actions[bot] commented 10 months ago

This issue is stale because it has been open 90 days with no activity. Remove stale label or comment or this will be closed in 7 days.