snorkel-team / snorkel

A system for quickly generating training data with weak supervision
https://snorkel.org
Apache License 2.0

Modeling non-independent labeling functions? #1641

Closed moscow25 closed 3 years ago

moscow25 commented 3 years ago

My understanding is that the Snorkel/Google paper mentions modeling covariance between features. However, in the current implementation, as far as I can tell, all features are assumed independent.

https://github.com/snorkel-team/snorkel/blob/ed7771812a0484ad593485dc9e7c67091d483e37/snorkel/labeling/model/label_model.py#L103

Am I misunderstanding, or is there a way to handle labels that are very highly correlated? For example, I may have a classifier that is more accurate for short text than for longer text. At the moment I can't really create "independent" features for "model" and "model_280" (which only applies to longer text), since this skews the bootstrapping of the model.

Please let me know if I am not interpreting this correctly.
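
For readers with a similar setup, one way to see how strongly two labeling functions overlap is to compare their votes directly. The following is a minimal sketch in plain Python, not part of the Snorkel API: `pairwise_agreement` is a hypothetical helper, the vote lists are made-up toy data, and the only Snorkel convention it relies on is that `-1` means "abstain".

```python
ABSTAIN = -1  # Snorkel's convention for "no vote"


def pairwise_agreement(votes_a, votes_b):
    """Return (overlap, agreement) for two labeling functions' votes.

    overlap:   fraction of examples on which both functions vote
    agreement: fraction of those shared examples on which the votes match
    """
    # Keep only the examples where neither function abstains.
    pairs = [
        (a, b)
        for a, b in zip(votes_a, votes_b)
        if a != ABSTAIN and b != ABSTAIN
    ]
    overlap = len(pairs) / len(votes_a) if votes_a else 0.0
    agreement = (
        sum(a == b for a, b in pairs) / len(pairs) if pairs else float("nan")
    )
    return overlap, agreement


# Toy votes (hypothetical): "model" votes on every example,
# "model_280" abstains on short text.
model_votes = [1, 0, 1, 1, 0, 1]
model_280_votes = [-1, -1, 1, 1, 0, 0]

overlap, agreement = pairwise_agreement(model_votes, model_280_votes)
print(f"overlap={overlap:.2f} agreement={agreement:.2f}")
```

High agreement on a large overlap suggests the two functions carry largely redundant signal, which is the situation where the independence assumption is most likely to matter.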

henryre commented 3 years ago

Hi @moscow25, please take a look at this related thread: https://github.com/snorkel-team/snorkel/issues/1596. In this case, you could manually resolve the dependency in the labeling function itself (e.g., by running the shorter model if the text field is below some character length limit). You could also empirically test (e.g., using a hold-out set) whether adding both models as independent labeling functions actually helps performance.
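
The first suggestion, resolving the dependency inside a single labeling function, might look roughly like the sketch below. Everything here is an assumption for illustration: `short_text_model` and `long_text_model` are hypothetical stand-ins for the real classifiers, and the 280-character cutoff is only a guess based on the "model_280" name. In Snorkel itself the function would typically take a data point and be wrapped with the `labeling_function` decorator from `snorkel.labeling`; that is left out so the example runs without Snorkel installed.

```python
ABSTAIN = -1        # Snorkel's convention for "no vote"
LENGTH_LIMIT = 280  # hypothetical cutoff, echoing the "model_280" name


def short_text_model(text):
    """Hypothetical stand-in for the classifier tuned on short text."""
    return 1 if "good" in text.lower() else 0


def long_text_model(text):
    """Hypothetical stand-in for the classifier tuned on longer text."""
    return 1 if text.lower().count("good") >= 2 else 0


def lf_routed_model(text):
    """Single labeling function that routes to one model by text length.

    Each example gets exactly one vote, so the label model never sees
    two strongly correlated votes for the same input.
    """
    if not text:
        return ABSTAIN
    if len(text) < LENGTH_LIMIT:
        return short_text_model(text)
    return long_text_model(text)


short_example = "good"
long_example = "good " + "x" * 300
print(lf_routed_model(short_example), lf_routed_model(long_example))
```

The trade-off is that the routing rule is fixed by hand rather than learned, which is why comparing this against the two-function setup on a hold-out set is still worthwhile.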

moscow25 commented 3 years ago

Thanks @henryre. I appreciate the link to #1596. Empirically, the function works OK, and it's good to know there's a formula one could implement from the paper.

github-actions[bot] commented 3 years ago

This issue is stale because it has been open 90 days with no activity. Remove stale label or comment or this will be closed in 7 days.