moscow25 closed this issue 3 years ago
Hi @moscow25, please take a look at this related thread: https://github.com/snorkel-team/snorkel/issues/1596. In this case, you could manually resolve the dependency in the labeling function itself (e.g., by running the shorter model only when the text field is below some character length limit), and you could also empirically test (e.g., using a hold-out set) whether adding both models as independent labeling functions actually helps performance.
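For concreteness, here's a minimal sketch of what such a length-gated labeling function could look like. The classifier objects, the `x.text` field, and the 280-character cutoff are illustrative assumptions, not anything prescribed by Snorkel:

```python
from snorkel.labeling import labeling_function

CHAR_LIMIT = 280  # assumed cutoff between "short" and "long" text


class _StubClassifier:
    """Stand-in for a trained text classifier; replace with real models."""

    def __init__(self, label):
        self.label = label

    def predict(self, texts):
        return [self.label for _ in texts]


short_model = _StubClassifier(label=1)  # hypothetical short-text model
long_model = _StubClassifier(label=0)   # hypothetical long-text model


@labeling_function()
def lf_length_gated(x):
    # Resolve the dependency inside a single LF: only the model appropriate
    # for this example's length ever votes, so the two models never show up
    # as separate, highly correlated label sources.
    if len(x.text) <= CHAR_LIMIT:
        return short_model.predict([x.text])[0]
    return long_model.predict([x.text])[0]
```

Applied with `PandasLFApplier`, this produces a single column in the label matrix instead of two overlapping ones.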
Thanks @henryre, I appreciate the link to #1596. Empirically, the function works OK; good to know there's a formula from the paper one could implement.
My understanding is that the Snorkel/Google paper discusses modeling feature covariance. However, in the current implementation, as far as I can tell, all features are assumed independent.
https://github.com/snorkel-team/snorkel/blob/ed7771812a0484ad593485dc9e7c67091d483e37/snorkel/labeling/model/label_model.py#L103
Do I misunderstand, or is there a way of handling labels that are very highly correlated? For example, I may have a classifier that is more accurate for short text than for longer text. At the moment I can't really create "independent" features for "model" and "model_280", where the latter only applies to longer text, since this skews the bootstrapping of the model.
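To illustrate the setup I mean (the names, the `x.text` field, and the 280-character threshold are just for the example), the two LFs below would be treated by the current `LabelModel` as conditionally independent even though `lf_model_280` is essentially a gated copy of `lf_model`:

```python
from snorkel.labeling import labeling_function

ABSTAIN = -1


class _StubClassifier:
    """Placeholder for the trained classifier in this example."""

    def predict(self, texts):
        return [1 for _ in texts]


model = _StubClassifier()  # hypothetical classifier, more accurate on short text


@labeling_function()
def lf_model(x):
    # The classifier applied to the full text of every example.
    return model.predict([x.text])[0]


@labeling_function()
def lf_model_280(x):
    # The same classifier applied to only the first 280 characters, voting
    # only on longer text. Whenever both LFs vote, they are running the same
    # model on overlapping text, so their columns in the label matrix are
    # highly correlated rather than independent.
    if len(x.text) > 280:
        return model.predict([x.text[:280]])[0]
    return ABSTAIN
```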
Please let me know if I'm not interpreting this correctly.