Is it OK to use the labeling functions' development labeled dataset as the validation set and train on the entire weakly labeled data? Or do I need to split the weakly labeled data into training and validation sets?
In general, I would maintain a training/validation split as a best practice. You can overfit models to weakly labeled data, and a validation split adds little value for hyperparameter optimization if those samples are also in your training data.
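For instance, a minimal sketch of holding out a validation split from the weakly labeled data (all names and sizes below are toy stand-ins, not from this thread):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy stand-ins: 1000 examples, 16 features, label-model probabilities over 3 classes.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 16))
y_weak = rng.dirichlet(np.ones(3), size=1000)  # soft labels from the label model

# Hold out 20% of the weakly labeled data for validation, stratified by the
# label model's most likely class so class balance is preserved.
X_train, X_val, y_train, y_val = train_test_split(
    X, y_weak, test_size=0.2, random_state=42, stratify=y_weak.argmax(axis=1)
)
```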
So, if I'm getting it right, you suggest sticking to a training/validation split drawn from the weakly labeled data (generated by the label model). Sounds good. I'm just worried that it isn't theoretically sound.
My problem is that this data is still noisy, so if I tune hyperparameters while measuring performance on noisy data, I will likely end up with poor performance on the gold data.
Aside from that, do you have any extra suggestions or guidelines for training the end model?
So on the validation split issue, there are several other basic paths you could take, depending on how much gold data you have:
You could also try a combination of the above.
As far as training your end model goes, I would try optimizing a "noise-aware" version of your objective, where you weight the objective for each possible target value by the probabilities assigned by the label model (this is in the published literature, and an example exists for cross-entropy in classification).
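A minimal PyTorch sketch of that noise-aware objective, assuming a classification setting where the label model emits a probability vector per example (the function name is my own, not from any library):

```python
import torch
import torch.nn.functional as F

def noise_aware_cross_entropy(logits: torch.Tensor, soft_labels: torch.Tensor) -> torch.Tensor:
    """Expected cross-entropy under the label model's class probabilities.

    logits:      (batch, n_classes) raw end-model outputs
    soft_labels: (batch, n_classes) probabilities from the label model
    """
    log_probs = F.log_softmax(logits, dim=-1)
    # Weight each class's log-likelihood by the label model's probability for it.
    return -(soft_labels * log_probs).sum(dim=-1).mean()
```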
In practice, we've found that you're sometimes better off simply using thresholded label model predictions (keeping only labels above some confidence level rather than using the full probabilities), so you could also treat that threshold as another hyperparameter to optimize.
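And a sketch of the thresholding step, with the cutoff exposed as the hyperparameter mentioned above (names and the default value are illustrative):

```python
import numpy as np

def threshold_label_model(probs: np.ndarray, cutoff: float = 0.8):
    """Keep only examples whose top label-model probability exceeds `cutoff`.

    probs: (n_examples, n_classes) probabilities from the label model.
    Returns (kept_indices, hard_labels) for standard supervised training.
    """
    confidence = probs.max(axis=1)
    keep = confidence > cutoff
    return np.flatnonzero(keep), probs[keep].argmax(axis=1)
```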
Happy to unpack any of that.
Thank you so much. This is rich info.
I'm currently using the Active-Passive Loss as the noise-aware loss. I will combine it with a threshold on the label model's confidence scores. Note that I found training with soft labels from the label model tends to mimic the label model's performance rather than generalize to the underlying problem (though I'm not sure about this insight).
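For reference, a rough sketch of one common Active-Passive Loss instantiation from the noisy-labels literature (normalized cross-entropy as the active term, reverse cross-entropy as the passive term); the weights `alpha`, `beta` and the clamp value `A` are illustrative, not the settings used here:

```python
import torch
import torch.nn.functional as F

def active_passive_loss(logits, labels, alpha=1.0, beta=1.0, A=-4.0):
    """Active-Passive Loss sketch: NCE (active term) + RCE (passive term).

    logits: (batch, n_classes) end-model outputs; labels: (batch,) hard labels.
    """
    probs = F.softmax(logits, dim=-1).clamp(min=1e-7)
    log_probs = probs.log()
    one_hot = F.one_hot(labels, num_classes=logits.size(-1)).float()

    # Normalized cross-entropy: CE on the target class, normalized by the
    # summed CE over all classes (the "active" term).
    nce = (one_hot * -log_probs).sum(dim=-1) / (-log_probs).sum(dim=-1)

    # Reverse cross-entropy: cross-entropy with prediction and label swapped,
    # where log(0) on non-target classes is clamped to A; this simplifies
    # to -A * (1 - p_target) (the "passive" term).
    rce = -A * (1.0 - (probs * one_hot).sum(dim=-1))

    return (alpha * nce + beta * rce).mean()
```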
Thanks for your help