Closed DavidKoleczek closed 3 years ago
Great idea! Similar functionality exists in the get_label_buckets method under snorkel/analysis/error_analysis: https://github.com/HazyResearch/snorkel/blob/e316d5700cbfd2243c0d5485537ef310fc0e7a1e/snorkel/analysis/error_analysis.py#L9. To use it, you would pass a gold labels vector and an LF labels vector. It returns a dictionary of buckets keyed by (gold label, LF label) pairs, each holding the indices of the data points with that pair of labels. The buckets where the two labels differ give you the indices of the data points where the LF was incorrect. If you wanted to submit a PR that wraps that method and has the functionality you described, that'd be great! You could likely stick it in that same error_analysis file.
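To make the bucket idea concrete, here is a minimal sketch in plain Python. The helper label_buckets is a hypothetical stand-in that mimics the bucketing described above so the example runs without Snorkel installed; with Snorkel you would call get_label_buckets on NumPy arrays instead. Treating -1 as an abstain that does not count as incorrect is an assumption based on Snorkel's usual convention.

```python
from collections import defaultdict

ABSTAIN = -1  # assumed: Snorkel's convention for "the LF did not vote"


def label_buckets(*label_vectors):
    """Hypothetical stand-in for get_label_buckets: group indices by label tuple."""
    if len({len(v) for v in label_vectors}) != 1:
        raise ValueError("All label vectors must have the same length")
    buckets = defaultdict(list)
    for i, labels in enumerate(zip(*label_vectors)):
        buckets[tuple(labels)].append(i)
    return dict(buckets)


def incorrect_indices(y_gold, y_lf):
    """Collect indices where the LF voted and its vote disagrees with the gold label."""
    wrong = []
    for (gold, vote), indices in label_buckets(y_gold, y_lf).items():
        if vote != ABSTAIN and vote != gold:
            wrong.extend(indices)
    return sorted(wrong)


# Toy data: five data points, one labeling function.
y_gold = [1, 0, 1, 0, 1]
y_lf = [1, 1, -1, 0, 0]

buckets = label_buckets(y_gold, y_lf)
wrong = incorrect_indices(y_gold, y_lf)
print(buckets)
print(wrong)
```

Each key is a (gold, LF) pair, so a single bucket such as (0, 1) isolates one specific kind of mistake, while the union of all disagreeing buckets gives every incorrect instance for that LF.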
This issue is stale because it has been open 90 days with no activity. Remove stale label or comment or this will be closed in 7 days.
Is your feature request related to a problem? Please describe.
When using LFAnalysis and the lf_summary method, I often find myself wondering what the incorrect instances for a particular labeling function actually are. It would be useful to have a way to return all the incorrectly labeled instances for a particular LF, or optionally a sample of the incorrect instances.
Describe the solution you'd like
A new method added to LFAnalysis. This could be called lf_incorrect. It would need to take in your data_points and the corresponding Y. It would then return the instances from data_points where the LF's label does not match Y. Since all the other lf methods work for each LF, I think this could return a dictionary mapping LF names to their incorrectly labeled instances. If large datasets with a lot of incorrect instances are a concern, I could add an optional parameter max_instances to cap the number of instances returned.
Additional context
This is something I would be looking to submit a PR for.
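As a rough illustration of the proposed behavior, the sketch below implements lf_incorrect as a standalone function rather than an LFAnalysis method. Everything about its signature is an assumption for illustration: the label matrix L is passed as a list of rows (one column per LF), lf_names and seed are hypothetical parameters, and abstains (-1) are not counted as incorrect.

```python
import random

ABSTAIN = -1  # assumed: Snorkel's convention for "the LF did not vote"


def lf_incorrect(L, Y, data_points, lf_names, max_instances=None, seed=0):
    """Map each LF name to the data points it labeled incorrectly.

    L is a list of rows, one per data point, with one column per LF.
    If max_instances is set, a random sample of at most that many is returned.
    """
    if not (len(L) == len(Y) == len(data_points)):
        raise ValueError("L, Y and data_points must have the same length")
    rng = random.Random(seed)  # seeded so sampling is reproducible
    result = {}
    for j, name in enumerate(lf_names):
        wrong = [
            data_points[i]
            for i, row in enumerate(L)
            if row[j] != ABSTAIN and row[j] != Y[i]
        ]
        if max_instances is not None and len(wrong) > max_instances:
            wrong = rng.sample(wrong, max_instances)
        result[name] = wrong
    return result


# Toy data: four data points, two labeling functions.
L = [[1, -1], [1, 0], [0, 0], [-1, 1]]
Y = [1, 0, 1, 0]
data_points = ["a", "b", "c", "d"]

incorrect = lf_incorrect(L, Y, data_points, ["lf_x", "lf_y"])
capped = lf_incorrect(L, Y, data_points, ["lf_x", "lf_y"], max_instances=1)
print(incorrect)
```

Returning a dictionary keyed by LF name matches the per-LF shape of the other lf methods, and sampling only when the list exceeds max_instances keeps small result sets complete.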