LFAnalysis - Get Incorrect Instances

DavidKoleczek commented 4 years ago

Is your feature request related to a problem? Please describe.

When using LFAnalysis and its lf_summary method, I often find myself wondering which instances a particular labeling function actually got wrong. It would be useful to have a way to return all of the incorrectly labeled instances for a particular LF, or optionally a sample of them.

Describe the solution you'd like

A new method added to LFAnalysis, which could be called lf_incorrect. It would take your data points and the corresponding gold labels Y, and return the data points whose LF label does not match Y. Since all the other lf methods report results for each LF, I think this could return a dictionary mapping LF names to their incorrectly labeled instances. If large datasets with many incorrect instances are a concern, I could add an optional parameter “max_instances” to cap how many instances are returned.
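To make the proposal concrete, here is a rough sketch of what such a helper might look like. Everything in it is hypothetical: the name lf_incorrect, its signature, and the choice to skip abstentions (assumed to be encoded as -1) rather than count them as errors are illustrative assumptions, not an existing Snorkel API. It is written in plain Python so it runs without any dependencies.

```python
from typing import Any, Dict, List, Optional, Sequence

ABSTAIN = -1  # assumed encoding for "the LF did not vote"


def lf_incorrect(
    data_points: Sequence[Any],
    L: Sequence[Sequence[int]],
    Y: Sequence[int],
    lf_names: Sequence[str],
    max_instances: Optional[int] = None,
) -> Dict[str, List[Any]]:
    """Map each LF name to the data points it labeled differently from Y.

    Hypothetical helper, not part of Snorkel. L[i][j] is the label that
    LF j assigned to data point i. Abstentions are skipped, not counted
    as errors.
    """
    incorrect: Dict[str, List[Any]] = {name: [] for name in lf_names}
    for j, name in enumerate(lf_names):
        for i, x in enumerate(data_points):
            # Stop early once the optional cap for this LF is reached.
            if max_instances is not None and len(incorrect[name]) >= max_instances:
                break
            vote = L[i][j]
            if vote != ABSTAIN and vote != Y[i]:
                incorrect[name].append(x)
    return incorrect


# Toy data: 4 data points, 2 labeling functions.
data_points = ["free money", "see you at 5", "win a prize", "lunch?"]
Y = [1, 0, 1, 0]
L = [
    [1, -1],
    [1, 0],
    [-1, 0],
    [0, 1],
]
result = lf_incorrect(data_points, L, Y, ["lf_keyword", "lf_short"])
# result == {"lf_keyword": ["see you at 5"], "lf_short": ["win a prize", "lunch?"]}
```

Whether abstentions should count as incorrect is a design choice that would need to be settled in the PR; the sketch excludes them so that only actual wrong votes are returned.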

Additional context

This is something I would be looking to submit a PR for.

bhancock8 commented 4 years ago

Great idea! Similar functionality exists in the get_label_buckets method under snorkel/analysis/error_analysis: https://github.com/HazyResearch/snorkel/blob/e316d5700cbfd2243c0d5485537ef310fc0e7a1e/snorkel/analysis/error_analysis.py#L9. To use it, you would pass a vector of gold labels and a vector of LF labels; it returns the different error buckets, which you could pull from to get the indices of the data points where the LF was incorrect. If you wanted to submit a PR that wraps that method and has the functionality you described, that'd be great! You could likely stick it in that same error_analysis file.
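The bucket-based approach described above could look roughly like the sketch below. To keep it runnable without Snorkel installed, it uses a small stand-in function, label_buckets, that groups data point indices by their (gold label, LF label) pair. The stand-in, the incorrect_indices wrapper, and the -1 abstain encoding are all assumptions made for illustration; they are not Snorkel's implementation, and the real get_label_buckets may differ in its return types.

```python
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple


def label_buckets(*label_vectors: Sequence[int]) -> Dict[Tuple[int, ...], List[int]]:
    """Group data point indices by their tuple of labels.

    Illustrative stand-in only, not Snorkel's get_label_buckets.
    """
    buckets: Dict[Tuple[int, ...], List[int]] = defaultdict(list)
    for i, labels in enumerate(zip(*label_vectors)):
        buckets[tuple(labels)].append(i)
    return dict(buckets)


def incorrect_indices(
    y_gold: Sequence[int], y_lf: Sequence[int], abstain: int = -1
) -> List[int]:
    """Collect indices from every bucket where the LF voted and was wrong.

    Hypothetical wrapper; the name and signature are assumptions.
    """
    wrong: List[int] = []
    for (gold, vote), indices in label_buckets(y_gold, y_lf).items():
        if vote != abstain and vote != gold:
            wrong.extend(indices)
    return sorted(wrong)


y_gold = [1, 0, 1, 0, 1]
y_lf = [1, 1, -1, 1, 0]  # one LF's votes; -1 means abstain

buckets = label_buckets(y_gold, y_lf)
# buckets[(0, 1)] == [1, 3]: gold label 0, LF voted 1
wrong = incorrect_indices(y_gold, y_lf)
# wrong == [1, 3, 4]
```

The returned indices can then be used to look up the corresponding data points, for example by indexing into the DataFrame or list that the label vectors were computed from.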

github-actions[bot] commented 3 years ago

This issue is stale because it has been open 90 days with no activity. Remove stale label or comment or this will be closed in 7 days.
