snorkel-team / snorkel

A system for quickly generating training data with weak supervision
https://snorkel.org
Apache License 2.0
5.81k stars 857 forks

Candidate and Gold Labels not being matched #715

Closed varun-tandon closed 7 years ago

varun-tandon commented 7 years ago

Hi Snorkel Team,

I am one of the interns at the Canary Center working with Gautam and Dr. Mallick on MarkerVille, and I am having trouble loading gold labels. We discussed debugging methods with @stephenbach at the last OH, and we are still seeing a strange issue: our gold labels contain the StableLabel for a particular extracted candidate, yet Snorkel labels this candidate as -1 and does not appear to match the gold label to the extracted candidate.

We are loading external annotations (which are already formatted as StableLabels and in a TSV file) in a manner similar to the intro tutorial using the same utils.py file, like so:

    from util import load_external_labels
    %time load_external_labels(session, BiomarkerCondition, annotator_name='gold')

and we then load the gold labels like so:

    from snorkel.annotations import load_gold_labels
    L_gold_dev = load_gold_labels(session, annotator_name='gold', split=1)
    print L_gold_dev

This provides an output of a numpy array with all elements labelled as -1.

Viewing an individual candidate like so:

    print L_gold_dev.get_candidate(session,x)[0].get_stable_id()
    print L_gold_dev.get_candidate(session,x)[1].get_stable_id()

provides the following output:

    28262798::span:613:617
    28262798::span:632:643

And we have confirmed that these StableLabels exist in our external annotations and are labelled as positive 1.

We would greatly appreciate any help you could provide us regarding why Snorkel does not seem to match the external gold label with the extracted candidate and/or any further debugging steps we should take. Please let me know if any more information is needed!

Thanks!

varun-tandon commented 7 years ago

Fixed by deleting snorkel.db

stephenbach commented 7 years ago

Thanks Varun! A little more detail in case anyone else encounters this. A lot of the label-loading code is nondestructive (since labeled data is so valuable!), so it won't replace existing labels in the database. The issue was that '-1' labels were already stored from a previous run, so the labels loaded afterwards did not overwrite them.
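
For anyone who wants to see the effect in isolation, here is a minimal sketch of what a nondestructive load looks like. It is not Snorkel's actual code: it uses Python's built-in sqlite3 with a made-up `gold_label` table, a hypothetical `load_labels_nondestructive` helper, and an illustrative key format, purely to show how a stale -1 can survive a reload until the stored rows are cleared.

```python
import sqlite3


def load_labels_nondestructive(conn, labels):
    """Insert (stable_id, value) pairs, skipping keys that already have a row.

    Hypothetical helper for illustration only; not part of Snorkel.
    """
    conn.executemany(
        "INSERT OR IGNORE INTO gold_label (stable_id, value) VALUES (?, ?)",
        labels,
    )
    conn.commit()


def get_label(conn, stable_id):
    """Return the stored value for a key, or None if no row exists."""
    row = conn.execute(
        "SELECT value FROM gold_label WHERE stable_id = ?", (stable_id,)
    ).fetchone()
    return row[0] if row else None


# In-memory database standing in for snorkel.db (assumed schema).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE gold_label (stable_id TEXT PRIMARY KEY, value INTEGER)"
)

# Illustrative key built from the two span IDs in this thread.
key = "28262798::span:613:617~~28262798::span:632:643"

# An earlier run stored -1 for this candidate.
load_labels_nondestructive(conn, [(key, -1)])

# Reloading the corrected annotation does not replace the existing row.
load_labels_nondestructive(conn, [(key, 1)])
stale = get_label(conn, key)

# Clearing the stored labels (analogous to deleting snorkel.db) and
# reloading lets the positive label through.
conn.execute("DELETE FROM gold_label")
conn.commit()
load_labels_nondestructive(conn, [(key, 1)])
fresh = get_label(conn, key)

print(stale, fresh)
```

In this sketch the second load is silently skipped because the key already exists, which mirrors the symptom above: the external file says 1, the database keeps -1, and only a clean start picks up the new value.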