Writing "Good" Labeling functions

danich1 commented 7 years ago

Our aim is to generate useful labeling functions from a given set of candidate sentences provided below:

dhimmel commented 7 years ago

We'll know more about this issue once we start writing labeling functions.

Some background: for each relationship type, we'll start with a knowledge base (gold standard) of known relationships. This will generally be a relationship type from Hetionet. So the first two labeling functions will be:

return 1 if the relationship is in the gold standard
return -1 if the relationship is not in the gold standard
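As a minimal sketch, those two functions could look something like the following in plain Python. The `GOLD_STANDARD` set, the `(disease, gene)` pair representation, and the function names are all assumptions made for illustration, and the entry in the set is a placeholder rather than real Hetionet data. Returning 0 to abstain when a function has nothing to say is also an assumption about the label convention.

```python
# Hypothetical gold standard: a set of (disease, gene) pairs.
# The entry below is a placeholder, not real Hetionet data.
GOLD_STANDARD = {("epilepsy", "STXBP1")}

ABSTAIN = 0  # assumed convention: 0 means "no vote"


def lf_in_gold_standard(disease, gene, gold=GOLD_STANDARD):
    """Vote +1 when the pair is a known relationship, otherwise abstain."""
    return 1 if (disease, gene) in gold else ABSTAIN


def lf_not_in_gold_standard(disease, gene, gold=GOLD_STANDARD):
    """Vote -1 when the pair is not a known relationship, otherwise abstain."""
    return -1 if (disease, gene) not in gold else ABSTAIN
```

Splitting the positive and negative votes into two functions means each one can be weighted separately, which matters if absence from the gold standard turns out to be a weaker signal than presence.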

Then we will have to make additional labeling functions to refine the classifier. We're hoping to parallelize this task to some degree, i.e. everyone involved can submit additional labeling functions. So we'll have to develop a framework that allows anyone to submit labeling functions.
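One possible shape for such a framework, sketched under assumptions: contributors decorate their function to add it to a shared registry, and a helper runs every registered function on a candidate. The names `labeling_function`, `LABELING_FUNCTIONS`, `apply_all`, and the example keyword function are hypothetical and not part of Snorkel.

```python
# Hypothetical registry mapping a labeling function's name to the function.
LABELING_FUNCTIONS = {}


def labeling_function(func):
    """Register a labeling function, rejecting duplicate names."""
    if func.__name__ in LABELING_FUNCTIONS:
        raise ValueError("duplicate labeling function: " + func.__name__)
    LABELING_FUNCTIONS[func.__name__] = func
    return func


@labeling_function
def lf_mentions_duplication(sentence):
    """Illustrative only: vote -1 if the sentence mentions a duplication."""
    return -1 if "duplication" in sentence.lower() else 0


def apply_all(sentence):
    """Run every registered labeling function on one candidate sentence."""
    return {name: lf(sentence) for name, lf in LABELING_FUNCTIONS.items()}
```

With this layout, a submission is just a file of decorated functions, and name collisions between contributors fail loudly at import time.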

And it's our impression that Snorkel will be able to evaluate the quality of each labeling function? If so, it's not the end of the world if some of our labeling functions are imperfect.
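To make "quality" a bit more concrete, here is a rough sketch of two descriptive statistics we could compute ourselves from the votes: coverage (how often a function votes at all) and conflict (how often its vote disagrees with another function's vote). This is not Snorkel's own model, just a hand-rolled illustration. The `summarize` function and the matrix layout (rows are candidates, columns are labeling functions, 0 means abstain) are assumptions.

```python
def summarize(label_matrix):
    """Return coverage and conflict rates for each labeling function.

    label_matrix: list of rows, one per candidate; each row holds one
    vote per labeling function, with values in {-1, 0, 1} (0 = abstain).
    """
    n_candidates = len(label_matrix)
    n_lfs = len(label_matrix[0])
    stats = []
    for j in range(n_lfs):
        voted = 0
        conflicted = 0
        for row in label_matrix:
            if row[j] == 0:
                continue  # this function abstained on this candidate
            voted += 1
            # A conflict is any other non-abstaining vote that differs.
            if any(v != 0 and v != row[j] for k, v in enumerate(row) if k != j):
                conflicted += 1
        stats.append({
            "coverage": voted / n_candidates,
            "conflict": conflicted / n_candidates,
        })
    return stats
```

A function with high coverage and low conflict is probably safe to keep. One with high conflict is worth a second look before we trust it.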

danich1 commented 7 years ago

Below are examples of the desired and undesired Disease-Gene candidate relationships we will be working with.

Good Example:

PATIENT: We describe a male infant with early infantile epileptic encephalopathy with suppression-burst (Ohtahara syndrome) who carried a de novo 2.0-Mb microdeletion in chromosome 9q33q34, including STXBP1.

In the quote above, the Disease-Gene candidate relation is in bold. This is a good example because the relationship is in our gold standard list, so it would receive a +1.

Bad Example:

Xq28, which includes MECP2 is the major locus for submicroscopic X-chromosome duplications, whereas duplications in Xq25 and Xq26 have been reported in only a few cases.

This is a bad example because the disease in this context has nothing to do with epilepsy, so this relationship would receive a -1.

greenelab / snorkeling

Writing "Good" Labeling functions #8
