Open danich1 opened 7 years ago
We'll know more about this issue once we start writing labeling functions.
Some background: for each relationship type, we'll start with a knowledge base (gold standard) of known relationships. This will generally be a relationship type from Hetionet. So the first two labeling functions will be:
- return 1 if the relationship is in the gold standard
- return -1 if the relationship is not in the gold standard

Then we will have to make additional labeling functions to refine the classifier. We're hoping to parallelize this task to some degree, i.e. everyone involved can submit additional labeling functions. So we'll have to develop a framework that allows anyone to submit labeling functions.
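As a rough illustration of the two gold-standard labeling functions and of a way contributors might submit their own, here is a minimal sketch in plain Python. Everything named here (`Candidate`, `GOLD_STANDARD`, the `labeling_function` decorator, `apply_labeling_functions`) is hypothetical and not part of Snorkel or Hetionet; the sketch also assumes that 0 means "abstain" and that a relationship is keyed by a (disease, gene) pair.

```python
from typing import Callable, List, NamedTuple, Set, Tuple


class Candidate(NamedTuple):
    """Hypothetical stand-in for a Disease-Gene candidate found in a sentence."""
    disease: str
    gene: str
    sentence: str


# Toy gold standard; in practice this would be loaded from a Hetionet
# relationship type. The single pair below is only a placeholder.
GOLD_STANDARD: Set[Tuple[str, str]] = {("Ohtahara syndrome", "STXBP1")}

# Assumed label convention: 1 = positive, -1 = negative, 0 = abstain.
POSITIVE, NEGATIVE, ABSTAIN = 1, -1, 0

# Registry that collects every submitted labeling function.
LABELING_FUNCTIONS: List[Callable[[Candidate], int]] = []


def labeling_function(fn: Callable[[Candidate], int]) -> Callable[[Candidate], int]:
    """Decorator that registers a labeling function so anyone can contribute one."""
    LABELING_FUNCTIONS.append(fn)
    return fn


@labeling_function
def lf_in_gold_standard(c: Candidate) -> int:
    # Return 1 if the relationship is in the gold standard, else abstain.
    return POSITIVE if (c.disease, c.gene) in GOLD_STANDARD else ABSTAIN


@labeling_function
def lf_not_in_gold_standard(c: Candidate) -> int:
    # Return -1 if the relationship is not in the gold standard, else abstain.
    return NEGATIVE if (c.disease, c.gene) not in GOLD_STANDARD else ABSTAIN


def apply_labeling_functions(c: Candidate) -> List[int]:
    """Collect one vote per registered labeling function for a candidate."""
    return [lf(c) for lf in LABELING_FUNCTIONS]


known = Candidate("Ohtahara syndrome", "STXBP1", "toy sentence")
unknown = Candidate("Ohtahara syndrome", "GENE_NOT_IN_KB", "toy sentence")
print(apply_labeling_functions(known))
print(apply_labeling_functions(unknown))
```

With a registry like this, a new contribution is just another decorated function, which keeps submissions independent of one another.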
Our impression is that Snorkel will be able to evaluate the quality of each labeling function, so it's not the end of the world if some of our labeling functions are imperfect.
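To make "quality of a labeling function" concrete, the sketch below computes two simple per-function statistics from a matrix of votes: coverage (how often the function does not abstain) and accuracy on a small hand-labeled set. This is not Snorkel's API and not how Snorkel itself estimates accuracies; the function name `summarize_labeling_functions`, the toy matrix, and the 0 = abstain convention are all assumptions made for illustration.

```python
from typing import Dict, List

ABSTAIN = 0  # assumed convention: 1 = positive, -1 = negative, 0 = abstain


def summarize_labeling_functions(
    label_matrix: List[List[int]], gold: List[int]
) -> List[Dict[str, float]]:
    """Return coverage and accuracy for each labeling function (column).

    label_matrix[i][j] is the vote of labeling function j on candidate i;
    gold[i] is the hand-assigned label of candidate i.
    """
    n_candidates = len(label_matrix)
    n_lfs = len(label_matrix[0]) if n_candidates else 0
    summary = []
    for j in range(n_lfs):
        # Keep only the candidates on which this function actually voted.
        votes = [
            (row[j], y) for row, y in zip(label_matrix, gold) if row[j] != ABSTAIN
        ]
        coverage = len(votes) / n_candidates
        accuracy = (
            sum(v == y for v, y in votes) / len(votes) if votes else float("nan")
        )
        summary.append({"coverage": coverage, "accuracy": accuracy})
    return summary


# Toy data: 4 candidates (rows) by 3 labeling functions (columns).
toy_matrix = [
    [1, 0, 1],
    [0, -1, 1],
    [1, 0, -1],
    [0, -1, -1],
]
toy_gold = [1, -1, 1, -1]

stats = summarize_labeling_functions(toy_matrix, toy_gold)
for j, s in enumerate(stats):
    print(j, s)
```

In this toy example the third function votes on every candidate but is right only half the time, which is the kind of imperfect function we would want to be down-weighted.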
Below are examples of the desired and undesired Disease-Gene candidate relationships we will be working with.
PATIENT: We describe a male infant with early infantile epileptic encephalopathy with suppression-burst (Ohtahara syndrome) who carried a de novo 2.0-Mb microdeletion in chromosome 9q33q34, including STXBP1.
In the quote above, the Disease-Gene candidate relation is in bold. This is a good example because the relationship is in our gold standard list, so it would receive a +1.
Xq28, which includes MECP2 is the major locus for submicroscopic X-chromosome duplications, whereas duplications in Xq25 and Xq26 have been reported in only a few cases.
This is a bad example because the disease in this context has nothing to do with epilepsy, so this relationship would receive a -1.
Our aim is to generate useful labeling functions from the set of candidate sentences provided below: