monarch-initiative / dipper

Data Ingestion Pipeline for Monarch
https://dipper.readthedocs.io/en/latest/
BSD 3-Clause "New" or "Revised" License

retype "unknown" gene classes #36

Closed by nlwashington 9 years ago

nlwashington commented 9 years ago

Some of the "genes" in the ncbi gene source indicate that their type is "unknown". right now, i default to "SO:gene", but this should probably just default to "SO:genomic_feature" (as in, make it more generic). the assumption that these are genes was incorrect because some of the unknowns are actually in a "phenotype only" class, but that information is not provided in the file.
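A minimal sketch of the proposed default, not dipper's actual code: the function name, the mapping table, and the `protein-coding` and `pseudo` entries are hypothetical, and the class strings are the shorthand labels used in this thread rather than resolved SO identifiers. Anything typed "unknown", or not listed in the table, falls back to the generic class instead of being asserted as a gene.

```python
# Hypothetical mapping from the NCBI type column to a class label.
# Labels are the shorthand used in this thread, not resolved SO IDs.
GENE_TYPE_TO_CLASS = {
    "protein-coding": "SO:gene",
    "pseudo": "SO:pseudogene",
}

# Generic fallback proposed in the issue for records typed "unknown".
GENERIC_CLASS = "SO:genomic_feature"


def class_for_gene_type(type_of_gene: str) -> str:
    """Return the class label for a record's type, defaulting to the generic class."""
    key = type_of_gene.strip().lower()
    # "unknown" is deliberately absent from the table, so it takes the fallback.
    return GENE_TYPE_TO_CLASS.get(key, GENERIC_CLASS)
```

With this shape, an "unknown" record is never counted as a gene unless a later, more specific source says so.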

cmungall commented 9 years ago

Is it strictly wrong to model as genes? The phenotype-only class could be construed as representing a gene with an equivalence to an as-yet unknown genomically characterized gene. Or perhaps some may turn out to be regulatory regions etc?

By modeling as genes, we inflate gene counts, which could be confusing.

nlwashington commented 9 years ago

this was spurred because of this: http://www.ncbi.nlm.nih.gov/gene/100885788

but it is clearly a disease with a known causative gene (or in this case a locus). i have wavered on what to do here, because there are many things that are classified as unknowns in NCBI, but i basically think it's their catchall for "we haven't reviewed these".

but making this one a gene ends up contaminating the graph... this node ends up being a disease and a gene, but it definitely isn't.
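One way to spot this kind of contamination is to scan the type assertions for nodes that carry two classes that should never co-occur. The sketch below assumes the assertions are available as plain `(node, type)` pairs; the function name, the pair representation, and the "disease" label are hypothetical and not taken from dipper.

```python
from collections import defaultdict


def find_type_conflicts(type_assertions, disjoint_pairs):
    """Return {node: [clashing pairs]} for nodes typed with both members of a disjoint pair."""
    types_by_node = defaultdict(set)
    for node, node_type in type_assertions:
        types_by_node[node].add(node_type)

    conflicts = {}
    for node, types in types_by_node.items():
        # A pair clashes when both of its members were asserted for this node.
        clashes = [pair for pair in disjoint_pairs if set(pair) <= types]
        if clashes:
            conflicts[node] = clashes
    return conflicts


# Illustrative data only: the second node ID is a placeholder.
assertions = [
    ("NCBIGene:100885788", "SO:gene"),
    ("NCBIGene:100885788", "disease"),
    ("NCBIGene:PLACEHOLDER", "SO:gene"),
]
conflicts = find_type_conflicts(assertions, [("SO:gene", "disease")])
```

Running a check like this after an ingest would flag the record above without anyone having to notice it by hand.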

selewis commented 9 years ago

Hmm, SO doesn't include "locus" as a term. The nearest I could identify was linkage_group, whose definition matches the definition of "locus".

(In reply to Nicole Washington's comment of Wed, Jan 21, 2015 at 2:30 PM, quoted above.)

nlwashington commented 9 years ago

fyi, i refactored the pg fetcher in 31aac26; this runs faster, esp when the data is already downloaded.