Closed nlwashington closed 9 years ago
Is it strictly wrong to model as genes? The phenotype-only class could be construed as representing a gene with an equivalence to an as-yet unknown genomically characterized gene. Or perhaps some may turn out to be regulatory regions etc?
By modeling as genes, we inflate gene counts, which could be confusing
this was spurred because of this: http://www.ncbi.nlm.nih.gov/gene/100885788
but it is clearly a disease with a known causative gene (or in this case a locus). i have wavered on what to do here, because there are many things that are classified as unknowns in NCBI, but i basically think it's their catchall for "we haven't reviewed these".
but making this one a gene ends up contaminating the graph... this node ends up being a disease and a gene, but it definitely isn't.
Hmm, SO doesn't include "locus" as a term. The nearest I could identify was linkage_group, whose definition matches the definition of "locus".
fyi, i refactored the pg fetcher in 31aac26; this runs faster, esp when the data is already downloaded.
Some of the "genes" in the ncbi gene source indicate that their type is "unknown". right now, i default to "SO:gene", but this should probably just default to "SO:genomic_feature" (as in, make it more generic). the assumption that these are genes was incorrect because some of the unknowns are actually in a "phenotype only" class, but that information is not provided in the file.
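
A minimal sketch of the proposed default, assuming the type arrives as a plain string per record. The function name, the mapping table, and the "protein-coding" key are hypothetical illustrations, not dipper's actual code, and the class labels are copied from this thread as placeholders rather than verified SO identifiers.

```python
# Hypothetical lookup from a gene "type" string to a class label.
# Only types we are confident about get the specific "SO:gene" label.
TYPE_TO_CLASS = {
    "protein-coding": "SO:gene",  # assumed example of a type that really is a gene
}

# Generic fallback proposed in this issue, used for "unknown" and
# for any other type that has no explicit mapping.
DEFAULT_CLASS = "SO:genomic_feature"


def class_for_gene_type(gene_type):
    """Return the class label for a type string, falling back to the generic class."""
    return TYPE_TO_CLASS.get(gene_type.strip().lower(), DEFAULT_CLASS)
```

With this shape, an "unknown" record (including the "phenotype only" ones that the file does not distinguish) is typed generically instead of being counted as a gene.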