cmungall opened 7 years ago
Looks like the config lives in jenkins:
http://ci.monarchinitiative.org/job/build-ncbi-ttl/configure
python dipper.py --sources ncbigene --taxon 28377,3702,9913,6239,9615,9031,7955,44689,7227,9796,9606,9544,13616,10090,9258,9598,9823,10116,4896,31033,8364,4932,9685
Is this how we want to be doing it? I'd rather have the config under version control (though being able to override it can be useful).
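A minimal sketch of what "under version control, but overridable" could look like. `DEFAULT_TAXA` and `resolve_taxa` are hypothetical names, not existing dipper code; only the `--taxon` flag and the IDs themselves are taken from the Jenkins command above.

```python
import argparse

# Hypothetical checked-in default; the IDs are copied from the Jenkins job above.
DEFAULT_TAXA = (
    "28377,3702,9913,6239,9615,9031,7955,44689,7227,9796,9606,9544,"
    "13616,10090,9258,9598,9823,10116,4896,31033,8364,4932,9685"
)


def resolve_taxa(argv):
    """Return the taxon IDs to process: the checked-in default unless --taxon is given."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--taxon", default=DEFAULT_TAXA,
                        help="comma-separated NCBI taxon IDs (overrides the default)")
    args = parser.parse_args(argv)
    return [int(t) for t in args.taxon.split(",") if t.strip()]
```

With this shape, Jenkins would only need to pass `--taxon` when it wants something other than the default, and the default list would show up in the repo history.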
However, in any case my hypothesis is false. 4932 (Scer) is in the list above. So where did the genes go? @kshefchek ? @TomConlin ?
It would be great if https://data.monarchinitiative.org/ttl/ncbigene_dataset.ttl had metadata about how the run was configured
Hmm, I downloaded ncbigene.ttl, but I can't find NCBIGene:853568 (i.e. yeast SOD1, https://www.ncbi.nlm.nih.gov/gene/853568) in the file.
Looks like this ID comes from panther; golr ignores it because it is not typed as a gene.
Hmm, mysterious
$ grep RO_0002162 ~/Downloads/ncbigene.ttl | sort -u
OBO:RO_0002162 OBO:NCBITaxon_10090 ;
OBO:RO_0002162 OBO:NCBITaxon_10116 ;
OBO:RO_0002162 OBO:NCBITaxon_13616 ;
OBO:RO_0002162 OBO:NCBITaxon_28377 ;
OBO:RO_0002162 OBO:NCBITaxon_31033 ;
OBO:RO_0002162 OBO:NCBITaxon_3702 ;
OBO:RO_0002162 OBO:NCBITaxon_44689 ;
OBO:RO_0002162 OBO:NCBITaxon_4896 ;
OBO:RO_0002162 OBO:NCBITaxon_4932 ;
OBO:RO_0002162 OBO:NCBITaxon_6239 ;
OBO:RO_0002162 OBO:NCBITaxon_7227 ;
OBO:RO_0002162 OBO:NCBITaxon_7955 ;
OBO:RO_0002162 OBO:NCBITaxon_8364 ;
OBO:RO_0002162 OBO:NCBITaxon_9031 ;
OBO:RO_0002162 OBO:NCBITaxon_9258 ;
OBO:RO_0002162 OBO:NCBITaxon_9544 ;
OBO:RO_0002162 OBO:NCBITaxon_9598 ;
OBO:RO_0002162 OBO:NCBITaxon_9606 ;
OBO:RO_0002162 OBO:NCBITaxon_9615 ;
OBO:RO_0002162 OBO:NCBITaxon_9685 ;
OBO:RO_0002162 OBO:NCBITaxon_9796 ;
OBO:RO_0002162 OBO:NCBITaxon_9823 ;
OBO:RO_0002162 OBO:NCBITaxon_9913 ;
So 4932 is there... BUT
$ grep RO_0002162 ~/Downloads/ncbigene.ttl | grep NCBITaxon_4932 | wc
36 108 1440
Oh dear!
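To get the same count for every taxon at once rather than grepping one at a time, something like this should work. It is a rough sketch: `count_by_taxon` is a hypothetical helper, and it assumes the file keeps one `OBO:RO_0002162` triple per line with the `OBO:NCBITaxon_` prefix form shown in the grep output above.

```python
import re
from collections import Counter

# Matches lines like: OBO:RO_0002162 OBO:NCBITaxon_4932 ;
TAXON_RE = re.compile(r"OBO:RO_0002162\s+OBO:NCBITaxon_(\d+)")


def count_by_taxon(lines):
    """Count 'in taxon' (RO_0002162) triples per NCBITaxon ID."""
    counts = Counter()
    for line in lines:
        match = TAXON_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts
```

Passing an open file handle for the downloaded ncbigene.ttl as `lines` and sorting the result by count should make any taxon with a suspiciously small number of genes stand out.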
It looks like we only have mitochondrial genes:
<http://www.ncbi.nlm.nih.gov/gene/24573117> a owl:Class ;
rdfs:label "atp8" ;
OBO:RO_0002162 OBO:NCBITaxon_4932 ;
OBO:RO_0002525 OBO:CHR_4932chrMT ;
dc:description "Atp8" ;
rdfs:subClassOf OBO:SO_0001217 .
--
--
<http://www.ncbi.nlm.nih.gov/gene/24573142> a owl:Class ;
rdfs:label "cox3" ;
OBO:RO_0002162 OBO:NCBITaxon_4932 ;
OBO:RO_0002525 OBO:CHR_4932chrMT ;
dc:description "cytchrome c oxidase sububnit 3" ;
rdfs:subClassOf OBO:SO_0001217 .
--
--
<http://www.ncbi.nlm.nih.gov/gene/24573116> a owl:Class ;
rdfs:label "atp9" ;
OBO:RO_0002162 OBO:NCBITaxon_4932 ;
OBO:RO_0002525 OBO:CHR_4932chrMT ;
dc:description "Atp9" ;
rdfs:subClassOf OBO:SO_0001217 .
...
AH HA! Go to https://www.ncbi.nlm.nih.gov/gene/853568 and click on the taxon.
They use the strain ID (except for the mitochondrial genes, go figure): https://www.ncbi.nlm.nih.gov/taxonomy?LinkName=gene_taxonomy&from_uid=853568
We have no yeast gene orthology info, e.g.
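If that is the cause, a filter that compares only against the species-level ID would drop these rows. Here is a rough sketch of a filter that also accepts strain-level IDs. `STRAIN_TO_SPECIES` and `keep_gene` are hypothetical, not dipper code, and 559292 (which I believe is the S288C strain ID) is an assumption that should be checked against the taxonomy link above.

```python
# Hypothetical strain -> species lookup. 559292 is assumed to be
# S. cerevisiae S288C; verify against NCBI Taxonomy before relying on it.
STRAIN_TO_SPECIES = {559292: 4932}


def keep_gene(tax_id, requested):
    """Keep a gene if its taxon, or the species its strain maps to, was requested."""
    species = STRAIN_TO_SPECIES.get(tax_id, tax_id)
    return tax_id in requested or species in requested
```

The alternative is simply to add the strain ID to the `--taxon` list, but then downstream consumers see two different taxa for what they probably think of as one organism.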
https://api.monarchinitiative.org/api/bioentity/gene/SGD%3AS000003865/homologs/?rows=20&fetch_objects=true
It is in the panther species list: https://github.com/monarch-initiative/dipper/blob/master/dipper/sources/Panther.py#L311
The orthology info is loaded into scigraph, e.g.
https://scigraph-data.monarchinitiative.org/scigraph/graph/neighbors/SGD%3AS000003865?depth=1&blankNodes=false&relationshipType=RO%3AHOM0000017&direction=BOTH&entail=false&project=*
but note the gene nodes have no labels
The golr loader effectively ignores the orthology triple, as there is no metadata about the gene (I assume).
@selewis - the temporary workaround for you is to have the /homology route in biolink query scigraph (see the PR we just put in for doing the uniprot lookup for an example of how this works). However, the orthologs will still lack labels, which is annoying.
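For reference, the scigraph query the workaround would need to issue can be built like this. It is only a sketch: `homolog_neighbors_url` is a hypothetical helper, not the biolink implementation, and the parameters are copied from the example neighbors URL above.

```python
from urllib.parse import quote, urlencode

SCIGRAPH_BASE = "https://scigraph-data.monarchinitiative.org/scigraph"


def homolog_neighbors_url(gene_id, relationship="RO:HOM0000017"):
    """Build the scigraph neighbors URL for a gene's orthology edges."""
    params = {
        "depth": 1,
        "blankNodes": "false",
        "relationshipType": relationship,
        "direction": "BOTH",
        "entail": "false",
        "project": "*",
    }
    # Percent-encode the CURIE's colon in the path; keep '*' literal in the query.
    path = quote(gene_id, safe="")
    return f"{SCIGRAPH_BASE}/graph/neighbors/{path}?{urlencode(params, safe='*')}"
```

Calling `homolog_neighbors_url("SGD:S000003865")` reproduces the example URL; the nodes that come back will still have no labels until the genes themselves are loaded.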
The overall fix is to load all core model organism genes in dipper (but this may take some time to percolate to solr, hence the quick fix above)