monarch-initiative / dipper

Data Ingestion Pipeline for Monarch
https://dipper.readthedocs.io/en/latest/
BSD 3-Clause "New" or "Revised" License

Yeast genes not loaded in dipper due to NCBI using taxon ID of strain #464

Open cmungall opened 7 years ago

cmungall commented 7 years ago

We have no yeast gene orthology info, e.g.

https://api.monarchinitiative.org/api/bioentity/gene/SGD%3AS000003865/homologs/?rows=20&fetch_objects=true

It is in the panther species list: https://github.com/monarch-initiative/dipper/blob/master/dipper/sources/Panther.py#L311

The orthology info is loaded into scigraph, e.g.

https://scigraph-data.monarchinitiative.org/scigraph/graph/neighbors/SGD%3AS000003865?depth=1&blankNodes=false&relationshipType=RO%3AHOM0000017&direction=BOTH&entail=false&project=*

but note the gene nodes have no labels:

{
      "id": "SGD:S000003865",
      "lbl": null,
      "meta": {
        "types": [
          "cliqueLeader",
          "Node",
          "Class"
        ]
      }
    },

golr loader effectively ignores the orthology triple, as there is no metadata about the gene (I assume)
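
A quick way to see how many neighbors are affected is to scan the scigraph response for nodes with a null `lbl`. This is only a sketch: it assumes the response is a JSON object with a top-level `nodes` list whose entries look like the snippet above, and the second node in the sample is made up for illustration.

```python
def unlabeled_nodes(graph):
    """Return the ids of nodes whose 'lbl' is missing, null, or empty."""
    return [node["id"] for node in graph.get("nodes", []) if not node.get("lbl")]


# Hypothetical payload; only the first node is taken from the thread.
sample = {
    "nodes": [
        {
            "id": "SGD:S000003865",
            "lbl": None,
            "meta": {"types": ["cliqueLeader", "Node", "Class"]},
        },
        {
            "id": "EXAMPLE:1",  # invented id, for illustration only
            "lbl": "example gene",
            "meta": {"types": ["Node", "Class"]},
        },
    ]
}

print(unlabeled_nodes(sample))
```

Against a live response you would pass the parsed JSON body instead of `sample`.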

@selewis - the temporary workaround for you is to have the /homology route in biolink query scigraph (see the PR we just put in for doing the uniprot lookup for an example of how this works). However, the orthologs will still lack labels which is annoying.

The overall fix is to load all core model organism genes in dipper (but this may take some time to percolate to solr, hence the quick fix above)

cmungall commented 7 years ago

Looks like the config lives in jenkins:

http://ci.monarchinitiative.org/job/build-ncbi-ttl/configure

python dipper.py --sources ncbigene --taxon 28377,3702,9913,6239,9615,9031,7955,44689,7227,9796,9606,9544,13616,10090,9258,9598,9823,10116,4896,31033,8364,4932,9685

Is this how we want to be doing it? I'd rather have the config under version control (though it can be useful to override).

However, in any case my hypothesis is false. 4932 (Scer) is in the list above. So where did the genes go? @kshefchek ? @TomConlin ?

cmungall commented 7 years ago

It would be great if https://data.monarchinitiative.org/ttl/ncbigene_dataset.ttl had metadata about how the run was configured

cmungall commented 7 years ago

Hmm, I downloaded ncbigene.ttl, and I can't find NCBIGene:853568 (i.e. yeast SOD1 https://www.ncbi.nlm.nih.gov/gene/853568) in the file

kshefchek commented 7 years ago

Looks like this ID comes from panther, golr ignores it because it is not typed as a gene.

cmungall commented 7 years ago

Hmm, mysterious

$ grep RO_0002162 ~/Downloads/ncbigene.ttl  | sort -u
    OBO:RO_0002162 OBO:NCBITaxon_10090 ;
    OBO:RO_0002162 OBO:NCBITaxon_10116 ;
    OBO:RO_0002162 OBO:NCBITaxon_13616 ;
    OBO:RO_0002162 OBO:NCBITaxon_28377 ;
    OBO:RO_0002162 OBO:NCBITaxon_31033 ;
    OBO:RO_0002162 OBO:NCBITaxon_3702 ;
    OBO:RO_0002162 OBO:NCBITaxon_44689 ;
    OBO:RO_0002162 OBO:NCBITaxon_4896 ;
    OBO:RO_0002162 OBO:NCBITaxon_4932 ;
    OBO:RO_0002162 OBO:NCBITaxon_6239 ;
    OBO:RO_0002162 OBO:NCBITaxon_7227 ;
    OBO:RO_0002162 OBO:NCBITaxon_7955 ;
    OBO:RO_0002162 OBO:NCBITaxon_8364 ;
    OBO:RO_0002162 OBO:NCBITaxon_9031 ;
    OBO:RO_0002162 OBO:NCBITaxon_9258 ;
    OBO:RO_0002162 OBO:NCBITaxon_9544 ;
    OBO:RO_0002162 OBO:NCBITaxon_9598 ;
    OBO:RO_0002162 OBO:NCBITaxon_9606 ;
    OBO:RO_0002162 OBO:NCBITaxon_9615 ;
    OBO:RO_0002162 OBO:NCBITaxon_9685 ;
    OBO:RO_0002162 OBO:NCBITaxon_9796 ;
    OBO:RO_0002162 OBO:NCBITaxon_9823 ;
    OBO:RO_0002162 OBO:NCBITaxon_9913 ;

So 4932 is there... BUT

$ grep RO_0002162 ~/Downloads/ncbigene.ttl  | grep NCBITaxon_4932  | wc
      36     108    1440

Oh dear!
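
The same check for every taxon at once, as a rough Python sketch rather than anything dipper ships. It assumes each in-taxon statement sits on its own line in the form shown in the grep output above; `count_in_taxon` is a name made up here.

```python
import re
from collections import Counter

# Matches lines such as:  OBO:RO_0002162 OBO:NCBITaxon_4932 ;
TAXON_RE = re.compile(r"OBO:RO_0002162\s+OBO:NCBITaxon_(\d+)")


def count_in_taxon(lines):
    """Count in-taxon (RO_0002162) statements per NCBITaxon id."""
    counts = Counter()
    for line in lines:
        match = TAXON_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


# Tiny in-memory sample in the same shape as the Turtle file.
sample_lines = [
    '    rdfs:label "atp8" ;',
    "    OBO:RO_0002162 OBO:NCBITaxon_4932 ;",
    "    OBO:RO_0002162 OBO:NCBITaxon_9606 ;",
    "    OBO:RO_0002162 OBO:NCBITaxon_9606 ;",
]

for taxon, n in sorted(count_in_taxon(sample_lines).items()):
    print(taxon, n)
```

For the real file, open it and pass the file handle (`with open(path) as fh: count_in_taxon(fh)`), so a taxon with only a few dozen genes stands out immediately.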

cmungall commented 7 years ago

It looks like we only have mitochondrial

<http://www.ncbi.nlm.nih.gov/gene/24573117> a owl:Class ;
    rdfs:label "atp8" ;
    OBO:RO_0002162 OBO:NCBITaxon_4932 ;
    OBO:RO_0002525 OBO:CHR_4932chrMT ;
    dc:description "Atp8" ;
    rdfs:subClassOf OBO:SO_0001217 .

--
--
<http://www.ncbi.nlm.nih.gov/gene/24573142> a owl:Class ;
    rdfs:label "cox3" ;
    OBO:RO_0002162 OBO:NCBITaxon_4932 ;
    OBO:RO_0002525 OBO:CHR_4932chrMT ;
    dc:description "cytchrome c oxidase sububnit 3" ;
    rdfs:subClassOf OBO:SO_0001217 .

--
--
<http://www.ncbi.nlm.nih.gov/gene/24573116> a owl:Class ;
    rdfs:label "atp9" ;
    OBO:RO_0002162 OBO:NCBITaxon_4932 ;
    OBO:RO_0002525 OBO:CHR_4932chrMT ;
    dc:description "Atp9" ;
    rdfs:subClassOf OBO:SO_0001217 .
...

cmungall commented 7 years ago

AH HA! https://www.ncbi.nlm.nih.gov/gene/853568 click on taxon

They use the strain taxon ID rather than the species ID 4932 (except for the mitochondrial genes, go figure): https://www.ncbi.nlm.nih.gov/taxonomy?LinkName=gene_taxonomy&from_uid=853568
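
To illustrate why a `--taxon 4932` filter drops the nuclear genes, here is a minimal sketch of a species-level filter with and without a strain-to-species mapping. It is not dipper's actual code. The rows mimic the first three columns of NCBI's gene_info file (tax_id, GeneID, Symbol), the mapping and `select_genes` are hypothetical, and 559292 is assumed to be the taxon ID NCBI uses for the reference strain S288C.

```python
# Assumption: 559292 is the NCBI Taxonomy ID of S. cerevisiae S288C,
# the strain that NCBI Gene files the nuclear yeast genes under.
STRAIN_TO_SPECIES = {"559292": "4932"}


def select_genes(rows, wanted_taxa, strain_to_species=None):
    """Keep rows whose taxon, after optional strain mapping, is wanted."""
    strain_to_species = strain_to_species or {}
    selected = []
    for tax_id, gene_id, symbol in rows:
        # Fall back to the row's own taxon when no mapping exists.
        species = strain_to_species.get(tax_id, tax_id)
        if species in wanted_taxa:
            selected.append((species, gene_id, symbol))
    return selected


rows = [
    ("4932", "24573117", "atp8"),   # mitochondrial, species-level taxon
    ("559292", "853568", "SOD1"),   # nuclear, strain-level taxon (assumed)
]
wanted = {"4932"}

print(select_genes(rows, wanted))                     # species filter only
print(select_genes(rows, wanted, STRAIN_TO_SPECIES))  # with strain mapping
```

Without the mapping only the mitochondrial gene survives, which matches what ended up in the ttl file. Whether to map strains to the species or to add the strain ID to the taxon list is a design choice left open here.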