monarch-initiative / dipper

Data Ingestion Pipeline for Monarch
https://dipper.readthedocs.io/en/latest/
BSD 3-Clause "New" or "Revised" License

Ontology and or ontology link error #741

kltm closed 5 years ago

kltm commented 5 years ago

How to reproduce

Expected results: I'm guessing that there should be no "PHENOTYPE" parallel construction for the GO available to the user?

nathandunn commented 5 years ago

I had run into this earlier. I think this is a dipper / ingest error.

kltm commented 5 years ago

Feel free to move this to whatever tracker is appropriate.

TomConlin commented 5 years ago

Confirmed: GO:0022008PHENOTYPE comes in via the dipper GO ingest

likely here: https://github.com/monarch-initiative/dipper/blob/master/dipper/sources/GeneOntology.py#L323

Cannot speak to why.

this issue could be moved to dipper

kshefchek commented 5 years ago

These are upheno grouping classes, but I believe they will be deprecated

matentzn commented 5 years ago

I am very interested in this going away as well, and I think we had discussed it in one of the previous data calls.. There should be 'real' phenotype terms for most, if not all of these cases;

Can someone make a list of ingest scripts where these are generated or sources where these come from?

TomConlin commented 5 years ago

Here is what I find. The blank nodes are neither here nor there, but the links to nowhere at OBO would be irritating to me if I were running OBO.

perhaps

##############################
ntriples/flybase.nt:273649
  RDF_SUBJECTS
  ------------  
  6416 <http://purl.obolibrary.org/obo    
        5798    FBbt_
          616     GO_
            2       SO_

  5294 <https://monarchinitiative.org/.well-known

  RDF_OBJECTS
  -----------
 248215 <http://purl.obolibrary.org/obo
         228550 FBbt
          19465 GO
            200 SO

  13724 <https://monarchinitiative.org/.well-known

########################
ntriples/go.nt:82180
  RDF_SUBJECTS
  ------------ 
    0 (zero)

  RDF_OBJECTS
  -----------  
  82180 <http://purl.obolibrary.org/obo
        82180 GO_

###########################
ntriples/monarch.nt:76
  RDF_SUBJECTS
  ------------ 
    0 (zero)

  RDF_OBJECTS
  -----------  
  76 <http://purl.obolibrary.org/obo
      6 CL
     22 GO_
      2 MPATH_
     10 NBO_
     36 UBERON_

matentzn commented 5 years ago

Thanks Tom, could you tell me what exactly you ran to create this output? Does this mean that there are 5798 FBBT123PHENOTYPE classes? It would make some sense (flybase curates anatomical phenotypes against their anatomy ontology only), but just wanted to be sure! Thanks.

TomConlin commented 5 years ago

Sure, shell commands. Here is some history:

cd Dev/NTriples_201901/
fgrep -c "PHENOTYPE> " ntriples/*.nt
cut -f1 -d' ' ntriples/flybase.nt |fgrep "PHENOTYPE>"| head
cut -f1 -d' ' ntriples/flybase.nt |fgrep "PHENOTYPE>"|cut -f1-4 -d '/'|sort|uniq -c|sort -nr
cut -f3 -d' ' ntriples/flybase.nt |fgrep "PHENOTYPE>"|cut -f1-4 -d '/'|sort|uniq -c|sort -nr
cut -f1 -d' ' ntriples/go.nt      |fgrep "PHENOTYPE>"|cut -f1-4 -d '/'|sort|uniq -c|sort -nr
cut -f3 -d' ' ntriples/go.nt      |fgrep "PHENOTYPE>"|cut -f1-4 -d '/'|sort|uniq -c|sort -nr
cut -f3 -d' ' ntriples/monarch.nt |fgrep "PHENOTYPE>"|cut -f1-4 -d '/'|sort|uniq -c|sort -nr
cut -f1 -d' ' ntriples/monarch.nt |fgrep "PHENOTYPE>"|cut -f1-4 -d '/'|sort|uniq -c|sort -nr
cut -f1 -d' ' ntriples/flybase.nt|grep "<http://purl.obolibrary.org/obo/.*PHENOTYPE>"|less
cut -f1 -d' ' ntriples/flybase.nt|grep "<http://purl.obolibrary.org/obo/.*PHENOTYPE>"|cut -f5 -d \/|head
cut -f1 -d' ' ntriples/flybase.nt|grep "<http://purl.obolibrary.org/obo/.*PHENOTYPE>"|cut -f5 -d \/|cut -f1 -d \_|head
cut -f1 -d' ' ntriples/flybase.nt|grep "<http://purl.obolibrary.org/obo/.*PHENOTYPE>"|cut -f5 -d \/|cut -f1 -d \_|sort|uniq -c
cut -f3 -d' ' ntriples/go.nt     |grep "<http://purl.obolibrary.org/obo/.*PHENOTYPE>"|cut -f5 -d \/|cut -f1 -d \_|sort|uniq -c
cut -f3 -d' ' ntriples/monarch.nt|grep "<http://purl.obolibrary.org/obo/.*PHENOTYPE>"|cut -f5 -d \/|cut -f1 -d \_|sort|uniq -c
cut -f3 -d' ' ntriples/flybase.nt|grep "<http://purl.obolibrary.org/obo/.*PHENOTYPE>"|cut -f5 -d \/|cut -f1 -d \_|sort|uniq -c

then paste the less crufty bits into the lovingly crafted ticket
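
For anyone who would rather not chain `cut` and `grep`, roughly the same tally can be sketched in Python. This is a minimal sketch and not what was actually run: the function name `count_phenotype_prefixes` is made up here, and it assumes each N-Triples line is space-separated with the subject in the first field and the object in the third, which is the same assumption the `cut` commands above make.

```python
import re
from collections import Counter

# Matches OBO PURLs with PHENOTYPE appended, e.g.
# <http://purl.obolibrary.org/obo/GO_0022008PHENOTYPE>, capturing the prefix (GO).
PHENOTYPE_IRI = re.compile(
    r"^<http://purl\.obolibrary\.org/obo/([A-Za-z]+)_\w*PHENOTYPE>$"
)


def count_phenotype_prefixes(lines, field):
    """Tally ontology prefixes of PHENOTYPE IRIs in one triple position.

    field is 0 for subjects and 2 for objects, mirroring cut -f1 / cut -f3.
    """
    counts = Counter()
    for line in lines:
        parts = line.split(" ")
        if len(parts) <= field:
            continue  # malformed or empty line
        match = PHENOTYPE_IRI.match(parts[field])
        if match:
            counts[match.group(1)] += 1
    return counts
```

Passing an open file handle for something like `ntriples/flybase.nt` with `field=0` should give the per-prefix subject counts, and `field=2` the object counts.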

matentzn commented 5 years ago

@kshefchek

Do these generated classes like FBBT123PHENOTYPE somehow make it into OWLSIM? Would monarch (phenogrid) recognise FBBT:HEADPHENOTYPE to be similar to HP:abnormal HEAD?

kshefchek commented 5 years ago

They do make it into owlsim, and can be compared, for example:

https://monarchinitiative.org/owlsim/compareAttributeSets?a=FBbt:00000004PHENOTYPE&b=HP:0000234

0 for phenodigm, .28 jaccard sim

https://monarchinitiative.org/owlsim/searchByAttributeSet?a=FBbt:00000004PHENOTYPE

There's not much in terms of connections outside of fly genes.

matentzn commented 5 years ago

Wow, this is surprising. I mean, in order to have a Jaccard of 0.28, it must share some superclasses. I can't see at the moment where owlsim would get them from. Is there any way to see the shared superclasses from the Jaccard result?
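
For context, Jaccard similarity between two classes is usually computed over their sets of inferred superclasses: the size of the intersection divided by the size of the union. A minimal sketch of that arithmetic follows. It is not owlsim's implementation, and the ancestor sets are placeholders invented purely to show the calculation.

```python
def jaccard(ancestors_a, ancestors_b):
    """Return |A ∩ B| / |A ∪ B| for two collections of class identifiers."""
    a, b = set(ancestors_a), set(ancestors_b)
    union = a | b
    if not union:
        return 0.0  # two empty sets: define similarity as zero
    return len(a & b) / len(union)


# Placeholder ancestor sets (hypothetical, not taken from owlsim or any ontology).
fly_head = {"X:shared1", "X:shared2", "X:fly_only1", "X:fly_only2"}
human_head = {"X:shared1", "X:shared2", "X:human_only1"}

print(jaccard(fly_head, human_head))  # 2 shared out of 5 distinct classes
```

The score is zero exactly when the intersection is empty, so any nonzero value implies at least one shared superclass.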

kshefchek commented 5 years ago

The following classes are in common:

p.iri p.label
"http://purl.obolibrary.org/obo/UBERON_0000033PHENOTYPE" "head phenotype"
"http://purl.obolibrary.org/obo/UBERON_0007811PHENOTYPE" "craniocervical region phenotype"

matentzn commented 5 years ago

Thanks! :) No rush, but if you can point me to the code that ensures that both FBbt:00000004PHENOTYPE and HP:0000234 are subclasses of http://purl.obolibrary.org/obo/UBERON_0007811PHENOTYPE, that would help. This is really a big surprise to me!

kshefchek commented 5 years ago

This is coming from scigraph which creates convenience edges during load time. It may not be reflective of what is in owlsim. The owlsim makefile uses http://archive.monarchinitiative.org/201902/owl/metazoa.owl, see https://github.com/monarch-initiative/monarch-owlsim-data/blob/master/server/Makefile

It's also possible the jaccard sim implementation includes the parent UPHENO phenotype class and that's what we're seeing

monicacecilia commented 5 years ago

... aaaaand? I'm at the edge of my seat. What happens next? 😃

matentzn commented 5 years ago

From my standpoint, we will do the following: whenever Monarch is slurping post-composed phenotypes, we have a separate repo (like in the case of ZP) that maps the post-composed annotations into a pre-coordinated vocabulary in a standard, transparent way. Ideally, this would happen in collaboration with the MODs, but in cases where this is not possible, we need to think about IRIs and membership in uPheno. For example, FlyBase will curate an abnormal head as FBBT:001 (Head). Now this is not a phenotype term; it is an anatomical entity. So to be clean with our conceptual model, we have so far generated these ominous FBBT:001PHENOTYPE classes. To provide the classification, someone, probably @cmungall, created this ontology here. This is roughly what should have happened, but the above ontology does not use EQ; it will therefore not neatly fall into our general framework. So the questions we need to answer now are:

1) Let's say FlyBase does not have an interest in integrating a vocabulary on abnormal anatomical entities: will we provide a stable ID that we intend people to use? So for example, a single cell atlas may want to record an abnormal head (fly) phenotype; is it our intention that they use our term?

2) Our PHENOTYPE classes have been out in the wild now for a while. :/ Should we continue to use this URI scheme (adding the word PHENOTYPE at the end) to stay backwards compatible? Do we mass deprecate? Do we just silently make them disappear (as they were never meant to be used in the first place)?
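
To make the URI scheme in question concrete, here is a rough sketch of how such an identifier is formed and recognised. The helper names are hypothetical and this is not the code that generates the classes; it only assumes the OBO PURL base seen elsewhere in this thread and the convention of appending the word PHENOTYPE to the entity's identifier.

```python
OBO_BASE = "http://purl.obolibrary.org/obo/"
SUFFIX = "PHENOTYPE"


def phenotype_iri(curie):
    """Build the generated-class IRI for an entity CURIE such as 'FBbt:00000004'."""
    prefix, local_id = curie.split(":", 1)
    return f"{OBO_BASE}{prefix}_{local_id}{SUFFIX}"


def base_curie(iri):
    """Recover the underlying entity CURIE, or None if the IRI is not of this form."""
    if not (iri.startswith(OBO_BASE) and iri.endswith(SUFFIX)):
        return None
    fragment = iri[len(OBO_BASE):-len(SUFFIX)]
    prefix, sep, local_id = fragment.partition("_")
    if not sep or not prefix or not local_id:
        return None
    return f"{prefix}:{local_id}"
```

Because the mapping is reversible, a mass deprecation could in principle point each generated class back at the entity it was derived from, whichever of the options above is chosen.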

kshefchek commented 5 years ago

We determined that this is coming from uPheno and not dipper, so proposing we close or move the ticket.

monicacecilia commented 5 years ago

Ok. But the problem persists, so I'd like to still have an open ticket on the UI Project. Dear @kshefchek @matentzn @kltm - could one of you please open a new ticket in uPheno that includes this information, and link here as well? Thank you!

kshefchek commented 5 years ago

Sure thing, I've made a ticket here - https://github.com/obophenotype/upheno/issues/521

From a UI perspective, we could have a blacklist of CURIE prefixes where we know the PURLs go nowhere, and not link out to them.

EDIT: actually this wouldn't work here, because the prefix is GO
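
A suffix check might sidestep the prefix problem. The sketch below only illustrates the idea and is not UI code from Monarch: `should_link_out` is a hypothetical helper, and it assumes every generated identifier ends in the literal string PHENOTYPE.

```python
def should_link_out(curie, blocked_prefixes=frozenset()):
    """Decide whether a CURIE should be rendered as an outbound link.

    A prefix blocklist alone cannot catch GO:0030424PHENOTYPE, because GO is a
    perfectly good prefix; the generated classes are only distinguishable by
    the PHENOTYPE suffix on the local identifier.
    """
    prefix, _, local_id = curie.partition(":")
    if prefix in blocked_prefixes:
        return False
    return not local_id.endswith("PHENOTYPE")
```

With this, `GO:0030424` would still link out while `GO:0030424PHENOTYPE` would not.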

We're actually doing better on beta than on production, which has two broken URLs, one of which 404s: https://monarchinitiative.org/phenotype/GO:0030424PHENOTYPE