monarch-initiative / phenol

phenol: Phenotype ontology library
https://phenol.readthedocs.io/en/latest/
BSD 3-Clause "New" or "Revised" License

Load Metazoa.owl #254

Closed julesjacobsen closed 4 years ago

julesjacobsen commented 4 years ago

We need to be able to load this to replace the similarity service currently running on OwlSim2 for the Monarch website. @kshefchek please add more info!

kshefchek commented 4 years ago

also cc @matentzn

IRI: http://purl.obolibrary.org/obo/upheno/metazoa.owl

I've been toying around with a robot command that extracts just the necessary bits for semantic similarity, that would output something like: https://archive.monarchinitiative.org/202002/owl/metazoa-slim.owl

julesjacobsen commented 4 years ago

@Test
void loadMetazoaOwl() {
  // Load the slimmed ontology from a local copy
  Path metazoa = Paths.get("/data/metazoa-slim.owl");
  Ontology ontology = OntologyLoader.loadOntology(metazoa.toFile());
  // Collect the distinct term ID prefixes (toSet is a static import of Collectors.toSet)
  Set<String> termPrefixes = ontology.getAllTermIds().stream().map(TermId::getPrefix).collect(toSet());
  System.out.println("No. terms: " + ontology.countAllTerms());
  System.out.println("Term prefixes: " + termPrefixes);
}

produces

No. terms: 54262
Term prefixes: [MP, ZP, HP, UPHENO, FBcv, WBPhenotype]

So this looks pretty good for a start!

@kshefchek You also load the annotations to produce the IC and then what similarity metric does the service use?
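For context on the IC question, here is a minimal sketch of how information content is commonly derived from annotations. It assumes the usual definition IC(t) = -ln(p(t)), where p(t) is the fraction of annotated entities carrying term t or any of its descendants. The class and method names are hypothetical, the data is a toy example, and this is neither the phenol API nor necessarily what the OwlSim service does.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

// Hypothetical helper, not part of phenol or OwlSim.
final class InformationContentSketch {

    /**
     * @param annotatedEntities term -> entities annotated to the term or any of its
     *                          descendants (annotations already propagated upwards)
     * @param totalEntities     number of annotated entities in the corpus
     * @return term -> IC, computed as ln(total / n), which equals -ln(n / total)
     */
    static Map<String, Double> informationContent(Map<String, Set<String>> annotatedEntities,
                                                  int totalEntities) {
        Map<String, Double> ic = new HashMap<>();
        for (Map.Entry<String, Set<String>> entry : annotatedEntities.entrySet()) {
            int n = entry.getValue().size();
            if (n == 0) {
                continue; // IC is undefined for terms without annotations
            }
            ic.put(entry.getKey(), Math.log((double) totalEntities / n));
        }
        return ic;
    }

    public static void main(String[] args) {
        // Toy data: the root term is carried by every entity, so its IC is 0.
        Map<String, Set<String>> annotated = Map.of(
                "UPHENO:0001001", Set.of("gene1", "gene2", "gene3", "gene4"),
                "HP:0000118", Set.of("gene1", "gene2"),
                "HP:0001250", Set.of("gene1"));
        informationContent(annotated, 4)
                .forEach((term, value) -> System.out.println(term + "\t" + value));
    }
}
```

The rarer a term is among the annotated entities, the higher its IC, which is why the annotation files are needed in addition to the ontology.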

kshefchek commented 4 years ago

You also load the annotations to produce the IC and then what similarity metric does the service use?

we store the latest annotations here: https://data.monarchinitiative.org/owlsim/data

we do a run to generate the 3 cache files, which are then loaded into the owltools sim server; the files are here: https://data.monarchinitiative.org/owlsim/

matentzn commented 4 years ago

Hmmmm number of terms.. Seems slightly low. ZP alone should be 30K, + 15K MP, 15K HP.. Can you count a breakdown by ontology?
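A per-ontology breakdown of this kind can be sketched by grouping term IDs on their CURIE prefix. The example below works on plain strings so that it is self-contained; with phenol the prefixes would presumably come from `ontology.getAllTermIds()` and `TermId::getPrefix`, as in the test earlier in the thread. The class and method names are hypothetical.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

// Hypothetical helper, not part of phenol.
final class PrefixBreakdownSketch {

    /** Returns the part of a CURIE before the first ':' (e.g. "HP" for "HP:0000118"). */
    static String prefixOf(String curie) {
        int colon = curie.indexOf(':');
        return colon < 0 ? curie : curie.substring(0, colon);
    }

    /** Counts term IDs per prefix; a TreeMap keeps the output sorted by prefix. */
    static Map<String, Long> countByPrefix(List<String> termIds) {
        return termIds.stream()
                .collect(Collectors.groupingBy(
                        PrefixBreakdownSketch::prefixOf, TreeMap::new, Collectors.counting()));
    }

    public static void main(String[] args) {
        // Toy input; a real run would pass every term ID in the loaded ontology.
        List<String> ids = List.of("HP:0000118", "HP:0001250", "MP:0000001", "ZP:0000001");
        countByPrefix(ids).forEach((prefix, n) -> System.out.println(prefix + ": " + n));
    }
}
```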

kshefchek commented 4 years ago

we should compare to http://purl.obolibrary.org/obo/upheno/metazoa.owl to make sure there is no data loss

matentzn commented 4 years ago

I mean, I know your approach skips all the obsolete terms.. So that alone should account for around 6K terms, so maybe it's just about right. But yes, we should be sure!

julesjacobsen commented 4 years ago

Looks like there might be a bug in the counts provided (I checked my code, it's using different functions to get the counts). Otherwise @matentzn these are the counts by ontology:

No. terms: 54262
No. non-obsolete terms: 54262
No. obsolete terms: 16

Counts by ontology
==================
MP: 12570
ZP: 26085
HP: 13559
UPHENO: 1
WBPhenotype: 1867
FBcv: 196

matentzn commented 4 years ago

Hmmmm.... @kshefchek I gotta admit these seem too few..

I just double checked MP and HP, and there seem to be a few hundred terms missing..

@julesjacobsen can you run the counting for this: http://purl.obolibrary.org/obo/upheno/metazoa.owl

julesjacobsen commented 4 years ago

@matentzn that file looks to be just a header... but it does seem to magically load data!

matentzn commented 4 years ago

Ah you don't use owlapi.. Compare against this then:

https://archive.monarchinitiative.org/202002/owl/metazoa.owl

julesjacobsen commented 4 years ago

Doesn't look so good

11:52:51.985 [main] WARN  uk.ac.manchester.cs.owl.owlapi.OWLOntologyManagerImpl - Illegal redeclarations of entities: reuse of entity http://purl.obolibrary.org/obo/RO_0003001 in punning not allowed [Declaration(ObjectProperty(<http://purl.obolibrary.org/obo/RO_0003001>)), Declaration(AnnotationProperty(<http://purl.obolibrary.org/obo/RO_0003001>))]
11:52:51.985 [main] WARN  uk.ac.manchester.cs.owl.owlapi.OWLOntologyManagerImpl - Illegal redeclarations of entities: reuse of entity http://purl.obolibrary.org/obo/RO_0003000 in punning not allowed [Declaration(ObjectProperty(<http://purl.obolibrary.org/obo/RO_0003000>)), Declaration(AnnotationProperty(<http://purl.obolibrary.org/obo/RO_0003000>))]
11:52:52.489 [main] INFO  org.semanticweb.owlapi.io.AbstractOWLParser - URL connection input stream is compressed using gzip
11:52:52.862 [main] WARN  uk.ac.manchester.cs.owl.owlapi.OWLOntologyManagerImpl - Illegal redeclarations of entities: reuse of entity http://purl.obolibrary.org/obo/RO_0003001 in punning not allowed [Declaration(ObjectProperty(<http://purl.obolibrary.org/obo/RO_0003001>)), Declaration(AnnotationProperty(<http://purl.obolibrary.org/obo/RO_0003001>))]
11:52:52.863 [main] WARN  uk.ac.manchester.cs.owl.owlapi.OWLOntologyManagerImpl - Illegal redeclarations of entities: reuse of entity http://purl.obolibrary.org/obo/RO_0003000 in punning not allowed [Declaration(ObjectProperty(<http://purl.obolibrary.org/obo/RO_0003000>)), Declaration(AnnotationProperty(<http://purl.obolibrary.org/obo/RO_0003000>))]
No. terms: 3209
No. non-obsolete terms: 3209
No. obsolete terms: 90
VT 3299

What's up with that?

matentzn commented 4 years ago

This is impossible.. if you just wget the file and look at it in a text editor, it has so much more than just VT terms!

julesjacobsen commented 4 years ago

Using https://archive.monarchinitiative.org/202002/owl/metazoa.owl

No. terms: 128080
No. non-obsolete terms: 128080
No. obsolete terms: 4840
ENVO 15
PR 944
CARO 24
owl 1
HP 14832
PATO 1076
BFO 20
UPHENO 7
FlyBase 7
FBcv 201
MPATH 414
NCBIGene 1
ZFA 3168
CHEBI 2700
OBI 1
UBERON 17124
NBO 695
SO 5
OBO 215
MP 13377
ZFS 46
GO 16902
CL 2535
NCBITaxon 104
FBdv 125
FBbt 16856
OIO 1
WBbt 223
IAO 6
ZP 35372
ZFIN 14
PCO 9
RO 3
VT 3299
WBPhenotype 2598

matentzn commented 4 years ago

corrrrect! yes. This kinda suggests that the slim is not quite there yet.. We will go back to this! Thanks @julesjacobsen ! I will work this out with @kshefchek when he is back at work.

julesjacobsen commented 4 years ago

Cool - good to be able to confirm things are broken/working and to get started on this. Thank you both for your help @matentzn and @kshefchek.

kshefchek commented 4 years ago

are any of the missing terms descendants of UPHENO:0001001? If not, these are intentionally removed
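One way to answer this for a list of missing terms is a breadth-first walk up the subclass hierarchy. The sketch below assumes a child-to-parents map is already available (for example, built from asserted or inferred subClassOf pairs); the class name, method name and toy hierarchy are hypothetical, and this is not the phenol API.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical helper, not part of phenol.
final class DescendantCheckSketch {

    /**
     * True if 'ancestor' is reachable from 'term' by following child -> parents edges.
     * A term is not counted as a descendant of itself.
     */
    static boolean isDescendantOf(String term, String ancestor, Map<String, Set<String>> parents) {
        Deque<String> queue = new ArrayDeque<>(parents.getOrDefault(term, Set.of()));
        Set<String> seen = new HashSet<>();
        while (!queue.isEmpty()) {
            String current = queue.poll();
            if (!seen.add(current)) {
                continue; // already visited via another path
            }
            if (current.equals(ancestor)) {
                return true;
            }
            queue.addAll(parents.getOrDefault(current, Set.of()));
        }
        return false;
    }

    public static void main(String[] args) {
        // Toy hierarchy for illustration only.
        Map<String, Set<String>> parents = Map.of(
                "HP:0000118", Set.of("UPHENO:0001001"),
                "HP:0001250", Set.of("HP:0000118"));
        for (String term : Set.of("HP:0001250", "GO:0008150")) {
            System.out.println(term + " under UPHENO:0001001: "
                    + isDescendantOf(term, "UPHENO:0001001", parents));
        }
    }
}
```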

matentzn commented 4 years ago

I think there may be quite a few relatively important classes that are currently not children of UPHENO:0001001.. Not in uPheno2, but uPheno 1 was a bit of a mess in this regard.. Given the namespace analysis that Jules provided, the first that come to mind are all these PHENOTYPE terms, like GO_0001PHENOTYPE. But yes, you are right, I forgot about that. The numbers for MP and HP are OK, given that we only care about phenotype terms.

kshefchek commented 4 years ago

for the purposes of sem sim these classes are not really useful, unless we are rooting with owl:Thing
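To illustrate the point, here is a small sketch assuming a Resnik-style measure, where the similarity of two terms is the IC of their most informative common ancestor. The thread does not state which metric the service uses, so this is only an assumption, and the names and IC values are hypothetical. Because every entity is implicitly annotated to the root, the root has an IC of 0, so two classes whose only shared ancestor is owl:Thing score 0.

```java
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical helper, not part of phenol or OwlSim.
final class ResnikSketch {

    /**
     * Resnik similarity: the IC of the most informative common ancestor.
     * Each ancestor set is expected to contain the term itself. Returns 0 if
     * the sets share nothing or only terms without a known IC.
     */
    static double resnik(Set<String> ancestorsA, Set<String> ancestorsB, Map<String, Double> ic) {
        Set<String> common = new HashSet<>(ancestorsA);
        common.retainAll(ancestorsB);
        return common.stream()
                .mapToDouble(term -> ic.getOrDefault(term, 0.0))
                .max()
                .orElse(0.0);
    }

    public static void main(String[] args) {
        // Toy IC values for illustration only.
        Map<String, Double> ic = Map.of(
                "owl:Thing", 0.0, "UPHENO:0001001", 0.5, "A", 3.0, "B", 2.0, "C", 2.5);
        Set<String> a = Set.of("A", "UPHENO:0001001", "owl:Thing");
        Set<String> b = Set.of("B", "UPHENO:0001001", "owl:Thing");
        Set<String> c = Set.of("C", "owl:Thing"); // not under the phenotype root
        System.out.println("A vs B: " + resnik(a, b, ic));
        System.out.println("A vs C: " + resnik(a, c, ic));
    }
}
```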

matentzn commented 4 years ago

Point taken.

pnrobinson commented 4 years ago

@julesjacobsen I am going to release 1.4.2 next week. Do we want to address this before then? I am unsure what the remaining action item is?

julesjacobsen commented 4 years ago

This is sort of a discussion with an eye to a new service using Phenol to replace OwlSim2 for the Monarch compare API. So it's not really an issue, but it's also not completely finished. Can we just leave it here? Keeping it open will annoy people, so hopefully it won't fall by the wayside and be forgotten. Part one, being able to load metazoa.owl, is successful from the phenol perspective.

matentzn commented 4 years ago

Note that metazoa.owl is going to be replaced soon by a specific uPheno 2 profile.

pnrobinson commented 4 years ago

@matentzn @julesjacobsen can this be closed? I think we do not want to parse owl files with phenol? (Out of scope). Or are there open action items?

kshefchek commented 4 years ago

you can use robot to run elk -> sparql -> 2 column tsv (A subClassOf B). I did this for my python sem sim and it worked out pretty well.
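On the consuming side, a two-column file like that is easy to read into a child-to-parents map. The sketch below assumes tab-separated lines with the subclass in the first column, the superclass in the second, and no header row; that layout, the class name and the method name are assumptions rather than anything robot guarantees.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical helper; the TSV layout is an assumption.
final class SubclassTsvSketch {

    /** Builds a child -> parents map from lines of the form "child<TAB>parent". */
    static Map<String, Set<String>> parseParents(List<String> lines) {
        Map<String, Set<String>> parents = new HashMap<>();
        for (String line : lines) {
            if (line.isBlank()) {
                continue; // tolerate empty lines, e.g. a trailing newline
            }
            String[] cols = line.split("\t");
            if (cols.length != 2) {
                throw new IllegalArgumentException("Expected 2 columns: " + line);
            }
            parents.computeIfAbsent(cols[0].trim(), k -> new HashSet<>()).add(cols[1].trim());
        }
        return parents;
    }

    public static void main(String[] args) {
        // Toy rows; a real run could read them with Files.readAllLines(...).
        List<String> rows = List.of(
                "HP:0000118\tUPHENO:0001001",
                "HP:0001250\tHP:0000118");
        System.out.println(parseParents(rows));
    }
}
```

A map in this shape is enough to compute ancestor sets for semantic similarity without parsing OWL in the similarity code itself.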

pnrobinson commented 4 years ago

Closing, thanks @matentzn