Closed julesjacobsen closed 4 years ago
also cc @matentzn
IRI: http://purl.obolibrary.org/obo/upheno/metazoa.owl
I've been toying around with a robot command that extracts just the necessary bits for semantic similarity, that would output something like: https://archive.monarchinitiative.org/202002/owl/metazoa-slim.owl
@Test
void loadMetazoaOwl() {
    Path metazoa = Paths.get("/data/metazoa-slim.owl");
    Ontology ontology = OntologyLoader.loadOntology(metazoa.toFile());
    Set<String> termPrefixes = ontology.getAllTermIds().stream().map(TermId::getPrefix).collect(toSet());
    System.out.println("No. terms: " + ontology.countAllTerms());
    System.out.println("Term prefixes: " + termPrefixes);
}
produces
No. terms: 54262
Term prefixes: [MP, ZP, HP, UPHENO, FBcv, WBPhenotype]
So this looks pretty good for a start!
@kshefchek You also load the annotations to produce the IC and then what similarity metric does the service use?
we store the latest annotations here: https://data.monarchinitiative.org/owlsim/data
we do a run to generate the 3 cache files here, and are loaded into the owltools sim server, https://data.monarchinitiative.org/owlsim/
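For readers unfamiliar with the IC mentioned above, here is a minimal sketch of the textbook definition, IC(t) = -ln(count(t) / total). It assumes the per-term counts already include annotations propagated up from descendant terms. The class name, method name and sample numbers are hypothetical, and this is not necessarily how the owltools sim server computes or caches its values.

```java
import java.util.Map;
import java.util.TreeMap;

class InformationContent {

    // IC(t) = ln(total / count(t)), which equals -ln(count(t) / total).
    // count(t) is assumed to be the number of annotated entities carrying t
    // or any descendant of t; terms with no annotations are skipped.
    static Map<String, Double> compute(Map<String, Integer> annotatedEntityCounts, int totalEntities) {
        Map<String, Double> ic = new TreeMap<>();
        annotatedEntityCounts.forEach((term, count) -> {
            if (count > 0) {
                ic.put(term, Math.log((double) totalEntities / count));
            }
        });
        return ic;
    }

    public static void main(String[] args) {
        // Made-up counts for illustration only, not real annotation data.
        Map<String, Integer> counts = Map.of("EX:root", 100, "EX:specific", 10);
        compute(counts, 100).forEach((term, value) -> System.out.println(term + "\t" + value));
    }
}
```

A term annotated to every entity gets an IC of zero, and rarer terms score higher, which is why the annotation files matter as much as the ontology itself.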
Hmmmm, the number of terms seems slightly low. ZP alone should be 30K, plus 15K MP and 15K HP. Can you give a breakdown of the counts by ontology?
we should compare to http://purl.obolibrary.org/obo/upheno/metazoa.owl to make sure there is no data loss
I mean, I know your approach skips all the obsolete terms. That alone should account for around 6K terms, so maybe it's just about right. But yes, we should be sure!
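One way to check for data loss is a plain set difference between the term IDs of the full file and those of the slim. The sketch below works on sets of ID strings, on the assumption that they have already been extracted from each ontology. The class and method names are hypothetical and the IDs are placeholders.

```java
import java.util.Set;
import java.util.TreeSet;

class TermSetDiff {

    // Returns the IDs present in the full set but absent from the slim,
    // sorted so the output is easy to scan.
    static Set<String> missingFromSlim(Set<String> full, Set<String> slim) {
        Set<String> missing = new TreeSet<>(full);
        missing.removeAll(slim);
        return missing;
    }

    public static void main(String[] args) {
        // Placeholder IDs for illustration only.
        Set<String> full = Set.of("EX:0000001", "EX:0000002", "EX:0000003");
        Set<String> slim = Set.of("EX:0000001");
        missingFromSlim(full, slim).forEach(System.out::println);
    }
}
```

Grouping the missing IDs by prefix afterwards would show whether the loss is limited to non-phenotype namespaces.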
Looks like there might be a bug in the counts provided (I checked my code, it's using different functions to get the counts). Otherwise @matentzn these are the counts by ontology:
No. terms: 54262
No. non-obsolete terms: 54262
No. obsolete terms: 16
Counts by ontology
==================
MP: 12570
ZP: 26085
HP: 13559
UPHENO: 1
WBPhenotype: 1867
FBcv: 196
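A breakdown like the one above can be produced by grouping term IDs by their prefix. This is a minimal, self-contained sketch that uses plain CURIE strings in place of Phenol's TermId so it runs without the library. The class and method names are hypothetical and this is not the code that produced the counts above.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

class PrefixBreakdown {

    // Returns the part of a CURIE before the first ':' (for example "HP" for "HP:0000118").
    // IDs without a colon are returned unchanged.
    static String prefixOf(String termId) {
        int colon = termId.indexOf(':');
        return colon < 0 ? termId : termId.substring(0, colon);
    }

    // Counts term IDs per prefix; the TreeMap keeps prefixes in sorted order.
    static Map<String, Long> countByPrefix(List<String> termIds) {
        return termIds.stream()
                .collect(Collectors.groupingBy(PrefixBreakdown::prefixOf, TreeMap::new, Collectors.counting()));
    }

    public static void main(String[] args) {
        // Illustrative IDs only, not read from metazoa.owl.
        List<String> ids = List.of("HP:0000118", "HP:0001250", "MP:0000001", "ZP:0000001");
        countByPrefix(ids).forEach((prefix, count) -> System.out.println(prefix + ": " + count));
    }
}
```

With Phenol, the same grouping could be applied to the stream from `ontology.getAllTermIds()` using `TermId::getPrefix` as the classifier, as in the test earlier in this thread.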
Hmmmm.... @kshefchek I gotta admit these seem too few..
I just double checked MP and HP, and there seem to be a few hundred terms missing..
@julesjacobsen can you run the counting for this: http://purl.obolibrary.org/obo/upheno/metazoa.owl
@matentzn that file looks to be just a header... but it does seem to magically load data!
Ah, you don't use owlapi. Compare against this then:
https://archive.monarchinitiative.org/202002/owl/metazoa.owl
Doesn't look so good
11:52:51.985 [main] WARN uk.ac.manchester.cs.owl.owlapi.OWLOntologyManagerImpl - Illegal redeclarations of entities: reuse of entity http://purl.obolibrary.org/obo/RO_0003001 in punning not allowed [Declaration(ObjectProperty(<http://purl.obolibrary.org/obo/RO_0003001>)), Declaration(AnnotationProperty(<http://purl.obolibrary.org/obo/RO_0003001>))]
11:52:51.985 [main] WARN uk.ac.manchester.cs.owl.owlapi.OWLOntologyManagerImpl - Illegal redeclarations of entities: reuse of entity http://purl.obolibrary.org/obo/RO_0003000 in punning not allowed [Declaration(ObjectProperty(<http://purl.obolibrary.org/obo/RO_0003000>)), Declaration(AnnotationProperty(<http://purl.obolibrary.org/obo/RO_0003000>))]
11:52:52.489 [main] INFO org.semanticweb.owlapi.io.AbstractOWLParser - URL connection input stream is compressed using gzip
11:52:52.862 [main] WARN uk.ac.manchester.cs.owl.owlapi.OWLOntologyManagerImpl - Illegal redeclarations of entities: reuse of entity http://purl.obolibrary.org/obo/RO_0003001 in punning not allowed [Declaration(ObjectProperty(<http://purl.obolibrary.org/obo/RO_0003001>)), Declaration(AnnotationProperty(<http://purl.obolibrary.org/obo/RO_0003001>))]
11:52:52.863 [main] WARN uk.ac.manchester.cs.owl.owlapi.OWLOntologyManagerImpl - Illegal redeclarations of entities: reuse of entity http://purl.obolibrary.org/obo/RO_0003000 in punning not allowed [Declaration(ObjectProperty(<http://purl.obolibrary.org/obo/RO_0003000>)), Declaration(AnnotationProperty(<http://purl.obolibrary.org/obo/RO_0003000>))]
No. terms: 3209
No. non-obsolete terms: 3209
No. obsolete terms: 90
VT 3299
What's up with that?
This is impossible.. if you just wget the file and look at it in a text editor, it has so much more than just VT terms!
Using https://archive.monarchinitiative.org/202002/owl/metazoa.owl
No. terms: 128080
No. non-obsolete terms: 128080
No. obsolete terms: 4840
ENVO 15
PR 944
CARO 24
owl 1
HP 14832
PATO 1076
BFO 20
UPHENO 7
FlyBase 7
FBcv 201
MPATH 414
NCBIGene 1
ZFA 3168
CHEBI 2700
OBI 1
UBERON 17124
NBO 695
SO 5
OBO 215
MP 13377
ZFS 46
GO 16902
CL 2535
NCBITaxon 104
FBdv 125
FBbt 16856
OIO 1
WBbt 223
IAO 6
ZP 35372
ZFIN 14
PCO 9
RO 3
VT 3299
WBPhenotype 2598
Correct! Yes. This suggests that the slim is not quite there yet. We will come back to this! Thanks @julesjacobsen! I will work this out with @kshefchek when he is back at work.
Cool - good to be able to confirm things are broken/working and to get started on this. Thank you both for your help @matentzn and @kshefchek.
are any of the missing terms descendants of UPHENO:0001001? If not, these are intentionally removed
I think there may be quite a few relatively important classes that are currently not children of UPHENO:0001001. Not in uPheno 2, but uPheno 1 was a bit of a mess in this regard. Given the namespace analysis that Jules provided, the first that come to mind are all these PHENOTYPE terms, like GO_0001PHENOTYPE. But yes, you are right, I forgot about that. The numbers for MP and HP are OK, given that we only care about phenotype terms.
for the purposes of sem sim these classes are not really useful, unless we are rooting with owl:Thing
Point taken.
@julesjacobsen I am going to release 1.4.2 next week. Do we want to address this before then? I am unsure what the remaining action item is?
This is sort of a discussion with an eye to a new service using Phenol to replace OwlSim2 for the Monarch compare API. So it's not really an issue, but it's also not completely finished. Can we just leave it here? Keeping it open will annoy people, so hopefully it won't fall by the wayside and be forgotten. Part one, being able to load metazoa.owl, is successful from the phenol perspective.
Note that metazoa.owl is going to be replaced soon by a specific uPheno 2 profile.
@matentzn @julesjacobsen can this be closed? I think we do not want to parse owl files with phenol? (Out of scope). Or are there open action items?
you can use robot to run elk -> sparql -> 2 column tsv (A subClassOf B). I did this for my python sem sim and it worked out pretty well.
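To illustrate the consuming side of that pipeline, here is a minimal sketch that reads a two-column table of subClassOf edges and collects the ancestors of a term. It assumes each line is `child<TAB>parent` with no header row. The exact shape of the robot/SPARQL output is an assumption, and the class names, method names and IDs are hypothetical.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

class SubClassEdges {

    // Parses "child<TAB>parent" lines into a child -> parents map; blank lines are skipped.
    static Map<String, Set<String>> parseEdges(List<String> lines) {
        Map<String, Set<String>> parents = new HashMap<>();
        for (String line : lines) {
            if (line.isBlank()) {
                continue;
            }
            String[] cols = line.split("\t");
            if (cols.length != 2) {
                throw new IllegalArgumentException("Expected 2 columns: " + line);
            }
            parents.computeIfAbsent(cols[0].trim(), k -> new HashSet<>()).add(cols[1].trim());
        }
        return parents;
    }

    // Walks up the parent links iteratively; the 'seen' set stops cycles from looping forever.
    static Set<String> ancestors(Map<String, Set<String>> parents, String term) {
        Set<String> seen = new TreeSet<>();
        Deque<String> todo = new ArrayDeque<>(parents.getOrDefault(term, Set.of()));
        while (!todo.isEmpty()) {
            String current = todo.pop();
            if (seen.add(current)) {
                todo.addAll(parents.getOrDefault(current, Set.of()));
            }
        }
        return seen;
    }

    public static void main(String[] args) {
        // Placeholder edges; a real run would read them with Files.readAllLines(path).
        List<String> lines = List.of("EX:leaf\tEX:middle", "EX:middle\tEX:root");
        System.out.println(ancestors(parseEdges(lines), "EX:leaf"));
    }
}
```

Ancestor sets like these are enough for IC-based similarity measures, which is one reason the TSV route avoids parsing OWL directly.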
Closing, thanks @matentzn
Need to be able to load this to replace similarity service currently running out of OwlSim2 for the Monarch website. @kshefchek please add more info!