Closed caufieldjh closed 2 years ago
Output of newly added script get_all_transform_stats.sh:
All processed ontologies: 910
All successful JSON transforms: 896
All successful KGX TSV transforms: 888
Transforms with at least one of the following errors:
MISSING_NODE_PROPERTY 0
MISSING_EDGE_PROPERTY 0
INVALID_NODE_PROPERTY 833
INVALID_EDGE_PROPERTY 796
INVALID_NODE_PROPERTY_VALUE_TYPE 28
INVALID_NODE_PROPERTY_VALUE 833
INVALID_EDGE_PROPERTY_VALUE_TYPE 0
INVALID_EDGE_PROPERTY_VALUE 796
MISSING_CATEGORY 0
INVALID_CATEGORY 888
Category 'OntologyClass' is a mixin in the Biolink Model 888
MISSING_EDGE_PREDICATE 0
INVALID_EDGE_PREDICATE 495
MISSING_NODE_CURIE_PREFIX 0
DUPLICATE_NODE 0
MISSING_NODE 0
INVALID_EDGE_TRIPLE 0
VALIDATION_SYSTEM_ERROR 0
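A tally of this kind (number of transforms whose validation log mentions each error code at least once) could be sketched in Python as below. This is not the contents of get_all_transform_stats.sh, which this thread does not show. The function name, the `logs` mapping, and the assumption that error codes appear verbatim in the log text are all hypothetical. The sketch matches whole codes only, so `INVALID_NODE_PROPERTY` is not counted when the log only contains `INVALID_NODE_PROPERTY_VALUE`. The equal counts reported for those pairs would be consistent with substring matching, but that is a guess.

```python
import re
from collections import Counter

# Error codes as listed in the statistics output.
ERROR_CODES = [
    "MISSING_NODE_PROPERTY",
    "MISSING_EDGE_PROPERTY",
    "INVALID_NODE_PROPERTY",
    "INVALID_EDGE_PROPERTY",
    "INVALID_NODE_PROPERTY_VALUE_TYPE",
    "INVALID_NODE_PROPERTY_VALUE",
    "INVALID_EDGE_PROPERTY_VALUE_TYPE",
    "INVALID_EDGE_PROPERTY_VALUE",
    "MISSING_CATEGORY",
    "INVALID_CATEGORY",
    "MISSING_EDGE_PREDICATE",
    "INVALID_EDGE_PREDICATE",
    "MISSING_NODE_CURIE_PREFIX",
    "DUPLICATE_NODE",
    "MISSING_NODE",
    "INVALID_EDGE_TRIPLE",
    "VALIDATION_SYSTEM_ERROR",
]


def count_transforms_with_errors(logs):
    """Count transforms whose log mentions each error code at least once.

    `logs` is a hypothetical mapping of transform name -> log text.
    """
    counts = Counter({code: 0 for code in ERROR_CODES})
    for text in logs.values():
        for code in ERROR_CODES:
            # Lookarounds reject matches inside a longer code, e.g.
            # INVALID_NODE_PROPERTY inside INVALID_NODE_PROPERTY_VALUE.
            pattern = rf"(?<![A-Z_]){code}(?![A-Z_])"
            if re.search(pattern, text):
                counts[code] += 1
    return counts
```

Each transform contributes at most one to a given code, regardless of how many times the code appears in its log.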
The big take-home here is that entities in every transform get assigned biolink:OntologyClass, despite Biolink modeling OntologyClass as a class mixin rather than intending it to be a class type itself.
Do we know enough about each ontology to assign a more specific class to nodes?
There are other metaclasses, like [biolink:TaxonomicRank](https://w3id.org/biolink/vocab/TaxonomicRank); these may still make sense to use in some contexts.
Finding appropriate mappings vs. Biolink is a goal for kgx; that will help to reduce the number of OntologyClass nodes.
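One way to picture assigning more specific classes is a fallback table keyed on CURIE prefix, sketched below. This is an illustration only, not how KGX assigns categories. The table entries, the function name, and the choice to fall back to biolink:NamedThing are assumptions made for the example.

```python
# Hypothetical prefix-to-category table; entries are illustrative only
# and would need review against the Biolink Model before real use.
PREFIX_TO_CATEGORY = {
    "DOID": "biolink:Disease",
    "CL": "biolink:Cell",
    "CHEBI": "biolink:ChemicalSubstance",
}

# Categories treated as uninformative in this sketch.
GENERIC = {"biolink:OntologyClass", "biolink:NamedThing"}


def refine_categories(node_id, categories):
    """Return more specific categories for a node where possible."""
    # Keep any specific categories the node already has.
    specific = [c for c in categories if c not in GENERIC]
    if specific:
        return specific
    # Otherwise try the prefix table, e.g. "DOID:4" -> "DOID".
    prefix = node_id.split(":", 1)[0]
    mapped = PREFIX_TO_CATEGORY.get(prefix)
    if mapped is not None:
        return [mapped]
    # Assumed fallback: a concrete class rather than the mixin.
    return ["biolink:NamedThing"]
```

A prefix table only helps for ontologies whose terms all belong to one category; ontologies that mix several kinds of entity would need per-branch rules instead.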
Completely failed transforms:

| ID | Name | Issue |
|---|---|---|
| NIFSTD | Neuroscience Information Framework (NIF) Standard Ontology | #15 |
| EXACT | An ontology for experimental actions | #15; Small, alpha status, last uploaded 2014 |
| DOID | Human Disease Ontology | #15; Unknown - would really expect this to work |
| ECOCORE | An ontology of core ecological entities | Empty? Error in Bioportal? Last updated Mar 10 2022 |
| ETHIOPIADISEASES | EthiopiaDiseaseList | Empty? Does not render on Bioportal |
| LC-CARRIERS | Library of Congress Carriers Scheme | Empty? In SKOS format; does not render on Bioportal |
| SCDO | Sickle Cell Disease Ontology | #15 |
| FENICS | Functional Epilepsy Nomenclature for Ion Channels | #15; Using webprotege: prefix (unsure if related to transform fail) |
| FOVT | FuTRES Ontology of Vertebrate Traits | #15 |
| PTRANS | Pathogen Transmission Ontology | #15; Does not render on Bioportal |
| TIMEBANK | Timebank Ontology | #15 |
| GSSO | Gender, Sex, and Sexual Orientation Ontology | #15 |
| CST | Cancer Staging Terms | Unknown; Does not render on Bioportal |
| MARC-RELATORS | MARC Code List for Relators | Empty? Does not render on Bioportal |
Transforms translating to Obojson but not to KGX TSV:

| ID | Name | Issue |
|---|---|---|
| PDRO | The Prescription of Drugs Ontology | Unknown CURIE prefix: file |
| VICO | Vaccination Informed Consent Ontology | Unknown CURIE prefix: file |
| IXNO | Interaction Ontology | Last updated in 2011; Unknown CURIE prefix: file |
| IDQA | Image and Data Quality Assessment Ontology | Unknown CURIE prefix: file |
| KTAO | Kidney Tissue Atlas Ontology | Unknown CURIE prefix: file |
| GAZ | Gazetteer | Unknown CURIE prefix: file; KG-OBO transforms GAZ w/o issue, see https://kg-hub.berkeleybop.io/kg-obo/gaz/no_version/ |
| CANONT | Upper-Level Cancer Ontology | Last updated in 2012; Unknown CURIE prefix: file |
These are generally issues with the OBONamespace being set to a local file path; in at least one case (VICO), the cause is a reference to another namespace (GAZ) beginning with `file:`.
See #23 for the `Unknown CURIE prefix: file` issue.
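To see which identifiers trigger the `file` prefix problem in a given Obojson output, a scan along the following lines may help. It is a sketch, not part of the transform pipeline. It assumes the Obojson follows the usual obographs layout (a top-level `graphs` list whose entries have `nodes` with `id` and `edges` with `sub`, `pred`, `obj`), and the function name is hypothetical.

```python
import json


def find_file_prefixed_ids(obojson_text):
    """Return sorted identifiers in an Obojson document that start with 'file:'."""
    doc = json.loads(obojson_text)
    hits = set()
    for graph in doc.get("graphs", []):
        # Node identifiers.
        for node in graph.get("nodes", []):
            node_id = node.get("id", "")
            if node_id.startswith("file:"):
                hits.add(node_id)
        # Subject, predicate, and object of each edge.
        for edge in graph.get("edges", []):
            for key in ("sub", "pred", "obj"):
                value = edge.get(key, "")
                if value.startswith("file:"):
                    hits.add(value)
    return sorted(hits)
```

Calling it on the text of a transformed JSON file, for example `find_file_prefixed_ids(open(path).read())` with `path` pointing at one of the failing outputs, would list the offending identifiers so the namespace responsible can be traced.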
With issues #15 and #23 resolved, the only remaining problematic transforms are:
ECOCORE has a new version on BioPortal - can just use this for now: https://bioportal.bioontology.org/ontologies/ECOCORE/?p=summary
Can drop LC-CARRIERS and MARC-RELATORS.
Hi Harry.
ETHIOPIADISEASES
The latest submission in our system was corrupt. I recreated/reprocessed the submission so that the ontology is accessible again:
https://bioportal.bioontology.org/ontologies/ETHIOPIADISEASES?p=summary
CST
It looks like the end user uploaded an ontology source file for this entry, but we were never able to load the data into the triplestore, because our code errors out when we try to serialize to RDF/XML format with the following error:
org.semanticweb.owlapi.rdf.rdfxml.renderer.IllegalElementNameException: Illegal Element Name (Element Is Not A QName): http://www.w3.org/2000/01/rdf-schema#comment:
I think this one could probably be dropped for now.
Great - thanks @jvendetti !
Hi @caufieldjh. It turns out that the maintainers of ETHIOPIADISEASES told John that they no longer need this entry in BioPortal. I had originally reprocessed it, but I've now deleted the entry.
Great, thanks! One more off the list.
Updated statistics, including for types:
*** General ontology counts:
All processed ontologies: 910
All successful JSON transforms: 906
All successful KGX TSV transforms: 903
All transforms with KGX validation logs: 902
All transforms with ROBOT measure reports: 883
All transforms with ROBOT validation reports: 904
Ontologies with failed transforms:
./transformed/ontologies/ETHIOPIADISEASES
./transformed/ontologies/LC-CARRIERS
./transformed/ontologies/CST
*** Transforms with at least one of the following errors:
MISSING_NODE_PROPERTY 0
MISSING_EDGE_PROPERTY 0
INVALID_NODE_PROPERTY 844
INVALID_EDGE_PROPERTY 807
INVALID_NODE_PROPERTY_VALUE_TYPE 31
INVALID_NODE_PROPERTY_VALUE 844
INVALID_EDGE_PROPERTY_VALUE_TYPE 0
INVALID_EDGE_PROPERTY_VALUE 807
MISSING_CATEGORY 0
INVALID_CATEGORY 902
Category 'OntologyClass' is a mixin in the Biolink Model 902
MISSING_EDGE_PREDICATE 0
INVALID_EDGE_PREDICATE 502
MISSING_NODE_CURIE_PREFIX 0
DUPLICATE_NODE 0
MISSING_NODE 0
INVALID_EDGE_TRIPLE 0
VALIDATION_SYSTEM_ERROR 0
*** Node type counts:
biolink:NamedThing 731
biolink:OntologyClass 903
biolink:BiologicalProcess 76
biolink:Cell 110
biolink:CellularComponent 46
biolink:ChemicalSubstance 119
biolink:Disease 15
biolink:Event 2
biolink:ExposureEvent 3
biolink:Gene 9
biolink:MolecularActivity 49
biolink:NamedThing 731
biolink:OntologyClass 903
biolink:OrganismalEntity 128
biolink:Pathway 6
biolink:PhenotypicFeature 44
biolink:Protein 79
biolink:SequenceFeature 56
biolink:SexQualifier 1
biolink:Source 2
biolink:TaxonomicRank 3
biolink:Unit 2
biolink:AnatomicalEntity 112
*** Edge type counts (i.e., predicate types):
biolink:related_to 376
biolink:subclass_of 899
biolink:part_of 52
biolink:inverseOf 408
biolink:subPropertyOf 449
biolink:has_part 165
biolink:has_participant 99
biolink:has_unit 29
biolink:preceded_by 69
biolink:has_attribute 76
biolink:positively_regulates 35
biolink:negatively_regulates 37
This includes all node types across all ontologies, and a selection of the more common predicate types. Note that these are largely the result of type assignment by KGX. As expected, nodes with biolink:NamedThing or biolink:OntologyClass are ubiquitous, suggesting that many may be re-assigned to more informative types. Though predicate types appear more consistent, there is a long tail of sparsely-used types (not shown) across all ontologies.
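To put the counts above in proportion, they can be expressed as a share of the 903 successful KGX TSV transforms. The sketch below assumes that each count is the number of transforms containing at least one node of that type, which the output suggests but does not state. The function name is hypothetical.

```python
def share_of_transforms(type_counts, total_transforms):
    """Convert per-type transform counts to percentages, largest first."""
    ordered = sorted(type_counts.items(), key=lambda item: -item[1])
    return {
        name: round(100 * count / total_transforms, 1)
        for name, count in ordered
    }


# A few of the node type counts reported above.
node_type_counts = {
    "biolink:OntologyClass": 903,
    "biolink:NamedThing": 731,
    "biolink:OrganismalEntity": 128,
    "biolink:AnatomicalEntity": 112,
}

shares = share_of_transforms(node_type_counts, total_transforms=903)
for name, pct in shares.items():
    print(f"{name}\t{pct}%")
```

Under that assumption, the two generic types appear in most or all transforms, while each of the more specific types appears in well under a fifth of them.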
Closing issue as complete - reopen as needed