Open alexskr opened 1 year ago
It appears that AG doesn't like this line in the source ontology (I tested the ADMIN ontology):
<http://www.semanticweb.org/philshields/ontologies/2015/4/Administrator.owl> <http://purl.org/dc/elements/1.1/date> "Oct 28, 2013 9:03:53 PM"^^<http://www.w3.org/2001/XMLSchema#dateTime> .
The underlying error is:
MALFORMED DATA: `Oct 28, 2013 9:03:53 PM` is not of type date-time
Parsing TCDO, I get a similar error on line:
<http://OntoTCM.org.cn/ontologies/TCDO> <http://www.geneontology.org/formats/oboInOwl#creation_date> "June 6, 2021"^^<http://www.w3.org/2001/XMLSchema#decimal> .
error:
MALFORMED DATA: `June 6, 2021` is not of type xsd:decimal
This issue is related to: ncbo/goo#111
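To make the two failures above concrete, here is a minimal sketch of a lexical check for `xsd:dateTime` and `xsd:decimal` using only the Java standard library. The class name `XsdLexicalCheck` is hypothetical and is not part of BioPortal or AllegroGraph. The patterns follow the lexical forms given in XML Schema Part 2, but they are only an approximation (for example, day-of-month limits are not checked), and AllegroGraph's own validation may differ.

```java
import java.util.regex.Pattern;

/** Illustrative helper (hypothetical name): approximate lexical checks for two XSD datatypes. */
final class XsdLexicalCheck {

    // xsd:dateTime: year-month-day 'T' time, optional fraction, optional timezone.
    // Approximation: does not reject impossible days such as February 30.
    private static final Pattern DATE_TIME = Pattern.compile(
        "-?([1-9][0-9]{3,}|0[0-9]{3})"
        + "-(0[1-9]|1[0-2])"
        + "-(0[1-9]|[12][0-9]|3[01])"
        + "T(([01][0-9]|2[0-3]):[0-5][0-9]:[0-5][0-9](\\.[0-9]+)?|24:00:00(\\.0+)?)"
        + "(Z|[+-]((0[0-9]|1[0-3]):[0-5][0-9]|14:00))?");

    // xsd:decimal: optional sign, digits with an optional fractional part.
    private static final Pattern DECIMAL =
        Pattern.compile("[+-]?([0-9]+(\\.[0-9]*)?|\\.[0-9]+)");

    static boolean isDateTime(String lexical) {
        return DATE_TIME.matcher(lexical).matches();
    }

    static boolean isDecimal(String lexical) {
        return DECIMAL.matcher(lexical).matches();
    }

    public static void main(String[] args) {
        // The two literals reported above are outside the lexical space of their declared types.
        System.out.println(isDateTime("Oct 28, 2013 9:03:53 PM")); // false
        System.out.println(isDecimal("June 6, 2021"));             // false
        // The same ADMIN timestamp written in xsd:dateTime form passes the check.
        System.out.println(isDateTime("2013-10-28T21:03:53"));     // true
    }
}
```

A check like this could be run over typed literals before submission, so that ill-typed values are reported with their subject and predicate rather than as a failed upload.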
As I said I would, I took a look at the ontologies listed above, seeing if there were any obvious patterns and how much usage they were getting.
A few 'summary statistics': They are all (!) public. A lot of last uploads in 2015, some in 2017 and 2022, and a smattering of other years (oddly, no 2018 last uploads). The average number of submissions is around 3-4, but there are two humps in the distribution: maybe 25% have 1 submission, and maybe 50% have 5 or 6. No other obvious patterns. Most of them look well-maintained in that they have descriptions and status. Most have a significant number of classes, none have < 25 classes.
Obviously deletable: Syrian Movies Ontology. (If they've started to adopt BioPortal, lord help us: MMI ORR gets about 100-250 uploads every fall and spring quarter from a class taught at the Syrian Virtual University. We just make them private; they haven't seemed threatening in any way, it's honest education.) Strongly suggest we stay on top of those.
Possible deletion: https://bioportal.bioontology.org/ontologies/MAMOVIEW (Math Ontology View)
The rest are meaningful, many are maintained, and maybe 25% are valuable or likely so.
Proposed draft email to send to each user:
Hello,
As maintainers of BioPortal, we would like to inform you of changes we will be making that will impact your ontology's representation in BioPortal. The change will affect the following ontology(ies):
Regarding how to validate ontologies and catch this error, Chris M writes:
This is not straightforward as there are a lot of different classes of error there…
for things like illegal IRIs the owlapi can actually be too permissive which makes Protege or ROBOT not great first lines of defense… (Chris Mungall)
See [#691 Add a validation check that the RDF/XML parses using Rust RDF/XML parser](https://github.com/INCATools/ontology-development-kit/issues/691):
Not all RDF/XML parsers behave the same way
I have seen cases where Jena is stricter than OWLAPI, and where the Rust parser is stricter still.
E.g., pipe symbols in URIs: [monarch-initiative/vertebrate-breed-ontology#51](https://github.com/monarch-initiative/vertebrate-breed-ontology/issues/51)
We should ensure that RDF/XML produced is consumable by the union of all production-level parsers.
I believe fastobo wraps the Rust RDF/XML parser but fastobo itself may impose an additional level of strictness
I think rdftab is distributed with ODK so that could also be used to check.
if you want to stay in the java universe the consensus is to also throw in a quick check with jena in strict mode but that will only find things that are syntactically wrong with the RDF that is not caught by the owlapi
things like checking if the range of owl:deprecated is a boolean is another kettle of fish
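As an illustration of the pipe-symbol case mentioned above, the following sketch flags characters that the N-Triples and Turtle grammars exclude from an `IRIREF` (control characters and space, plus `<`, `>`, `"`, `{`, `}`, `|`, `^`, the backtick, and the backslash). The class name `IriCharCheck` and the example IRIs are hypothetical. This is not what OWLAPI, Jena, or the Rust parser do internally; each applies its own rules, which is the point of the discussion.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative helper (hypothetical name): finds characters not allowed in an N-Triples/Turtle IRIREF. */
final class IriCharCheck {

    // Characters excluded by the IRIREF production, in addition to U+0000..U+0020.
    private static final String FORBIDDEN = "<>\"{}|^`\\";

    /** Returns the offending characters in order of appearance; an empty list means none were found. */
    static List<Character> illegalChars(String iri) {
        List<Character> found = new ArrayList<>();
        for (char c : iri.toCharArray()) {
            if (c <= 0x20 || FORBIDDEN.indexOf(c) >= 0) {
                found.add(c);
            }
        }
        return found;
    }

    public static void main(String[] args) {
        // Hypothetical IRIs, used only to show the check.
        System.out.println(illegalChars("http://example.org/breed|123")); // [|]
        System.out.println(illegalChars("http://example.org/breed_123")); // []
    }
}
```

A character-level check like this only covers one class of error. Datatype problems such as the range of owl:deprecated need a separate check.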
AllegroGraph flagged 2022-10-06T12:00:00.000-05:00 as non-compliant with the date-time datatype; however, according to https://www.w3.org/TR/xmlschema-2/#dateTime-timezones it should be valid.
Error in BIBFRAME ontology:
Error sending data to triple store - 400 RestClient::BadRequest: MALFORMED DATA: \n 2022-10-06T12:00:00.000-05:00\n
is not of type date-time"]
The BIBFRAME RDF file contains the following:
<dcterms:issued rdf:datatype="http://www.w3.org/2001/XMLSchema#dateTime">
2022-10-06T12:00:00.000-05:00
</dcterms:issued>
<dcterms:modified rdf:datatype="http://www.w3.org/2001/XMLSchema#dateTime">
2022-10-18T15:48:05.075-04:00
</dcterms:modified>
Perhaps the problem is with the newlines surrounding the value.
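To check the newline hypothesis outside the triple store, the sketch below parses the `dcterms:issued` element with the JDK's DOM parser and compares the raw text content with its trimmed form. The class name `LiteralWhitespaceDemo` is hypothetical, and the `dcterms` namespace binding is assumed to be the conventional `http://purl.org/dc/terms/`, since the excerpt does not show it. It only shows what an RDF/XML consumer would see as the literal's lexical form; whether AllegroGraph rejects the value for this reason remains a guess.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

/** Illustrative helper (hypothetical name): extracts the raw text of an XML element. */
final class LiteralWhitespaceDemo {

    /** Returns the untrimmed text content of the document's root element. */
    static String rawLiteral(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            Document doc = factory.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            return doc.getDocumentElement().getTextContent();
        } catch (Exception e) {
            throw new IllegalStateException("could not parse XML snippet", e);
        }
    }

    public static void main(String[] args) {
        // Same layout as the BIBFRAME excerpt: the value sits on its own line.
        // The dcterms namespace URI is an assumption (not shown in the excerpt).
        String xml =
            "<dcterms:issued xmlns:dcterms=\"http://purl.org/dc/terms/\"\n"
            + "    xmlns:rdf=\"http://www.w3.org/1999/02/22-rdf-syntax-ns#\"\n"
            + "    rdf:datatype=\"http://www.w3.org/2001/XMLSchema#dateTime\">\n"
            + "2022-10-06T12:00:00.000-05:00\n"
            + "</dcterms:issued>";

        String raw = rawLiteral(xml);
        System.out.println(raw.equals(raw.trim())); // false: the value carries surrounding newlines
        System.out.println(raw.trim());             // 2022-10-06T12:00:00.000-05:00
    }
}
```

If the hypothesis holds, writing the value on the same line as the opening and closing tags in the source file should be enough for a strict store to accept it.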
Simple unit tests written outside of the BioPortal stack show no errors loading the BIBFRAME ontology:
// OWL API
@Test
public void testLoad_BIBFRAME_Ontology() throws Exception {
    File file = new File("src/test/resources/bibframe.rdf");
    FileDocumentSource fileDocumentSource = new FileDocumentSource(file);
    OWLOntologyManager manager = OWLManager.createOWLOntologyManager();
    OWLOntology ontology = manager.loadOntologyFromOntologyDocument(fileDocumentSource);
    assertNotNull(ontology);
    System.out.println(ontology.getOntologyID());
    System.out.println(ontology.getAxiomCount());
}
// Output:
// OntologyID(OntologyIRI(<http://id.loc.gov/ontologies/bibframe/>) VersionIRI(<http://id.loc.gov/ontologies/bibframe-2-2-0/>))
// 2468
// Jena
@Test
public void testLoadBibFrameOntology() {
    String inputFileName = "src/test/resources/bibframe.rdf";
    Model model = ModelFactory.createDefaultModel();
    InputStream in = RDFDataMgr.open(inputFileName);
    if (in == null) {
        throw new IllegalArgumentException("File: " + inputFileName + " not found");
    }
    model.read(in, null);
    System.out.println(model.size());
}
// Output:
// 2502
I suspect there may be some issue here with newline characters in the values.
Opening BIBFRAME in Protege I think helps to demonstrate the issue with newlines. The display shows the ontology-level annotations in red, and if you open the edit dialog for any of those annotations, you can see the newline characters.
We have a number of ontologies that are failing to process with the AG backend.
Ontologies:
ADALAB ADALAB-META ADMIN BIBFRAME BIM BIN CMECS2012 DPCO EMO EO FO FOODON GBM HIVO004 HNS HO IDOBRU ISO-FOOD KISAO LDA MAMO MAMOVIEW MRO MWS22 NANDO NOMEN NXDX OF ONTONEO ONTONEO-DOC PCALION PDO PMDO SIMON SMASH SMASHBIOMARKER SMASHPHYSICAL SMASHSOCIAL SMO TCDO XLMOD