ncbo / bioportal-project

Serves to consolidate (in Zenhub) all public issues in BioPortal
BSD 2-Clause "Simplified" License

AG: failures loading ontologies with "RestClient::BadRequest: 400 Bad Request: MALFORMED DATA: ___ is not of type ___" #253

Open alexskr opened 1 year ago

alexskr commented 1 year ago

We have a number of ontologies that are failing to process with the AG backend:

 ERROR -- : ["Error sending data to triple store - 400 RestClient::BadRequest: MALFORMED DATA: `28-06-2017` is not of type date"]
E, [2022-08-25T01:12:48.780662 #4829] ERROR -- : ["RestClient::BadRequest: 400 Bad Request\n/srv/ncbo/ncbo_cron_ag/vendor/bundle/ruby/2.7.0/gems/rest-client-2.1.0/lib/restclient/abstract_response.rb:249:in `exception_with_response'\n\t/srv/ncbo/ncbo_cron_ag/vendor/bundle/ruby/2.7.0/gems/rest-client-2.1.0/lib/restclient/abstract_response.rb:129:in `return!'\n\t/srv/ncbo/ncbo_cron_ag/vendor/bundle/ruby/2.7.0/gems/rest-client-2.1.0/lib/restclient/request.rb:836:in `process_result'\n\t/srv/ncbo/ncbo_cron_ag/vendor/bundle/ruby/2.7.0/gems/rest-client-2.1.0/lib/restclient/request.rb:743:in `block in transmit'\n\t/usr/local/rbenv/versions/2.7.6/lib/ruby/2.7.0/net/http.rb:933:in `start'\n\t/srv/ncbo/ncbo_cron_ag/vendor/bundle/ruby/2.7.0/gems/rest-client-2.1.0/lib/restclient/request.rb:727:in `transmit'\n\t/srv/ncbo/ncbo_cron_ag/vendor/bundle/ruby/2.7.0/gems/rest-client-2.1.0/lib/restclient/request.rb:163:in `execute'\n\t/srv/ncbo/ncbo_cron_ag/vendor/bundle/ruby/2.7.0/gems/rest-client-2.1.0/lib/restclient/request.rb:63:in `execute'\n\t/srv/ncbo/ncbo_cron_ag/vendor/bundle/ruby/2.7.0/bundler/gems/goo-562826ba21f7/lib/goo/sparql/client.rb:116:in `append_triples_no_bnodes'\n\t/srv/ncbo/ncbo_cron_ag/vendor/bundle/ruby/2.7.0/bundler/gems/goo-562826ba21f7/lib/goo/sparql/client.rb:141:in `put_triples'\n\t/srv/ncbo/ncbo_cron_ag/vendor/bundle/ruby/2.7.0/bundler/gems/ontologies_linked_data-b5057b491f4c/lib/ontologies_linked_data/models/ontology_submission.rb:1543:in `delete_and_append'\n\t/srv/ncbo/ncbo_cron_ag/vendor/bundle/ruby/2.7.0/bundler/gems/ontologies_linked_data-b5057b491f4c/lib/ontologies_linked_data/models/ontology_submission.rb:482:in `generate_rdf'\n\t/srv/ncbo/ncbo_cron_ag/vendor/bundle/ruby/2.7.0/bundler/gems/ontologies_linked_data-b5057b491f4c/lib/ontologies_linked_data/models/ontology_submission.rb:980:in `process_submission'\n\t/srv/ncbo/ncbo_cron_ag/lib/ncbo_cron/ontology_submission_parser.rb:177:in `process_submission'\n\tbin/ncbo_ontology_process:98:in `block in 
<top (required)>'\n\tbin/ncbo_ontology_process:81:in `each'\n\tbin/ncbo_ontology_process:81:in `<top (required)>'\n\t/usr/local/rbenv/versions/2.7.6/lib/ruby/gems/2.7.0/gems/bundler-2.3.13/lib/bundler/cli/exec.rb:58:in `load'\n\t/usr/local/rbenv/versions/2.7.6/lib/ruby/gems/2.7.0/gems/bundler-2.3.13/lib/bundler/cli/exec.rb:58:in `kernel_load'\n\t/usr/local/rbenv/versions/2.7.6/lib/ruby/gems/2.7.0/gems/bundler-2.3.13/lib/bundler/cli/exec.rb:23:in `run'\n\t/usr/local/rbenv/versions/2.7.6/lib/ruby/gems/2.7.0/gems/bundler-2.3.13/lib/bundler/cli.rb:483:in `exec'\n\t/usr/local/rbenv/versions/2.7.6/lib/ruby/gems/2.7.0/gems/bundler-2.3.13/lib/bundler/vendor/thor/lib/thor/command.rb:27:in `run'\n\t/usr/local/rbenv/versions/2.7.6/lib/ruby/gems/2.7.0/gems/bundler-2.3.13/lib/bundler/vendor/thor/lib/thor/invocation.rb:127:in `invoke_command'\n\t/usr/local/rbenv/versions/2.7.6/lib/ruby/gems/2.7.0/gems/bundler-2.3.13/lib/bundler/vendor/thor/lib/thor.rb:392:in `dispatch'\n\t/usr/local/rbenv/versions/2.7.6/lib/ruby/gems/2.7.0/gems/bundler-2.3.13/lib/bundler/cli.rb:31:in `dispatch'\n\t/usr/local/rbenv/versions/2.7.6/lib/ruby/gems/2.7.0/gems/bundler-2.3.13/lib/bundler/vendor/thor/lib/thor/base.rb:485:in `start'\n\t/usr/local/rbenv/versions/2.7.6/lib/ruby/gems/2.7.0/gems/bundler-2.3.13/lib/bundler/cli.rb:25:in `start'\n\t/usr/local/rbenv/versions/2.7.6/lib/ruby/gems/2.7.0/gems/bundler-2.3.13/exe/bundle:48:in `block in <top (required)>'\n\t/usr/local/rbenv/versions/2.7.6/lib/ruby/gems/2.7.0/gems/bundler-2.3.13/lib/bundler/friendly_errors.rb:103:in `with_friendly_errors'\n\t/usr/local/rbenv/versions/2.7.6/lib/ruby/gems/2.7.0/gems/bundler-2.3.13/exe/bundle:36:in `<top (required)>'\n\t/usr/local/rbenv/versions/2.7.6/bin/bundle:23:in `load'\n\t/usr/local/rbenv/versions/2.7.6/bin/bundle:23:in `<main>'"]

Ontologies:

ADALAB ADALAB-META ADMIN BIBFRAME BIM BIN CMECS2012 DPCO EMO EO FO FOODON GBM HIVO004 HNS HO IDOBRU ISO-FOOD KISAO LDA MAMO MAMOVIEW MRO MWS22 NANDO NOMEN NXDX OF ONTONEO ONTONEO-DOC PCALION PDO PMDO SIMON SMASH SMASHBIOMARKER SMASHPHYSICAL SMASHSOCIAL SMO TCDO XLMOD

mdorf commented 1 year ago

It appears that AG doesn't like this line in the source ontology (I tested the ADMIN ontology):

<http://www.semanticweb.org/philshields/ontologies/2015/4/Administrator.owl> <http://purl.org/dc/elements/1.1/date> "Oct 28, 2013 9:03:53 PM"^^<http://www.w3.org/2001/XMLSchema#dateTime> .

The underlying error is:

MALFORMED DATA: `Oct 28, 2013 9:03:53 PM` is not of type date-time
mdorf commented 1 year ago

Parsing TCDO, I get a similar error on line:

<http://OntoTCM.org.cn/ontologies/TCDO> <http://www.geneontology.org/formats/oboInOwl#creation_date> "June 6, 2021"^^<http://www.w3.org/2001/XMLSchema#decimal> .

error:

MALFORMED DATA: `June 6, 2021` is not of type xsd:decimal
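
Both failures have the same shape: the literal's lexical form does not match the XSD datatype it is tagged with. A minimal sketch of that kind of check follows, using only JDK regular expressions. The class `XsdLexicalCheck` and its method names are hypothetical, the patterns are simplified (for example, they do not check days per month or accept `24:00:00`), and nothing here is meant to describe how AllegroGraph itself validates literals.

```java
import java.util.regex.Pattern;

// Hypothetical helper for illustration; not part of BioPortal, goo, or AllegroGraph.
// Simplified lexical checks for three XSD datatypes seen in the errors above.
final class XsdLexicalCheck {
    // Optional timezone: "Z" or an offset between -14:00 and +14:00.
    private static final String TZ = "(Z|[+-](?:(?:0\\d|1[0-3]):[0-5]\\d|14:00))?";
    // Year (4+ digits), month 01-12, day 01-31 (no per-month range check).
    private static final String DATE = "-?\\d{4,}-(?:0[1-9]|1[0-2])-(?:0[1-9]|[12]\\d|3[01])";
    // Hours 00-23, minutes, seconds, optional fractional seconds.
    private static final String TIME = "(?:[01]\\d|2[0-3]):[0-5]\\d:[0-5]\\d(?:\\.\\d+)?";

    private static final Pattern DATE_TIME = Pattern.compile(DATE + "T" + TIME + TZ);
    private static final Pattern DATE_ONLY = Pattern.compile(DATE + TZ);
    private static final Pattern DECIMAL = Pattern.compile("[+-]?(?:\\d+(?:\\.\\d*)?|\\.\\d+)");

    static boolean isDateTime(String lexical) {
        return DATE_TIME.matcher(lexical).matches();
    }

    static boolean isDate(String lexical) {
        return DATE_ONLY.matcher(lexical).matches();
    }

    static boolean isDecimal(String lexical) {
        return DECIMAL.matcher(lexical).matches();
    }

    public static void main(String[] args) {
        // Values reported in this issue, each followed by a conforming rewrite.
        System.out.println(isDateTime("Oct 28, 2013 9:03:53 PM")); // false
        System.out.println(isDateTime("2013-10-28T21:03:53"));     // true
        System.out.println(isDate("28-06-2017"));                  // false
        System.out.println(isDate("2017-06-28"));                  // true
        System.out.println(isDecimal("June 6, 2021"));             // false
    }
}
```

Note that `June 6, 2021` cannot be repaired by reformatting alone: the value is a date, so the datatype on that statement (`xsd:decimal`) would also need to change.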
mdorf commented 1 year ago

This issue is related to: ncbo/goo#111

graybeal commented 1 year ago

As I said I would, I took a look at the ontologies listed above, seeing if there were any obvious patterns and how much usage they were getting.

A few 'summary statistics': They are all (!) public. A lot of last uploads in 2015, some in 2017 and 2022, and a smattering of other years (oddly, no 2018 last uploads). The average number of submissions is around 3-4, but there are two humps in the distribution: maybe 25% have 1 submission, and maybe 50% have 5 or 6. No other obvious patterns. Most of them look well-maintained in that they have descriptions and status. Most have a significant number of classes, none have < 25 classes.

Obviously deletable: Syrian Movies Ontology. (If they've started to adopt BioPortal, lord help us: MMI ORR gets about 100-250 uploads every fall and spring quarter from a class taught at the Syrian Virtual University. We just make them private; they haven't seemed threatening in any way, it's honest education.) Strongly suggest we stay on top of those.

Possible deletion:

The rest are meaningful, many are maintained, and maybe 25% are valuable or likely so.

alexskr commented 1 year ago

https://bioportal.bioontology.org/ontologies/MAMOVIEW (Math Ontology View)

graybeal commented 1 year ago

Proposed draft email to send to each user:

Hello,

As maintainers of BioPortal, we would like to inform you of changes we will be making that will impact your ontology's representation in BioPortal. The change will affect the following ontology(ies):

In the near future we will be converting our semantic store from 4store to AllegroGraph. When this happens, your ontology(ies) as they currently exist will no longer be accepted by AllegroGraph, because they have one or more statements that are syntactically incorrect. An example of such a statement is

> "Oct 28, 2013 9:03:53 PM"^^<http://www.w3.org/2001/XMLSchema#dateTime> .

which gives the error

> MALFORMED DATA: `Oct 28, 2013 9:03:53 PM` is not of type date-time

You can detect such statements by using a tool like ?? to validate your ontology, and correct any errors before submitting it to BioPortal. For example, the following command will detect these errors:

> INSERT COMMAND HERE (see next comment)

If you do nothing, when we transition to AllegroGraph your ontology's contents will no longer be searchable or available in the BioPortal system. After a period of time (not less than one month), we will delete the ontology if it is not updated. Please contact us if you need assistance with submitting your updated ontology.
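
The draft leaves the validation tool and command open. Purely as an illustration of what such a check might do, and not as a proposal for the missing command, here is a rough sketch that scans N-Triples lines for typed literals and reports those whose value does not fit the declared datatype. The class `TypedLiteralScan`, its method, and the simplified patterns are all assumptions made for this example; only three datatypes are covered, and a real RDF parser would be needed for anything beyond line-oriented N-Triples.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical scanner for illustration; not an existing BioPortal or AllegroGraph tool.
final class TypedLiteralScan {
    private static final String XSD = "http://www.w3.org/2001/XMLSchema#";

    // Matches "lexical form"^^<datatype IRI> within one N-Triples line.
    private static final Pattern TYPED_LITERAL =
        Pattern.compile("\"((?:[^\"\\\\]|\\\\.)*)\"\\^\\^<([^>]+)>");

    // Simplified lexical patterns (field ranges are not checked).
    private static final Map<String, Pattern> LEXICAL = Map.of(
        XSD + "dateTime",
        Pattern.compile("-?\\d{4,}-\\d{2}-\\d{2}T\\d{2}:\\d{2}:\\d{2}(\\.\\d+)?(Z|[+-]\\d{2}:\\d{2})?"),
        XSD + "date",
        Pattern.compile("-?\\d{4,}-\\d{2}-\\d{2}(Z|[+-]\\d{2}:\\d{2})?"),
        XSD + "decimal",
        Pattern.compile("[+-]?(\\d+(\\.\\d*)?|\\.\\d+)"));

    // Returns one message per typed literal that fails its datatype's pattern.
    static List<String> findMalformed(List<String> ntriplesLines) {
        List<String> problems = new ArrayList<>();
        for (int i = 0; i < ntriplesLines.size(); i++) {
            Matcher m = TYPED_LITERAL.matcher(ntriplesLines.get(i));
            while (m.find()) {
                Pattern expected = LEXICAL.get(m.group(2));
                // Datatypes without a pattern here are skipped, not flagged.
                if (expected != null && !expected.matcher(m.group(1)).matches()) {
                    problems.add("line " + (i + 1) + ": \"" + m.group(1)
                        + "\" does not match " + m.group(2));
                }
            }
        }
        return problems;
    }

    public static void main(String[] args) {
        List<String> lines = List.of(
            "<http://www.semanticweb.org/philshields/ontologies/2015/4/Administrator.owl> "
                + "<http://purl.org/dc/elements/1.1/date> "
                + "\"Oct 28, 2013 9:03:53 PM\"^^<http://www.w3.org/2001/XMLSchema#dateTime> .",
            "<http://OntoTCM.org.cn/ontologies/TCDO> "
                + "<http://www.geneontology.org/formats/oboInOwl#creation_date> "
                + "\"June 6, 2021\"^^<http://www.w3.org/2001/XMLSchema#decimal> .");
        findMalformed(lines).forEach(System.out::println);
    }
}
```

Run against the two statements quoted earlier in this thread, the sketch reports both lines.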
graybeal commented 1 year ago

Regarding how to validate ontologies and catch this error, Chris M writes:

This is not straightforward as there are a lot of different classes of error there…

for things like illegal IRIs the owlapi can actually be too permissive which makes Protege or ROBOT not great first lines of defense… (Chris Mungall)

See https://github.com/INCATools/ontology-development-kit/issues/691

[#691 Add a validation check that the RDF/XML parses using Rust RDF/XML parser](https://github.com/INCATools/ontology-development-kit/issues/691)
Not all RDF/XML parsers behave the same way
I have seen cases where Jena is stricter than OWLAPI, and where the Rust parser is stricter still.
E.g pipe symbols in URIs: [monarch-initiative/vertebrate-breed-ontology#51](https://github.com/monarch-initiative/vertebrate-breed-ontology/issues/51)
We should ensure that RDF/XML produced is consumable by the union of all production-level parsers.
I believe fastobo wraps the Rust RDF/XML parser but fastobo itself may impose an additional level of strictness
I think rdftab is distributed with ODK so that could also be used to check.

if you want to stay in the java universe the consensus is to also throw in a quick check with jena in strict mode but that will only find things that are syntactically wrong with the RDF and are not caught by the owlapi

things like checking if the range of owl:deprecated is a boolean is another kettle of fish

alexskr commented 7 months ago

AllegroGraph flagged 2022-10-06T12:00:00.000-05:00 as non-compliant with the date-time datatype; however, according to https://www.w3.org/TR/xmlschema-2/#dateTime-timezones it should be valid.

Error in BIBFRAME ontology: Error sending data to triple store - 400 RestClient::BadRequest: MALFORMED DATA: \n 2022-10-06T12:00:00.000-05:00\n is not of type date-time"]

The BIBFRAME RDF file contains the following:

  <dcterms:issued rdf:datatype="http://www.w3.org/2001/XMLSchema#dateTime">
   2022-10-06T12:00:00.000-05:00
  </dcterms:issued>
  <dcterms:modified rdf:datatype="http://www.w3.org/2001/XMLSchema#dateTime">
   2022-10-18T15:48:05.075-04:00
  </dcterms:modified>

Perhaps the problem is with the newlines around the values.

jvendetti commented 7 months ago

Simple unit tests written outside of the BioPortal stack show no errors loading the BIBFRAME ontology:

// OWL API
@Test
public void testLoad_BIBFRAME_Ontology() throws Exception {
    File file = new File("src/test/resources/bibframe.rdf");
    FileDocumentSource fileDocumentSource = new FileDocumentSource(file);
    OWLOntologyManager manager = OWLManager.createOWLOntologyManager();
    OWLOntology ontology = manager.loadOntologyFromOntologyDocument(fileDocumentSource);
    assertNotNull(ontology);
    System.out.println(ontology.getOntologyID());
    System.out.println(ontology.getAxiomCount());
}
// Output:
// OntologyID(OntologyIRI(<http://id.loc.gov/ontologies/bibframe/>) VersionIRI(<http://id.loc.gov/ontologies/bibframe-2-2-0/>))
// 2468

// Jena
@Test
public void testLoadBibFrameOntology() {
    String inputFileName = "src/test/resources/bibframe.rdf";
    Model model = ModelFactory.createDefaultModel();
    InputStream in = RDFDataMgr.open(inputFileName);
    if (in == null) {
        throw new IllegalArgumentException("File: " + inputFileName + " not found");
    }
    model.read(in, null);
    System.out.println(model.size());
}
// Output:
// 2502

I suspect there may be some issue here with newline characters in the values.
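
To make the newline hypothesis concrete: in the RDF/XML above, the element text includes the line breaks and indentation around the timestamp, so the literal's lexical form is not just the timestamp. XML Schema defines whitespace handling for `xsd:dateTime` as "collapse", but whether a given parser or triple store applies that before checking the value is implementation-dependent, and this thread does not establish what AllegroGraph does. The sketch below, with a hypothetical class name and a simplified pattern, only shows that a strict match fails on the raw text and succeeds once the whitespace is collapsed.

```java
import java.util.regex.Pattern;

// Hypothetical illustration; not taken from AllegroGraph, OWL API, or Jena.
final class WhitespaceLiteralDemo {
    // Simplified xsd:dateTime lexical pattern (field ranges are not checked).
    private static final Pattern DATE_TIME = Pattern.compile(
        "-?\\d{4,}-\\d{2}-\\d{2}T\\d{2}:\\d{2}:\\d{2}(\\.\\d+)?(Z|[+-]\\d{2}:\\d{2})?");

    // Strict check: the whole string must be a dateTime, with no padding.
    static boolean isDateTime(String lexical) {
        return DATE_TIME.matcher(lexical).matches();
    }

    // XML Schema "collapse": runs of space, tab, LF and CR become one space,
    // then leading and trailing spaces are removed.
    static String collapse(String raw) {
        return raw.replaceAll("[ \\t\\n\\r]+", " ").trim();
    }

    public static void main(String[] args) {
        // Element text as it appears in the BIBFRAME snippet, newlines included.
        String raw = "\n   2022-10-06T12:00:00.000-05:00\n  ";
        System.out.println(isDateTime(raw));           // false
        System.out.println(isDateTime(collapse(raw))); // true
    }
}
```

If this is indeed the cause, either the source file or the loading pipeline would need to strip the surrounding whitespace before the triples reach the store.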

jvendetti commented 7 months ago

Opening BIBFRAME in Protege I think helps to demonstrate the issue with newlines. The display shows the ontology-level annotations in red, and if you open the edit dialog for any of those annotations, you can see the newline characters:

[Screenshot, 2023-11-17 at 11:41:55 AM]

[Screenshot, 2023-11-17 at 11:42:04 AM]