owlcs / owlapi

OWL API main repository

Removed data types in 4.5.22 #1063

Closed hkir-dev closed 1 year ago

hkir-dev commented 2 years ago

I am trying to migrate to owlapi 4.5.22, but it seems it is not preserving the declared datatypes.

I have a simple ontology:

<?xml version="1.0"?>
<rdf:RDF xmlns="http://example.com#"
     xml:base="http://example.com"
     xmlns:xsd="http://www.w3.org/2001/XMLSchema#"
     xmlns:owl="http://www.w3.org/2002/07/owl#"
     xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
     xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#">
    <owl:Ontology rdf:about="https://github.com/owlapi/foo.owl"/>

    <owl:Class rdf:about="https://github.com/owlapi/foo.owl#test">
        <rdfs:label rdf:datatype="http://www.w3.org/2001/XMLSchema#string">test</rdfs:label>
    </owl:Class>

</rdf:RDF>

When I load the ontology and save it back to a file, the datatype of the label is removed:

<rdfs:label>test</rdfs:label>

With owlapi 4.5.21, the datatype is still there:

<rdfs:label rdf:datatype="http://www.w3.org/2001/XMLSchema#string">test</rdfs:label>

I used the following code for the read/write operation:

      // Load the ontology from the RDF/XML input file.
      OWLOntologyManager manager = OWLManager.createOWLOntologyManager();
      OWLOntology loadedOntology = manager.loadOntologyFromOntologyDocument(
          new FileDocumentSource(new File("./src/main/resources/edit2.owl")));

      // Save it back as RDF/XML to a separate output file.
      manager.saveOntology(loadedOntology, new RDFXMLDocumentFormat(),
          new StreamDocumentTarget(new FileOutputStream("./src/main/resources/edit2_out.owl")));

ignazio1977 commented 2 years ago

The datatype for both outputs is still xsd:string; it's optional in the output since it's the default if no datatype is present. This was a cherry-pick of the fix for a bug reported against version 5 a couple of years ago, which had not been included correctly in version 4.

Do you need the datatype explicitly in the XML?

hkir-dev commented 2 years ago

Yes, we are doing some diff operations, so we want to preserve the input ontology's representation. Is there a way in 4.5.22 to preserve datatypes where they are declared and keep plain literals (without types) plain? I think this was the default behavior in 4.5.21.

ignazio1977 commented 2 years ago

Ironically, preserving diffs was also the reason for the previous bug report. This fix was to amend a previous regression that was affecting diffs in some ontologies. I'm not sure whether a mechanism is already available; if not, I'll add it.

matentzn commented 2 years ago

Hey @ignazio1977 :) Can I make sure I understand: are unnecessary (redundant) datatypes being stripped in 4.5.22 by accident or by design? I think the correct way would be to preserve whatever the input says: a load-save roundtrip should not create any diff, except in rare circumstances.

ignazio1977 commented 2 years ago

Preserving the input when the input has variability (order, explicit default values, redundant axioms such as duplicates...) is not doable without keeping the input as is, especially across versions :-( The API isn't updating the file in place and changing it only where needed; it's producing a new file. To keep all the information needed to always replicate these variations, we'd need to parse the input and keep a copy of it around, to be able to save with minimal changes. That's a big requirement in itself, and it's not possible to reduce variability and minimise deltas at the same time when picking two versions of the code at random.

It all goes back to the fact that the file isn't the ontology, just a representation of it, and the ontology in memory is yet another representation.

ignazio1977 commented 2 years ago

@matentzn this was the fix for the random changes you reported a while ago :-D fixing the OWLAPI output variations is like washing two Labradors in a bathtub. Or picking a fight with a hog.

matentzn commented 2 years ago

I am cool with whatever design decision makes sense. Basically, if we all moved to 4.5.22, we would see a bunch of diffs for a while until all files are updated, and then everything would stabilise. But if, say, a tool wants xsd:string DTs on a label for some reason, the decision is now that it's dropped because it is redundant, right? Do I understand this correctly?

matentzn commented 2 years ago

cc @matthewhorridge @cmungall @balhoff

ignazio1977 commented 2 years ago

The fix whose effects we are discussing is the fix for #640. The commit comment, I think, captures the intent:

Most syntaxes do not require xsd:string to be outputted explicitly for literals without language tags.

Plain literals need a language tag, possibly empty; langString values need a non empty language tag; literals missing a language tag are equivalent to xsd:string typed literals, hence no need for an explicit typing.

So literals like the ones above are specified to be of type xsd:string regardless of the presence of the datatype in the surface syntax. Removing it from the output is something where the implementation has gone back and forth, hence the surprises that have appeared a few times, most recently in #1061 and #1004. #1004 was the one that caused us much head-scratching because it was difficult to replicate the behaviour exactly. In that case the input was already in the form that 4.5.22 outputs, i.e., without explicit xsd:string, and (part of) the surprise was xsd:string appearing, a symptom of #640 not having been applied to the version 4 branch. It was only with the #1061 examples that I realised #640 had been fixed only on versions 5 and 6, not 4.
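
The rule described above can be sketched in a few lines of plain Java. This is only an illustration of the equivalence, assuming a hypothetical `LiteralSketch` class that is not part of OWLAPI: the datatype is resolved when the literal is built, so the two surface forms from the original report compare as equal. The sketch does not validate inputs (for example, it silently ignores a datatype supplied together with a language tag).

```java
import java.util.Objects;

/** Hypothetical model of an RDF literal for illustration; not an OWLAPI class. */
final class LiteralSketch {
    static final String XSD_STRING = "http://www.w3.org/2001/XMLSchema#string";
    static final String RDF_LANG_STRING =
        "http://www.w3.org/1999/02/22-rdf-syntax-ns#langString";

    final String lexicalForm;
    final String language;  // "" when there is no language tag
    final String datatype;  // always resolved, never null

    /** datatypeInSyntax may be null when the surface syntax omits rdf:datatype. */
    LiteralSketch(String lexicalForm, String language, String datatypeInSyntax) {
        this.lexicalForm = lexicalForm;
        this.language = language == null ? "" : language;
        if (!this.language.isEmpty()) {
            // A non-empty language tag means the literal is a langString.
            this.datatype = RDF_LANG_STRING;
        } else if (datatypeInSyntax == null) {
            // No language tag and no datatype: the default is xsd:string.
            this.datatype = XSD_STRING;
        } else {
            this.datatype = datatypeInSyntax;
        }
    }

    @Override
    public boolean equals(Object other) {
        if (!(other instanceof LiteralSketch)) {
            return false;
        }
        LiteralSketch that = (LiteralSketch) other;
        return lexicalForm.equals(that.lexicalForm)
            && language.equals(that.language)
            && datatype.equals(that.datatype);
    }

    @Override
    public int hashCode() {
        return Objects.hash(lexicalForm, language, datatype);
    }

    public static void main(String[] args) {
        LiteralSketch explicit = new LiteralSketch("test", null, XSD_STRING);
        LiteralSketch implicit = new LiteralSketch("test", null, null);
        System.out.println(explicit.equals(implicit)); // prints "true"
    }
}
```

Under this model the serialiser is free to write either form, which is why the choice shows up only in the file and not in the ontology.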

I believe here we're discussing, "what if a tool wants or needs that explicit value?", and the response I have in mind is that we add a configuration option to make it explicit, so that users who don't need it get shorter files and users who do have a way to get what they need.

Ideally one could minimise deltas within an existing repo with some clever rewriting. For example, when making a change, create two commits: one with the ontology written pre-change with the current OWLAPI version, and one post-change. The first commit would differ from the original file only in formatting details due to any OWLAPI changes, and the second would only have the semantic changes applied by the user; if the first commit turns out to be empty, one skips it. I don't have enough knowledge of the git APIs to implement something like this, but I think it'd be a very interesting project.
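
The decision logic of that two-commit idea can be sketched without touching git at all. The class and method names below are hypothetical, and the three strings are assumed to be produced elsewhere (the committed file, the unchanged ontology re-saved with the current OWLAPI version, and the edited ontology saved with that same version). Skipping the second commit when nothing changed is an extra assumption beyond what is described above.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper that decides which commits the two-step workflow needs. */
final class CommitPlanSketch {

    /**
     * @param onDisk        the file content currently committed
     * @param resavedAsIs   the unchanged ontology written with the current OWLAPI version
     * @param resavedEdited the edited ontology written with the same OWLAPI version
     * @return commit descriptions, in the order they should be created
     */
    static List<String> plan(String onDisk, String resavedAsIs, String resavedEdited) {
        List<String> commits = new ArrayList<>();
        // First commit: serialisation differences only; skipped when empty.
        if (!onDisk.equals(resavedAsIs)) {
            commits.add("formatting-only: re-serialise with current OWLAPI");
        }
        // Second commit: the user's changes on top of the re-serialised file.
        if (!resavedAsIs.equals(resavedEdited)) {
            commits.add("semantic: user changes");
        }
        return commits;
    }
}
```

Creating the commits themselves would still need a git client or library, which this sketch deliberately leaves out.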

One could also rewrite the whole repository with the latest API, but that would have consequences on the history that might be undesired.

matentzn commented 2 years ago

Ok, we have a lot of experience with normalising serialisations, no need to worry here.

I think if you maintain that the current solution in 4.5.22 is consistent, I am not so worried about the xsd:string situation. It's still much better than before, where this was widely inconsistent (some literals got xsd:string, others didn't).

@balhoff @cmungall @jamesaoverton @matthewhorridge

Executive summary:

Are we ok with that?

matthewhorridge commented 2 years ago

This is the right thing to do IMO

jamesaoverton commented 2 years ago

My understanding -- please correct me if I'm mistaken -- is that this is a difference between RDF 1.0 and RDF 1.1 specifications: https://www.w3.org/TR/rdf11-new/#literals

OWL (thus OWLAPI) is defined in terms of RDF 1.0. Apache Jena, for instance, uses RDF 1.1. ROBOT uses both OWLAPI and Jena, and mixing them causes these problems with diffs. There would be some short-term pain, but it would be more convenient for ROBOT users if OWLAPI and Jena handled literals in the same way.
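
While tools disagree on the surface form, one possible workaround on the diff side is to normalise both files before comparing them. The sketch below is a text-level hack rather than an RDF-aware solution: the class name is hypothetical, it assumes the `rdf:` prefix is bound as in the example ontology above, and it would also rewrite a matching string that happened to appear inside literal content.

```java
import java.util.regex.Pattern;

/** Hypothetical pre-diff normaliser for RDF/XML text; not part of OWLAPI, Jena, or ROBOT. */
final class XsdStringNormaliserSketch {

    // Matches an rdf:datatype attribute whose value is xsd:string,
    // written either as the full IRI or with an &xsd; entity.
    private static final Pattern EXPLICIT_XSD_STRING = Pattern.compile(
        "\\s+rdf:datatype\\s*=\\s*([\"'])"
            + "(?:http://www\\.w3\\.org/2001/XMLSchema#|&xsd;)string\\1");

    /** Removes redundant explicit xsd:string datatype attributes. */
    static String normalise(String rdfXml) {
        return EXPLICIT_XSD_STRING.matcher(rdfXml).replaceAll("");
    }
}
```

Applied to the label from the original report, `normalise` turns `<rdfs:label rdf:datatype="http://www.w3.org/2001/XMLSchema#string">test</rdfs:label>` into `<rdfs:label>test</rdfs:label>`, so both serialisations compare as identical text; other datatypes are left untouched.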

ignazio1977 commented 2 years ago

@jamesaoverton support for RDF 1.1 literals is what got added to version 5 but didn't work properly in version 4 until now. At least in intention, 4.5.22 supports RDF 1.1 literals.

ignazio1977 commented 1 year ago

Commit ready. To force explicit xsd:string output, use

format.setParameter("force xsd:string on literals", Boolean.TRUE);

This will work from the next release. Note: RIO formats don't honor that parameter.