owlcs / owlapi

OWL API main repository

Missing annotation assertions when reading RDF triples with a literal in the object position and a blank node in the subject position #100

Closed marstran closed 10 years ago

marstran commented 10 years ago

When reading this turtle file:

@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix owl: <http://www.w3.org/2002/07/owl#> .
@prefix ex: <http://www.example.org/> .
[] rdfs:label "Visible" ;
   ex:pred ex:Visible ;
   ex:pred "Not visible" .
ex:subj rdfs:label "Visible" .
ex:subj ex:pred "Visible" .

Using this code:

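import java.io.FileInputStream;
import java.io.InputStream;
import org.semanticweb.owlapi.apibinding.OWLManager;
import org.semanticweb.owlapi.model.OWLAxiom;
import org.semanticweb.owlapi.model.OWLOntology;
import org.semanticweb.owlapi.model.OWLOntologyManager;
public class ReadTurtleTest {
    public static void main(String[] args) throws Exception {
        OWLOntologyManager manager = OWLManager.createOWLOntologyManager();
        InputStream is = new FileInputStream("test.ttl");
        OWLOntology oo = manager.loadOntologyFromOntologyDocument(is);
        for (OWLAxiom oa : oo.getAxioms()) {
            System.out.println(oa);
        }
    }
}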

The following is printed out:

AnnotationAssertion(rdfs:label <http://www.example.org/subj> "Visible")
AnnotationAssertion(rdfs:label _:genid1 "Visible")
AnnotationAssertion(<http://www.example.org/pred> _:genid1 <http://www.example.org/Visible>)
AnnotationAssertion(<http://www.example.org/pred> <http://www.example.org/subj> "Visible")

I expected this output:

AnnotationAssertion(rdfs:label <http://www.example.org/subj> "Visible")
AnnotationAssertion(rdfs:label _:genid1 "Visible")
AnnotationAssertion(<http://www.example.org/pred> _:genid1 <http://www.example.org/Visible>)
AnnotationAssertion(<http://www.example.org/pred> <http://www.example.org/subj> "Visible")
AnnotationAssertion(<http://www.example.org/pred> _:genid1 "Not visible")

However, by adding the following triple to test.ttl:

ex:pred rdf:type owl:AnnotationProperty .

I got the output:

Declaration(AnnotationProperty(<http://www.example.org/pred>))
AnnotationAssertion(rdfs:label <http://www.example.org/subj> "Visible")
AnnotationAssertion(<http://www.example.org/pred> _:genid1 "Not visible")
AnnotationAssertion(rdfs:label _:genid1 "Visible")
AnnotationAssertion(<http://www.example.org/pred> _:genid1 <http://www.example.org/Visible>)
AnnotationAssertion(<http://www.example.org/pred> <http://www.example.org/subj> "Visible")

This is better. However, the files I am going to read do not contain that declaration triple. According to my tests, the bug happens only when the subject is a blank node and the object is a literal. I also did a small test with RDF/XML, and the same thing seemed to happen there.

matthewhorridge commented 10 years ago

Technically, you need the declaration (ex:pred rdf:type owl:AnnotationProperty) for the document to be a valid OWL 2 DL document, and for it to parse properly. Parsing without the declaration causes the OWL API to guess that ex:pred is an annotation property. Nevertheless, it would be nice if triples between IRIs, Anonymous Individuals and Literals were treated the same.
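
For files that lack the declaration, one possible workaround is to prepend it to the input stream before handing the stream to the parser. The following is a minimal sketch using only the Java standard library. The class and method names (DeclarationPrefixer, declarationsFor, withDeclarations) are hypothetical and not part of the OWL API. It assumes the input is Turtle, where a triple written with absolute IRIs may appear before the @prefix directives, and it would not work as-is for RDF/XML.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.io.SequenceInputStream;
import java.nio.charset.StandardCharsets;
import java.util.List;

// Hypothetical helper, not part of the OWL API.
final class DeclarationPrefixer {
    private static final String RDF_TYPE =
            "http://www.w3.org/1999/02/22-rdf-syntax-ns#type";
    private static final String OWL_ANNOTATION_PROPERTY =
            "http://www.w3.org/2002/07/owl#AnnotationProperty";

    /** Builds one Turtle declaration triple per property IRI, using absolute IRIs only. */
    static String declarationsFor(List<String> propertyIris) {
        StringBuilder sb = new StringBuilder();
        for (String iri : propertyIris) {
            sb.append('<').append(iri).append("> <")
              .append(RDF_TYPE).append("> <")
              .append(OWL_ANNOTATION_PROPERTY).append("> .\n");
        }
        return sb.toString();
    }

    /** Returns a stream that yields the declarations first, then the original document. */
    static InputStream withDeclarations(InputStream turtle, List<String> propertyIris) {
        byte[] header = declarationsFor(propertyIris).getBytes(StandardCharsets.UTF_8);
        return new SequenceInputStream(new ByteArrayInputStream(header), turtle);
    }
}
```

The returned stream could then be passed to manager.loadOntologyFromOntologyDocument(...) in place of the plain FileInputStream from the original report. This only helps when the undeclared property IRIs are known in advance.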

sesuncedu commented 10 years ago

I think there is a bug here, though I'm not sure exactly where. Not-quite-DL RDF-OWL loose parsing seems to have some odd issues.

Thought 1:

Since parsers are given an ontology to parse into, it might be useful to add a method to OWLOntologyManager that would allow a document to be loaded into an existing ontology. That would make it easy to patch in axioms for missing declarations so that they are available before parsing.

This would require a minor change to OWLRDFConsumer to load IRIs in the existing signature into the appropriate type sets, like in the current importsClosureChanged method (in fact, the implementation is basically doing an extract-method refactoring on the body of the loop and calling the extracted method from the constructor). I think this would also do it for RIO.

I think Manchester is the only other parser that would need a tweak to populate the names table from setDefaultOntology.

Thought 2:

I'm not sure that guessing at declarations as being annotations is the right behavior by default (I am not sure if the LAX/Strict mode controls this everywhere). It might be worth adding a delegate or handler that can interact with the user, or throw an exception (sort of like a Lisp cerror).

There are some cases where the property type(s) can be inferred, even though the official spec doesn't allow it (e.g. one can make a fair guess if an undeclared IRI is a subproperty of a typed property).
When reading a mostly T-Box document, annotation may be the most likely property assertion. In an A-Box heavy document, object property or data property assertions may be more likely.
This might require a fairly hefty degree of buffering.
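
To make the delegate/handler idea concrete, here is a rough sketch of what such a hook could look like. Everything in it (UndeclaredPropertyDemo, UndeclaredPropertyHandler, PropertyKind, and the three strategies) is hypothetical and does not exist in the OWL API. It only models the decision, using plain strings for IRIs.

```java
import java.util.Map;

// Hypothetical model of a pluggable handler; none of these types exist in the OWL API.
final class UndeclaredPropertyDemo {
    enum PropertyKind { ANNOTATION, OBJECT, DATA }

    /** Callback consulted when a predicate IRI has no type declaration. */
    interface UndeclaredPropertyHandler {
        PropertyKind resolve(String predicateIri, boolean objectIsLiteral);
    }

    /** Lax behaviour as described in the thread: guess annotation property. */
    static final UndeclaredPropertyHandler GUESS_ANNOTATION =
            (iri, objectIsLiteral) -> PropertyKind.ANNOTATION;

    /** Strict behaviour: refuse to guess and raise an error instead. */
    static final UndeclaredPropertyHandler STRICT = (iri, objectIsLiteral) -> {
        throw new IllegalStateException("Undeclared property: " + iri);
    };

    /**
     * Uses the kind of a declared super-property when one is known,
     * otherwise defers to the given fallback handler.
     */
    static UndeclaredPropertyHandler inheritFromSuper(Map<String, String> superOf,
                                                      Map<String, PropertyKind> declared,
                                                      UndeclaredPropertyHandler fallback) {
        return (iri, objectIsLiteral) -> {
            String parent = superOf.get(iri);
            PropertyKind kind = (parent == null) ? null : declared.get(parent);
            return (kind != null) ? kind : fallback.resolve(iri, objectIsLiteral);
        };
    }
}
```

Chaining handlers through a fallback keeps the "fair guess" heuristics separate from the final policy, so the same heuristic can sit in front of either the lax or the strict strategy.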

matthewhorridge commented 10 years ago

Hi Simon,

The parsing is indeed... strange....

Some background (I don't know if it's helpful to know this or not) and some thoughts.

If something can't be parsed as an OWL 2 DL document then the parser essentially uses some (undocumented and fragile) heuristics to try and parse something. It would be good if these strategies were pluggable, with one being a NoOp strategy for strict parsing. I don't believe the Lax/Strict mode has ever worked btw.

The idea of lax parsing was carried over from the OWL API 1 and I think was originally due to a lot of the OWL test cases that had missing type triples etc. and the fact that a lot of published ontologies are not in OWL DL (as it was then) for mainly trivial reasons. Also, at one stage during its development, punning between object, data and annotation properties was in the OWL 2 spec but got dropped due to objections from some within the RDF community. Rather than reject ontologies that use such punning the parser tries to parse them as best it can - this being deemed less hostile to users of Protege and similar tools. At one stage there were a lot of ontologies out there with property punning - I'm not sure what the state of this is now, but it was mainly due to some earlier versions of Protege (2 and 3) which unfortunately typed annotation properties as either object or data properties as well. I think it would be good to preserve lax parsing, but it should obviously be configurable.

With regards to Thought 1, certainly this would be possible for Version 4. For Version 3, I don't think it's wise to change the OWLOntologyManager interface. It might be possible to achieve the desired behaviour via a special OWLOntologyFactory however.

With regards to Thought 2, I think annotations were chosen as the safest way of slurping up extra triples (and they work nicely for things like the Dublin Core ontology and many common cases), but this obviously doesn't always work. I like the idea of a delegate/handler that's pluggable. The trouble is choosing between an annotation or data property and actually knowing whether it's ABox or TBox. Finally, loading all triples first allows full inspection of the graph and is the canonical way of parsing things, but can be slow and can consume lots of memory and in a lot of cases isn't actually necessary - which is why what appear to be safe cases are dealt with in a streaming manner (obviously, this isn't foolproof).
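
The difference between streaming and loading all triples first can be shown with a toy model. This sketch is not how OWLRDFConsumer works. The class TwoPassDemo, its Triple type, and the string labels are made up for illustration, and it assumes the only two outcomes are "declared data property" and "guessed annotation".

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Toy model only; names and classification rules are illustrative assumptions.
final class TwoPassDemo {
    static final String RDF_TYPE = "rdf:type";
    static final String OWL_DATA_PROPERTY = "owl:DatatypeProperty";

    /** A simplified triple; a real parser distinguishes IRIs, blank nodes and literals. */
    static final class Triple {
        final String s, p, o;
        Triple(String s, String p, String o) { this.s = s; this.p = p; this.o = o; }
    }

    private static boolean isDeclaration(Triple t) {
        return t.p.equals(RDF_TYPE) && t.o.equals(OWL_DATA_PROPERTY);
    }

    /** Streaming: a predicate is a data property only if its declaration came earlier. */
    static List<String> streaming(List<Triple> triples) {
        Set<String> dataProps = new HashSet<>();
        List<String> out = new ArrayList<>();
        for (Triple t : triples) {
            if (isDeclaration(t)) { dataProps.add(t.s); continue; }
            out.add((dataProps.contains(t.p) ? "data" : "annotation") + ":" + t.p);
        }
        return out;
    }

    /** Two-pass: buffer all triples, collect declarations first, then classify. */
    static List<String> twoPass(List<Triple> triples) {
        Set<String> dataProps = new HashSet<>();
        for (Triple t : triples) {
            if (isDeclaration(t)) dataProps.add(t.s);
        }
        List<String> out = new ArrayList<>();
        for (Triple t : triples) {
            if (isDeclaration(t)) continue;
            out.add((dataProps.contains(t.p) ? "data" : "annotation") + ":" + t.p);
        }
        return out;
    }
}
```

When a declaration appears after its first use, the streaming variant has already guessed "annotation" for that use, while the two-pass variant classifies it as a data property at the cost of holding every triple in memory.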

ignazio1977 commented 10 years ago

The lax parsing has to make assumptions and hit a quite difficult tradeoff between what the specs require and what people actually do in the wild. It gets dicey quickly.

About thought number one, it's not a bad idea. I know there are other libraries going through hoops to manually fill in triples into our consumers, and that requires them relying on implementation classes and torturing the code in horrible ways to do so. A clean interface to feed triples or axioms in would be good. It's not the first time that explicitly passing in ontologies to parsing is mentioned here. I can't see the issue where this was discussed, though.

ignazio1977 commented 10 years ago

Reading Matt's answer: in version 4 the actual ontology construction is delegated to OWLOntologyBuilder implementations. These can be changed by playing with the Guice injector, or ad hoc by replacing the ontology factories in an already built OWLOntologyManager. It should be possible to initialize an ontology with a set of declarations, although it might take some coding and it's a bit less versatile than being able to explicitly pass in an ontology, or having a preparse hook on which axioms could be funnelled.

matthewhorridge commented 10 years ago

"The lax parsing has to make assumptions and hit a quite difficult tradeoff between what the specs require and what people actually do in the wild. It gets dicey quickly."

This is so true. I'm also inclined to think that ontology defects have changed over the years, and will continue to change as different tools come and go.

"A clean interface to feed triples or axioms in would be good. It's not the first time that explicitly passing in ontologies to parsing is mentioned here. I can't see the issue where this was discussed, though."

It might have been related to something we worked on here... we wanted the ability to efficiently copy ontologies between managers and some of the ideas for interfaces looked similar to this.

ignazio1977 commented 10 years ago

The one I remember was issue #73 where some ambiguity in the javadocs was cleared. But, yeah, there's been people making encouraging noises in this area. We should have a good look at some use cases.

sesuncedu commented 10 years ago

I'm definitely thinking of version 4 for API changes (I've been patching against that branch).

Old ontology defects are strongly conserved; they get added to by mutated duplicates.
The standard tool chain seems to be protege -> top braid -> wordpad.

What I'm actually supposed to be working on is higher-level testing tools for BDD & TDD ontology development and SME validation. Syntactic validity would be nice (or even imports that don't require googling :-/).