schemaorg / schemaorg

Schema.org - schemas and supporting software
https://schema.org/
Apache License 2.0

Closed-world approach to SHACL shape for schema.org #3408

Open mhoangvslev opened 9 months ago

mhoangvslev commented 9 months ago

I encountered this problem: https://github.com/RDFLib/pySHACL/issues/215

In summary, I purposefully generated erroneous markup in which nutrition is a property of Product and used pySHACL to validate it. Because no constraint restricts where schema:nutrition may be used, pySHACL did not flag the error. The issue arises from the open-world nature of RDF; SHACL rules can be used to constrain the usage of schema:nutrition to specific classes if a more closed-world approach is desired.

Is there a possibility to further refine the OWL ontology for schema.org?
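
To make the reported behaviour concrete, here is a toy model in plain Python. It is not pySHACL and does not reproduce its internals; `ALLOWED`, `validate`, and the sample node are hypothetical names made up for illustration. The only point is that an open shape says nothing about properties it does not mention, while a closed shape reports them.

```python
# Toy model of open vs. closed shape validation (NOT pySHACL's implementation).
# ALLOWED maps a class to the properties its hypothetical shape declares.
ALLOWED = {
    "schema:Product": {"schema:name", "schema:offers"},
    "schema:Recipe": {"schema:name", "schema:nutrition"},
}


def validate(node: dict, closed: bool) -> list:
    """Return violation messages for a single JSON-LD-like node."""
    declared = ALLOWED[node["@type"]]
    used = {key for key in node if not key.startswith("@")}
    if not closed:
        # Open shape: only declared properties would be checked, and this toy
        # has no per-property constraints, so undeclared ones pass silently.
        return []
    # Closed shape: every property outside the declared set is a violation.
    return [
        f"{prop} is not declared for {node['@type']}"
        for prop in sorted(used - declared)
    ]


# Deliberately erroneous markup: schema:nutrition on a Product.
bad_product = {
    "@type": "schema:Product",
    "schema:name": "Granola bar",
    "schema:nutrition": {"@type": "schema:NutritionInformation"},
}

open_report = validate(bad_product, closed=False)
closed_report = validate(bad_product, closed=True)
print(open_report)
print(closed_report)
```

With the open shape the report is empty; with the closed shape the misplaced schema:nutrition is the single reported violation.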

mfhepp commented 9 months ago

I do not think we should add any constraints using OWL axioms (e.g. disjointness axioms etc.), for the following reasons:

  1. The semantics of domain and range in OWL and RDFS are not widely understood and are counterintuitive for many developers (see here for the RDFS mechanism for domain and range and here for the refined OWL semantics):
    • In a nutshell, in plain RDFS, the rdfs:domain of a property is a mere cue that, if the property is applied to an entity, the entity is of that type, i.e. it adds an informal hint of an additional type membership for that entity.
    • In OWL, the semantics of rdfs:domain and rdfs:range are defined more formally: an actual additional rdf:type assertion is added to either the subject or the object of the respective triple, e.g. either the product or the value.
  2. Schema.org defined its own mechanism via schema:domainIncludes and schema:rangeIncludes in order to avoid practical problems that might arise from the naive usage of rdfs:range and rdfs:domain, e.g. that additional type membership assertions are added by an OWL reasoner instead of any kind of error message.
  3. The exact semantics of these two properties is vague by design, because the most appropriate action depends on the data and the application:
    • A publisher of data will most likely want to detect the usage of incompatible or undefined properties.
    • A consumer of data may want to ignore the individual property, the entire block of data, or even the entire dataset. At Web scale, however, a consumer will often instead try to repair many such errors in the data (e.g. https vs. http namespace, maybe even some spelling mistakes).
  4. There is no need to add such SHACL or OWL axioms directly to schema.org, because
    • a standard check that only properties defined for the type are used can be implemented with ease, e.g. in SPARQL or another query language;
    • a SHACL file that derives hard constraints from the vague domain and range information can be produced automatically.
  5. It may even be harmful, because such statements are not as generally applicable as the rest of the vocabulary. Note that schema.org tries to strike a fine balance, in many subtle parts of the design, between precision on one hand, and ambiguity on the other hand. This is a very old problem, see e.g. the Wikipedia page on Ontological Commitment.

As for SHACL: IMO, a good approach would be to produce one set of authoritative SHACL shapes for classes and properties automatically for each release and to add it to the release as a separate resource.

There are some tools (I have not tested them myself) that might help with that task (this will require adding support for the schema-specific domain and range properties):

Hope you find this long ;-) comment useful!

mhoangvslev commented 9 months ago

For future readers who require citation, here is one:

mhoangvslev commented 9 months ago

After a while, I figured out a quick and dirty way to perform the type checking under the closed-world assumption (CWA):

from rdflib import BNode, ConjunctiveGraph, Literal, Namespace
from rdflib.collection import Collection
from rdflib.namespace import OWL, RDF, XSD

SH = Namespace("http://www.w3.org/ns/shacl#")


def close_ontology(graph: ConjunctiveGraph) -> ConjunctiveGraph:
    """Close the SHACL shapes in an already loaded shape graph, in place.

    Each shape inherits all property shapes of its parent classes and is
    then marked sh:closed. Shapes whose parents declare no property are
    not matched by the query and therefore stay open.
    """
    query = """
    SELECT DISTINCT ?shape ?parentShape ?parentProp WHERE {
        ?shape  a <http://www.w3.org/ns/shacl#NodeShape> ;
                a <http://www.w3.org/2000/01/rdf-schema#Class> ;
                <http://www.w3.org/2000/01/rdf-schema#subClassOf>* ?parentShape .

        ?parentShape <http://www.w3.org/ns/shacl#property> ?parentProp .
        FILTER(?parentShape != ?shape)
    }
    """

    # Materialise the results first, because the loop below mutates the graph.
    results = list(graph.query(query))
    visited_shapes = set()
    for result in results:
        shape = result["shape"]
        parent_prop = result["parentProp"]
        graph.add((shape, SH.property, parent_prop))

        if shape not in visited_shapes:
            graph.add((shape, SH.closed, Literal(True)))

            # shape sh:ignoredProperties ( rdf:type owl:sameAs )
            # https://www.w3.org/TR/turtle/#collections
            ignored_props = BNode()
            Collection(graph, ignored_props, [RDF.type, OWL.sameAs])
            graph.add((shape, SH.ignoredProperties, ignored_props))
            visited_shapes.add(shape)

    # Replace xsd:float with xsd:double
    for prop in list(graph.subjects(SH.datatype, XSD.float)):
        graph.set((prop, SH.datatype, XSD.double))

    return graph

github-actions[bot] commented 6 months ago

This issue is being nudged due to inactivity.