ucoProject / UCO

This repository is for development of the Unified Cyber Ontology.
Apache License 2.0

UCO should import the Collections Ontology to handle ordered lists #389

Closed: ajnelson-nist closed this issue 2 years ago

ajnelson-nist commented 2 years ago

Disclaimer

Participation by NIST in the creation of the documentation of mentioned software is not intended to imply a recommendation or endorsement by the National Institute of Standards and Technology, nor is it intended to imply that any specific software is necessarily the best available for the purpose.

Background

The Collections Ontology provides implementations of several Set- and Set-adjacent concepts, as an OWL 2 DL ontology.

UCO has a need to provide the ability to represent ordered lists, where the order is not necessarily determinable or recordable by some "keying" ordering property (e.g. a timestamp, or an incrementing ID number). This need exists for file fragmentation in file system analysis (especially for reporting from where a file was recovered and in what order pieces were put together), message threading, and other applications.

UCO should not invest effort in designing an independent ordered list representation. In RDF, especially OWL 2 DL-based RDF, ordered lists are non-trivial to represent due to requirements imposed on RDF lists. In particular, OWL 2 DL requires that RDF lists be blank nodes, and that they never fork. These requirements pose challenges for some UCO applications, so some class-defining work is needed to implement linked lists. This proposal suggests importing an ontology that has already carried out significant review of list implementation through the lens of OWL 2 DL requirements.

The source code of the Collections Ontology is trackable here:

https://github.com/collections-ontology/collections-ontology

A research article documenting and evaluating the ontology is here:

https://doi.org/10.3233/SW-130121

Requirements

Requirement 1

UCO must be able to provide an ability to represent an ordered list.

Requirement 2

UCO users must be able to validate usage of UCO's adopted and/or implemented ordered list concepts.

Requirement 3

UCO must be able to demonstrate compatibility with classes and properties of other independently-developed ontologies.

Requirement 4

The version of the Collections Ontology against which UCO develops its SHACL shapes must be known to the UCO user.

Risk / Benefit analysis

Benefits

Risks

With this being potentially UCO's first import of an external ontology, there are several nontrivial points to consider.

Risk 1 - Linearity of CO List

The first class of UCO interest in CO, co:List, is linear only. (This is confirmable with some of the list member linking properties being owl:FunctionalPropertys, implying that after OWL inferencing, any co:ListItem would only have one next co:ListItem after owl:sameAs is applied.) Forking a list is not supported, which falls short of the needs of one of the intended first adopters of ordering, observable:MessageThread.

Fortunately, some of the superclasses and superproperties of co:List and its properties provide sufficient basis to build a forking variant similar to co:List. That variant is provided in a separate proposal scoped to observable:MessageThread adopting CO.

Risk 2 - Intentionally incomplete coverage of SHACL

When importing CO, there is a question of how much of this ontology should be as usable and testable to UCO users as UCO ontology concepts. That is, how much validation capability should UCO provide (or at least, incubate)?

This proposal's accompanying PR implements the minimal SHACL shapes needed to get a set of unit tests to pass. Those tests demonstrate expected correct and incorrect usage of concepts that will be needed to support (1) UCO's needs of MessageThread (coming in a separate PR) and (2) estimated needs to support some concept that uses the linear co:List (a yet-unnamed file fragmentation representation, and/or disk partition systems).

The PR for this adoption of CO goes no further with defining SHACL shapes. Interested community members should feel free to expand the coverage if they wish.

Risk 2.1 - Other integer types

CO employs specific integer types on some properties, xsd:nonNegativeInteger and xsd:positiveInteger. A community member provided early feedback on this ontology, suggesting these be relaxed in SHACL review because JSON-LD output size grew considerably.

Contrary to the incomplete SHACL coverage of CO, the proposed SHACL enforcement respects the datatype designations, and works to ensure that data validated with SHACL is consistent with non-UCO usage of CO concepts.

The extra JSON-LD file weight can be reverted by usage of JSON-LD context dictionaries.
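As a rough illustration of the context-dictionary point, the sketch below builds the same three list items twice: once with the datatype written inline on every co:index value, and once with a JSON-LD context entry that coerces co:index to xsd:positiveInteger. The kb: namespace IRI and the node names are made up for the example, and the size comparison is only indicative of the trend, not a measurement of any real CASE or UCO data.

```python
import json

PREFIXES = {
    "co": "http://purl.org/co/",
    "kb": "http://example.org/kb/",  # hypothetical namespace for this sketch
    "xsd": "http://www.w3.org/2001/XMLSchema#",
}

# Verbose form: every co:index value carries its datatype inline.
verbose = {
    "@context": dict(PREFIXES),
    "@graph": [
        {
            "@id": f"kb:list-item-{i}",
            "@type": "co:ListItem",
            "co:index": {"@type": "xsd:positiveInteger", "@value": str(i)},
        }
        for i in range(1, 4)
    ],
}

# Compact form: the context declares the datatype of co:index once,
# so each value can be written as a plain string.
compact_context = dict(PREFIXES)
compact_context["co:index"] = {"@type": "xsd:positiveInteger"}
compact = {
    "@context": compact_context,
    "@graph": [
        {
            "@id": f"kb:list-item-{i}",
            "@type": "co:ListItem",
            "co:index": str(i),
        }
        for i in range(1, 4)
    ],
}

verbose_size = len(json.dumps(verbose))
compact_size = len(json.dumps(compact))
print(verbose_size, compact_size)
```

The one-time cost of the extra context entry is paid back after a couple of list items, which is why the saving grows with list length.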

Risk 3 - Transitive closure - "error" ontology

CO imports a utility ontology, the Error Ontology. To import CO is, through transitive closure, to import the Error Ontology, and its sole property, an annotation property named "error" with a maximum cardinality of 0. Its usage model is, if it appears, declare the graph OWL-inconsistent.

Whether to implement a SHACL restriction for this property is left out of scope of this proposal. The risk of importing the Error Ontology is believed to be 0.

Risk 4 - Revisions to SHACL coding style may exceed documentation capabilities

(This risk pertains to the Solutions Approval phase of the Change Proposal process, but since a solution is being provided, we should feel free to discuss it earlier if beneficial.)

The PR accompanying this proposal changes how validation of individual properties occurs. To date, SHACL properties in UCO have been inlined in class definitions as anonymous sh:PropertyShape individuals. E.g. this excerpt of UCO's core Turtle defines a property shape, which establishes requirements for core:name, but only in the context of the class core:UcoObject:

core:UcoObject
    a
        owl:Class ,
        sh:NodeShape
        ;
    sh:property [
        sh:datatype xsd:string ;
        sh:maxCount "1"^^xsd:integer ;
        sh:nodeKind sh:Literal ;
        sh:path core:name ;
    ] ;
    sh:targetClass core:UcoObject ;
    .

This means core:name can be used with no restrictions on any class that is not a core:UcoObject subclass. This is a programming flaw, and an interested community member should consider stepping in to correct this.

The PR uses a different coding style, making universal constraints universally applicable. The above would be written instead in this manner:

core:UcoObject
    a
        owl:Class ,
        sh:NodeShape
        ;
    sh:property [
        sh:maxCount "1"^^xsd:integer ;
        sh:path core:name ;
    ] ;
    sh:targetClass core:UcoObject ;
    .

core:name-subjects-shape
    a sh:PropertyShape ;
    sh:datatype xsd:string ;
    sh:nodeKind sh:Literal ;
    sh:path core:name ;
    sh:targetSubjectsOf core:name ;
    .

Any usage of core:name should adhere to core:name-subjects-shape. The anonymous property shape in UcoObject implements two things: (1) an association of UcoObject with core:name, and (2) a more stringent constraint-set than the universal constraint-set. (This example happens to require (1) more than (2).)
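To make the difference between the two styles concrete, here is a small pure-Python sketch (not SHACL, and not how pyshacl works internally) that imitates only the focus-node selection: sh:targetClass picks nodes typed as core:UcoObject, while sh:targetSubjectsOf picks every subject of core:name. The kb: and ex: names are hypothetical, and subclass inference is deliberately ignored.

```python
from typing import NamedTuple


class Literal(NamedTuple):
    value: str
    datatype: str


# Triples are (subject, predicate, object); IRI objects are plain strings.
TRIPLES = [
    ("kb:object-1", "rdf:type", "core:UcoObject"),
    ("kb:object-1", "core:name", Literal("Alpha", "xsd:string")),
    # Not a UcoObject, and its core:name has the wrong datatype.
    ("kb:other-1", "rdf:type", "ex:UnrelatedClass"),
    ("kb:other-1", "core:name", Literal("7", "xsd:integer")),
]


def name_is_valid(value):
    """Mimic sh:nodeKind sh:Literal plus sh:datatype xsd:string."""
    return isinstance(value, Literal) and value.datatype == "xsd:string"


def violations(triples, focus_nodes):
    """Return focus nodes that have at least one invalid core:name value."""
    return sorted(
        {
            s
            for s, p, o in triples
            if s in focus_nodes and p == "core:name" and not name_is_valid(o)
        }
    )


def class_scoped_focus(triples):
    """Focus nodes in the style of sh:targetClass core:UcoObject."""
    return {s for s, p, o in triples if p == "rdf:type" and o == "core:UcoObject"}


def subjects_of_focus(triples):
    """Focus nodes in the style of sh:targetSubjectsOf core:name."""
    return {s for s, p, o in triples if p == "core:name"}


class_scoped = violations(TRIPLES, class_scoped_focus(TRIPLES))
universal = violations(TRIPLES, subjects_of_focus(TRIPLES))
print(class_scoped, universal)
```

In this toy graph the class-scoped check reports nothing, while the subjects-of check flags kb:other-1, which is the gap the IRI-named shape is meant to close.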

The reason for using this IRI-named-shape coding style is CO does not have directly encoded class-property associations for several of the relevant properties. Some of the class-property associations are inferrable via rdfs:domain statements and RDFS inferencing (or, in some cases, OWL inferencing).

Other new-to-UCO SHACL coding styles were found necessary to include. One property is defined with a range of the complement of a named class, necessitating a sh:not. One property restricts a value with one level of path-indirection (firstItem must refer to an object with no previousItem), necessitating a two-member sh:path list (see uco-co:firstItem-subjects-previousItem-shape).

All of the above may cause challenges with CASE and UCO's current selection of a documentation generator, and possibly any documentation generator currently available. Some of the above code styles can be rolled back to use UCO's current style (even if the coding ends up redundant), but others do not have more elementary forms available that meet the same level of expressivity.

Risk 5 - Non-support of OWL features

The accompanying PR intentionally does not implement support for some OWL features pertaining to inferencing. Primarily, this is in handling of identity resolution and some properties that are designated owl:FunctionalPropertys. (If a property P is functional, then a graph with S P T1 and S P T2 would cause an inference that T1 owl:sameAs T2.) It is likely the fully correct test for SHACL validation of an owl:FunctionalProperty, after OWL inferencing is applied, would need to rely on SHACL-SPARQL. Such a shape for a property ex:p would be:

ex:p-subjects-shape
    a sh:PropertyShape ;
    sh:path ex:p ;
    sh:sparql [
        a sh:SPARQLConstraint ;
        sh:prefixes ex: ;
        sh:select """
            SELECT $this
            WHERE {
                $this ex:p ?thing1 .
                $this ex:p ?thing2 .
                FILTER ( ?thing1 != ?thing2 )
                FILTER NOT EXISTS {
                    ?thing1 owl:sameAs ?thing2 .
                }
            }
            """ ;
    ] ;
    sh:targetSubjectsOf ex:p ;
    .

(Reminder: Select queries in SHACL-SPARQL find all violations of a shape.)

The accompanying PR chooses to assume OWL inferencing is not in use, on the (untested) assumption that such a query would be expensive for end users to run in their SHACL validation. Instead, sh:maxCount 1 is constrained on all owl:FunctionalPropertys. Any community members interested in OWL inference evaluation should feel encouraged to propose implementing the SHACL-SPARQL pattern in the future. Alternatively, they could be included in the proposal PR, with sh:deactivated applied to keep the tests disabled unless the sh:deactivated statement were deleted by a review process willing to pay the analysis time cost.
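The trade-off between the two checks can be sketched in plain Python. The first function imitates sh:maxCount 1 with no inferencing; the second treats values linked by owl:sameAs as one individual before counting. Note one assumption that goes beyond the SPARQL pattern above: the sketch closes owl:sameAs symmetrically and transitively with a union-find, whereas the query as written only looks for a direct ?thing1 owl:sameAs ?thing2 triple. All kb: names are hypothetical.

```python
from collections import defaultdict


def max_count_violations(triples, prop):
    """Subjects with more than one distinct value of prop (no inferencing)."""
    values = defaultdict(set)
    for s, p, o in triples:
        if p == prop:
            values[s].add(o)
    return sorted(s for s, objs in values.items() if len(objs) > 1)


def same_as_aware_violations(triples, prop, same_as="owl:sameAs"):
    """Like max_count_violations, but values joined by owl:sameAs count once."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    # Merge every pair of nodes asserted to be the same individual.
    for s, p, o in triples:
        if p == same_as:
            parent[find(s)] = find(o)

    values = defaultdict(set)
    for s, p, o in triples:
        if p == prop:
            values[s].add(find(o))
    return sorted(s for s, objs in values.items() if len(objs) > 1)


TRIPLES = [
    # Two values that are declared identical: only a violation without inferencing.
    ("kb:s-1", "ex:p", "kb:t-1"),
    ("kb:s-1", "ex:p", "kb:t-1b"),
    ("kb:t-1", "owl:sameAs", "kb:t-1b"),
    # Two genuinely different values: a violation either way.
    ("kb:s-2", "ex:p", "kb:t-2"),
    ("kb:s-2", "ex:p", "kb:t-3"),
]

strict = max_count_violations(TRIPLES, "ex:p")
relaxed = same_as_aware_violations(TRIPLES, "ex:p")
print(strict, relaxed)
```

The sh:maxCount approach is stricter: it flags kb:s-1 even though its two values are asserted to be the same individual, which is the cost of assuming inferencing is not in use.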

Risk 6 - Conflict with Facet strategy

There is significant potential for confusion, due to UCO's usage of Facets, when reviewing what should be the subclass of a co:List (or any externally-developed list). The proposer believes this is best viewed as an opportunity to review elementary UCO design that has to date remained unchanged, and unchallenged, since the prototype days. In particular, why is this the pattern to attach a "Set member" to a core:ContextualCompilation:

{
  "@id": "kb:contextual-compilation-1",
  "@type": "core:ContextualCompilation",
  "core:description": "Compilation of important messages",
  "core:object": {
    "@id": "kb:message-1"
  }
}

while this is the pattern to attach a "List member" (without ordering) to the current implementation of observable:MessageThread?

{
  "@id": "kb:message-thread-1",
  "@type": "observable:MessageThread",
  "core:description": "Thread of important messages",
  "core:hasFacet": {
    "@type": "observable:MessageThreadFacet",
    "observable:message": {
      "@id": "kb:message-1"
    }
  }
}

Risk 7 - co:element and property chain axioms

Consider the node kb:contextual-compilation-1 defined in Risk 6. There is an equivalent property in CO, co:element, that could be used to define a similar structure:

{
  "@id": "kb:contextual-compilation-1",
  "@type": "co:Collection",
  "core:description": "Compilation of important messages",
  "co:element": {
    "@id": "kb:message-1"
  }
}

This is consistent with CO in terms of OWL constraints, and consistent with UCO in form of JSON-LD data. However, co:element is defined as a property chain axiom:

:element
    a owl:ObjectProperty ;
    rdfs:label "has element"@en ;
    rdfs:comment "The link to the members of a collection"@en ;
    rdfs:domain :Collection ;
    owl:propertyChainAxiom (
        :item
        :itemContent
    ) ;
    .

That owl:propertyChainAxiom statement means, if a :element b, then there exists some c such that a :item c and c :itemContent b, and domains and ranges of :item and :itemContent would infer additional characteristics about c.

An OWL inferencing application might take instances of :element and use them to infer and/or require the existence of a node satisfying the c form. It's possible (the proposer is uncertain) that if such a c already existed in the graph as a named node, no new node would be generated; it is also possible a blank node would always be generated, and later resolved as owl:sameAs the named node. If the latter is the case, it is unclear whether SHACL-SPARQL would be needed as with the owl:FunctionalProperty discussion noted in Risk 5.

Due to needing to understand some OWL-SHACL interactions better, this proposal leaves validating co:element with SHACL as out of scope. It should be considered if UCO's core:ContextualCompilation (or some superclass) would be somehow aligned with co:Collection.

Risk 8 - Open question on redistribution of imported ontologies

It is not yet decided in the accompanying Pull Request whether the Collections Ontology, in whole or in part, would be "compiled" into the monolithic UCO ontology. One axiom needed to make SHACL function was copied and cited as copied within the CO SHACL implementation, due to being needed for some SHACL functionality. Should the entirety of CO (as tracked in a Git submodule) be copied into the monolithic build?

Risk 9 - Increased reliance on tooling support

A substantial amount of CASE example data has been generable by hand (that is, by a person rather than a program), at scales that can produce sufficient illustration of concepts. co:List, as a doubly-linked list in RDF, is sufficiently cumbersome to write that programming support becomes more necessary to generate hand-written examples, which implies a need for library functions for developers.
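As a sketch of the kind of library function this risk anticipates, the helper below turns an ordered sequence of member IRIs into JSON-LD nodes for a doubly-linked list. It is an illustration only, not the layout used in tests/examples/co_PASS.json: the function name, the item IRI naming scheme, and the choice to emit co:lastItem (assumed here to be a CO property, as it is not mentioned elsewhere in this proposal) are all assumptions.

```python
import json


def build_co_list(list_id, member_ids):
    """Return JSON-LD nodes for a co:List whose items wrap member_ids in order."""
    count = len(member_ids)
    items = []
    for position, member_id in enumerate(member_ids, start=1):
        item = {
            "@id": f"{list_id}-item-{position}",
            "@type": "co:ListItem",
            "co:index": {"@type": "xsd:positiveInteger", "@value": str(position)},
            "co:itemContent": {"@id": member_id},
        }
        # Link both directions, omitting the link that would fall off an end.
        if position > 1:
            item["co:previousItem"] = {"@id": f"{list_id}-item-{position - 1}"}
        if position < count:
            item["co:nextItem"] = {"@id": f"{list_id}-item-{position + 1}"}
        items.append(item)

    list_node = {
        "@id": list_id,
        "@type": "co:List",
        "co:item": [{"@id": item["@id"]} for item in items],
    }
    if items:
        list_node["co:firstItem"] = {"@id": items[0]["@id"]}
        list_node["co:lastItem"] = {"@id": items[-1]["@id"]}  # assumed CO property
    return [list_node] + items


# The same member may appear twice, since each occurrence gets its own item node.
graph = build_co_list("kb:list-1", ["kb:message-1", "kb:message-2", "kb:message-1"])
print(json.dumps(graph, indent=2))
```

Even for three members this produces four nodes and over a dozen links, which gives a sense of why writing such lists by hand becomes error-prone.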

Competencies demonstrated

Competency 1

A set of UcoObjects needs to behave as an ordered list, which is known to be complete, and has no ordering key other than insertion order within this list.

Competency Question 1.1

Can UCO represent this list?

Result 1.1

With CO, yes. See tests/examples/co_PASS.json, node kb:list-1.

Competency Question 1.2

Can one of the UcoObjects be in the list twice? (One could use this for, say, representing a known handoff sequence of some object where someone ferries the object multiple times.)

Result 1.2

Yes, this is a capability of co:Lists, subclass of co:Bag (aka a multiset).

Competency Question 1.3

Suppose the beginning and end of the list are known, but an item is missing from the middle. Can the order of known items still be queried?

Result 1.3

Yes. The total-ordering property co:nextItem records direct links between list members. The partial-ordering property x co:followedBy y indicates the co:ListItem y follows x, though after 1 or more co:nextItem links, exact count unknown. See tests/examples/co_PASS.json, node kb:list-2.
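A minimal sketch of such an ordering query, assuming the graph has been flattened to (subject, predicate, object) tuples: both co:nextItem and co:followedBy are treated as "comes before" edges, and a breadth-first search answers whether one item precedes another. The item names are hypothetical and do not come from kb:list-2 in the test file.

```python
from collections import defaultdict, deque

ORDERING_PROPERTIES = ("co:nextItem", "co:followedBy")


def precedes(triples, earlier, later):
    """Return True if later is reachable from earlier via ordering links."""
    successors = defaultdict(set)
    for s, p, o in triples:
        if p in ORDERING_PROPERTIES:
            successors[s].add(o)

    seen = set()
    queue = deque([earlier])
    while queue:
        node = queue.popleft()
        for following in successors[node]:
            if following == later:
                return True
            if following not in seen:
                seen.add(following)
                queue.append(following)
    return False


# Item 3 is unknown, so items 2 and 4 are linked only by co:followedBy.
TRIPLES = [
    ("kb:item-1", "co:nextItem", "kb:item-2"),
    ("kb:item-2", "co:followedBy", "kb:item-4"),
    ("kb:item-4", "co:nextItem", "kb:item-5"),
]

first_before_last = precedes(TRIPLES, "kb:item-1", "kb:item-5")
last_before_first = precedes(TRIPLES, "kb:item-5", "kb:item-1")
print(first_before_last, last_before_first)
```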

Competency 2

The Collections Ontology is provided as a Git repository on GitHub.

Competency Question 2.1

What is the current version of the CO? Is this the version that UCO is tracking as a Git submodule?

Result 2.1

The current version can be seen by visiting the CO GitHub page. The current version, identified by Git SHA-1, is 619e7b02646321174635fd04be658e338bf7d1d7.

The version tracked by UCO can be seen with this command:

$ git submodule status
 194c61523f98dfa8ae4338837737158d2636373f dependencies/CASE-Utility-SHACL-Inheritance-Reviewer (0.2.0)
 619e7b02646321174635fd04be658e338bf7d1d7 dependencies/collections-ontology (heads/master)
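For anyone who wants to check the pinned version from a script, the following sketch parses text in the shape of the output above into a path-to-SHA mapping. The function name is made up for this example, and the sample text is copied from this issue rather than produced by running Git.

```python
def parse_submodule_status(output):
    """Map each submodule path to the commit SHA-1 reported for it."""
    pinned = {}
    for line in output.splitlines():
        if not line.strip():
            continue
        # Each line starts with a one-character status flag (space, -, + or U),
        # none of which can appear at the start of a hexadecimal SHA-1.
        fields = line.lstrip(" +-U").split()
        sha, path = fields[0], fields[1]
        pinned[path] = sha
    return pinned


SAMPLE_OUTPUT = """\
 194c61523f98dfa8ae4338837737158d2636373f dependencies/CASE-Utility-SHACL-Inheritance-Reviewer (0.2.0)
 619e7b02646321174635fd04be658e338bf7d1d7 dependencies/collections-ontology (heads/master)
"""

pinned = parse_submodule_status(SAMPLE_OUTPUT)
co_sha = pinned["dependencies/collections-ontology"]
print(co_sha)
```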

Solution suggestion

Coordination

ajnelson-nist commented 2 years ago

Discussion in today's OC meeting only made it through Risk 4. We will discuss the remaining risks at the next OC meeting. Feedback is encouraged in advance of the meeting.

cyberinvestigationexpress commented 2 years ago

After reviewing this proposal alongside #393 and the referenced CO2 paper, I agree that UCO should import the Collections Ontology to handle ordered lists.

The need for ordered lists and the solution provided by CO2 are compelling. The risks do not block this proposal. Although CO2 does not support non-linear message threads, UCO still requires representation of ordered lists (Risk 1). SHACL coverage, tooling support, inferencing, and maintaining documentation automatically are beyond the scope of this proposal (Risks 2, 4, 5, and 9), and are possible future development.

Regarding Risk 6, conflict with Facet strategy, any existing examples of MessageThread in UCO or CASE are notional and should be replaced with updated representations (including message.json example).

To address Risk 7, could we establish a usage convention to use List and ListItem for ordered lists, and not the co:element equivalent property?

ajnelson-nist commented 2 years ago

Documentation impact

Resolving shape IRIs as URLs

There is a CASE/UCO-developed Python script that generates symbolic links to map IRIs to generated documentation pages. That script has not been adapted yet to link shape files that are not simultaneously OWL classes. Hence, shape-only IRIs will not resolve as expected, but they will once someone takes the time to update that script.

This turns out to be a second blocker on IRI resolution; the first is that there is an unresolved error in the configuration of the documentation hosting server for UCO's IRI resolution. A ticket is being resolved with the provider to address this.

SHACL shape pages for shapes applied without targetClass

These appear to work. For instance, this is the generated Turtle snippet for uco-co:index-subjects-shape:

@prefix co: <http://purl.org/co/> .
@prefix sh: <http://www.w3.org/ns/shacl#> .
@prefix uco-co: <https://ontology.unifiedcyberontology.org/co/> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .

uco-co:index-subjects-shape a sh:PropertyShape ;
    sh:datatype xsd:positiveInteger ;
    sh:nodeKind sh:Literal ;
    sh:path co:index ;
    sh:targetSubjectsOf co:index .

SHACL shapes using sh:not

The documentation page for the shape using sh:not renders. Here is the generated page file's source, and here is the generated Turtle snippet:

@prefix co: <http://purl.org/co/> .
@prefix sh: <http://www.w3.org/ns/shacl#> .
@prefix uco-co: <https://ontology.unifiedcyberontology.org/co/> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .

uco-co:itemContent-subjects-shape a sh:NodeShape ;
    sh:not [ a sh:PropertyShape ;
            sh:class co:Item ;
            sh:description "This shape encodes in SHACL that the range of co:itemContent is the complement of co:Item."@en ;
            sh:path co:itemContent ] ;
    sh:property [ a sh:PropertyShape ;
            sh:description "This shape encodes in SHACL that co:itemContent is an OWL FunctionalProperty (giving the sh:maxCount constraint)."@en ;
            sh:maxCount 1 ;
            sh:nodeKind sh:BlankNodeOrIRI ;
            sh:path co:itemContent ] ;
    sh:targetSubjectsOf co:itemContent .

Odd blank nodes

There are two curious shape pages made due to what appears to be a bug in Ontospy:

The names are due to rdflib skolemizing a blank node with RAM addresses. It is possible that they should not have been produced in the generated documentation, being blank nodes.

They display a null implementation box, so it's difficult to determine what they are supposed to contain. They aren't navigable from the generated documentation tree, so this may only be a confusion point for anybody reviewing the source files tree in a local Git clone. It also turns out GitHub does not list them due to there being too many files in that directory.

Summary of impact

There appear to be no "showstopper"-grade issues with the documentation engine.

ajnelson-nist commented 2 years ago

The PR has been updated. The "root" uco.ttl file now imports the Collections Ontology shapes graph, and transitively imports the Collections Ontology, and a piece of the CI related to normalization was fixed.

ajnelson-nist commented 2 years ago

The UCO OC has had a standing, vaguely-specified question about risk assessments with respect to importing ontologies.

Issue 406 has just been posted to at least provide an answer to part of the risk assessments: Would importing this ontology cause UCO to become non-conformant with OWL 2 DL any more than it knows itself to be? (The remainder of the larger question, about modeling risks, remains out of scope of this comment and the linked issue.)

Issue 406 starts the answer to that question by defining SHACL shapes that review OWL 2 DL conformance. It is highly likely to be an incomplete review, but at least hits some of the significant issues UCO tries to prevent exposing its users to, including but not limited to:

There are two results of applying this review to the Collections Ontology:

  1. Nothing appears to be an OWL 2 DL non-conformance, according to the shapes defined in that uco-owl: SHACL test suite.
  2. The review of the transitive owl:imports closure of CO can be done, but the most practical way to do so with CI is to also track the error ontology as a Git submodule. If so tracked, this Makefile reviews co: and error:. This patch tracks error: as a submodule, and I suggest that it be incorporated into PR 390 as part of the Solutions Approval vote.

sbarnum commented 2 years ago

A few comments on various portions of the CP:

Risk4: I think we need to be very careful in any attempts to apply the change in SHACL property shape definition style as proposed under Risk4 in any broad way. From the PR it looks like it is only proposed for use within co.ttl for now and that seems fine. We should be very careful in attempting it outside of co.ttl. It should work fine for DatatypeProperties such as the string example but cannot be similarly universally applied for ObjectProperties as they require flexibility in sh:class assertions under different class contexts. For a literal such as a string you can always add additional local property shape properties such as the sh:maxCount in the example or others such as sh:minLength, sh:pattern, etc. You could do similar additional localized constraints for Integer based DatatypeProperties. For ObjectProperties you would not be able to define a universal shape asserting sh:class. A given property used with a given class may end up being a specific subclass of its general use on other classes. We need to be careful how broadly we look to apply such a new style pattern and be confident in our understanding of its effects.

Risk6: I do not see any conflict with Facet strategy. The description for this risk implies that there is inconsistency or error in how Facets are currently used. I would disagree with this assertion. Facets are classes characterizing some aspect of a UcoObject through the properties associated with them. They are a special form of structured concept classes (as described in the UCO design document (https://unifiedcyberontology.org/resources/uco_design_document.html)) that are only ever used as the range of the core:hasFacet property on UcoObjects. This is used to convey characterization of particular aspects of UcoObjects in UCO currently and is intended to serve as a clean extension point for the specification of custom structured concept class characterizations of particular aspects of UcoObjects by third party users outside of the currently defined UCO spec. The confusion about apparent inconsistency asserted in the risk writeup can be easily explained in that the second example (observable:MessageThread) is in the observable namespace where the community has had a longstanding explicit consensus to support duck typing for observables and that this imparted the requirement that properties of ObservableObjects are always conveyed via relevant Facets. There should be no confusion and there is no unintentional inconsistency. If you wanted to convey a message thread as a CO class you would simply define a property like observable:messageThread on observable:MessageThreadFacet and have it use a CO class as its range. We don't have to implement a CO class as a Facet directly. This should have no impact on how Collections are used within UCO as they should be leveraged as the range of properties and not as Classes. I suspect attempting to specify UCO classes as CO classes has a significantly greater likelihood to run into semantic conflicts than using them as types for properties.

Risk7: I agree that we should avoid trying to change Compilation to align with CO due to some of the semantic complexities explicitly outlined in risk7 and more generally referenced in the end to my comment above on risk 6.

Risk8: I do not have a confident answer to this either though I suspect we would want to compile it into our monolithic build given that the purpose of that monolithic build is to convey the complete set of UCO such that it can be processed, analyzed and/or used without worrying about whether all parts are present and in the correct form.

ajnelson-nist commented 2 years ago

From voting today, we will include the error ontology patch.

ajnelson-nist commented 2 years ago

While updating example JSON-LD, I found that an error I was somewhat expecting was not being triggered.

For everyone's awareness - the Collections Ontology makes some requirements on certain more-stringent integer types (e.g. xsd:positiveInteger on co:index). pyshacl currently has an unresolved issue with resolving this, which is due to an upstream issue on RDFLib.

Effectively, on resolution of this PR, some data updates may need to be made to assign types according to CO requirements.

Meanwhile, there may be another patch added on top of the already-merged solution for UCO Issue 389, to add a sh:minInclusive statement so co:index does not get assigned the value 0 (whether typed as xsd:integer or xsd:positiveInteger). I believe this is in-scope of the approved solution, so I don't plan to open another vote on the matter, especially if it is added before the UCO 0.10.0 release.
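To illustrate why both constraints matter, here is a plain-Python sketch of the two checks applied to a co:index literal: one on the declared datatype, one on the minimum value. It is an approximation for discussion, not how pyshacl or RDFLib evaluates sh:datatype or sh:minInclusive, and the function name and problem labels are made up.

```python
import re

# Lexical form of an XSD integer: optional sign followed by digits.
_INTEGER_LEXICAL = re.compile(r"[+-]?[0-9]+")


def co_index_problems(lexical, datatype):
    """Return labels of the checks that a co:index literal would fail."""
    problems = []
    if datatype != "xsd:positiveInteger":
        problems.append("datatype")
    if not _INTEGER_LEXICAL.fullmatch(lexical):
        problems.append("lexical")
    elif int(lexical) < 1:
        # Corresponds to a minimum-inclusive bound of 1 on co:index.
        problems.append("minInclusive")
    return problems


well_typed = co_index_problems("1", "xsd:positiveInteger")
zero_as_integer = co_index_problems("0", "xsd:integer")
zero_as_positive = co_index_problems("0", "xsd:positiveInteger")
print(well_typed, zero_as_integer, zero_as_positive)
```

The last case is the one the extra sh:minInclusive statement targets: a value of 0 that claims to be an xsd:positiveInteger passes a datatype-name check alone if the validator does not verify the lexical value against the type.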