ucoProject / UCO

This repository is for development of the Unified Cyber Ontology.
Apache License 2.0

Remove core:id #431

ajnelson-nist closed 1 year ago

ajnelson-nist commented 2 years ago

Background

UCO defines a concept in the core namespace called id, a datatype property with a range of uco-types:Identifier. uco-types:Identifier is only defined as an rdfs:Datatype, with no further definition.

A concept with fragment id risks significant conflicts with RDF technologies, particularly JSON-LD. The proposer also cannot discern anything in the definition of core:id that justifies UCO inventing a concept that is already present (albeit without an explicit name) in the foundation of RDF.

Requirements

Requirement 1

core:id must be removed.

Risk / Benefit analysis

Benefits

Risks

No risk is known to the proposer, on account of no observed usage in the UCO or CASE ontologies, or in any CASE example.

The proposer believes, from the following described risks, that it is a greater risk to UCO to retain core:id than to delete it.

From conversations, meetings, and comment threads since 2019, the original motivations of core:id seem to the proposer to have been the following:

  1. To anchor a mechanism, designed for JSON-LD, so the @id key could be aliased to id in UCO JSON-LD. (The @ character as a dictionary key causes problems in some JSON engines, such as programming languages where JSON is a first-order construct and @ is not a legal name character.)
  2. To define a constraining format for UCO-borne concepts. E.g., any or all individuals typed as a UcoObject subclass could be required to follow an identifier format ending with an IRI, such as CASE recommends as a practice.

No other purpose has been recorded, or to the proposer's knowledge, discussed.

Unfortunately, neither purpose is appropriately served through defining an ontology concept of core:id.

On aliasing @id: This is a problem to solve in engineering and documentation that is specific to JSON-LD, not through concept definition in the serialization-independent ontology.
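As a minimal sketch of handling the alias at the serialization layer, the helper below rewrites plain id and type keys into JSON-LD's @id and @type before any RDF processing. The function name apply_keyword_aliases and the alias table are hypothetical and are not part of UCO or of any JSON-LD library; a JSON-LD context entry such as "id": "@id" achieves the same effect inside a conforming processor.

```python
import json
from typing import Any, Dict

# Hypothetical alias table: plain JSON keys mapped to JSON-LD keywords.
KEYWORD_ALIASES: Dict[str, str] = {"id": "@id", "type": "@type"}


def apply_keyword_aliases(value: Any) -> Any:
    """Recursively rename aliased keys; every other key and value is kept."""
    if isinstance(value, dict):
        return {
            KEYWORD_ALIASES.get(key, key): apply_keyword_aliases(item)
            for key, item in value.items()
        }
    if isinstance(value, list):
        return [apply_keyword_aliases(item) for item in value]
    return value


plain = json.loads('{"id": "kb:x", "type": "Action", "result": {"id": "kb:y"}}')
aliased = apply_keyword_aliases(plain)
print(json.dumps(aliased))
```

Nothing in this step touches the ontology: the rename happens entirely in the document layer.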

Further, it looks highly fragile, and possibly fundamentally incompatible with RDF, to attempt to bind core:id as an alias to JSON-LD's @id. core:id is currently defined as an owl:DatatypeProperty, meaning its range is a Literal (and, stranger, a stub literal-type that might or might not be an xsd:string by default). If that binding to @id succeeded, all data linkage would cease to work, because suddenly this:

{
  "@id": "ex:something",
  "rdfs:comment": "A string describing ex:something"
}

is illegal RDF. RDF cannot annotate strings in this manner. A graph engine, seeing this, would either silently drop the attempted triple, or raise a parse exception.
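The constraint can be sketched with a toy triple type. This is a hand-rolled model for illustration, assuming only the RDF rule that a subject is an IRI or a blank node and never a literal. The names IRI, BlankNode, Literal, and make_triple are hypothetical and do not come from any RDF library.

```python
from dataclasses import dataclass
from typing import Tuple, Union


@dataclass(frozen=True)
class IRI:
    value: str


@dataclass(frozen=True)
class BlankNode:
    label: str


@dataclass(frozen=True)
class Literal:
    lexical_form: str


Term = Union[IRI, BlankNode, Literal]


def make_triple(subject: Term, predicate: Term, obj: Term) -> Tuple[Term, Term, Term]:
    """Build a triple, rejecting terms in positions that RDF does not allow."""
    if not isinstance(subject, (IRI, BlankNode)):
        raise TypeError("RDF subject must be an IRI or a blank node")
    if not isinstance(predicate, IRI):
        raise TypeError("RDF predicate must be an IRI")
    return (subject, predicate, obj)


comment = IRI("http://www.w3.org/2000/01/rdf-schema#comment")

# An IRI subject is accepted.
ok = make_triple(IRI("http://example.org/something"), comment, Literal("A string"))

# A literal subject, which a literal-valued identifier would imply, is rejected.
try:
    make_triple(Literal("ex:something"), comment, Literal("A string"))
    literal_subject_rejected = False
except TypeError:
    literal_subject_rejected = True
```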

On defining a constraining format for UCO-borne concepts: This is an idea for a significant, but not total, usage profile of UCO. For knowledge graphs that start entirely from using UCO classes and properties, it is reasonable to suggest a form of IRI. However, UCO should also be usable to add extension annotations on IRIs borne elsewhere - for instance, to annotate the IRI mailto:test@example.org as having the type observable:EmailAddress. UCO is fundamentally RDF, and RDF is about expanding description of IRIs, whether or not the IRIs are in an original domain of interest. Thus, total enforcement of IRI form is impossible, beyond the original constraints in the RDF specification, Concepts and Abstract Syntax.

Competencies demonstrated

Competency 1

Note that none of these competency questions require the use of core:id.

Competency Question 1.1

What are the identifiers of all nodes of type observable:File?

Result 1.1

See all returned values of ?nNode.

SELECT ?nNode
WHERE {
  ?nNode a observable:File .
}

Competency Question 1.2

What are the identifiers of all non-blank nodes of type observable:File?

Result 1.2

See all returned values of ?nNode.

SELECT ?nNode
WHERE {
  ?nNode a observable:File .
  FILTER ( isIRI(?nNode) )
}

Competency Question 1.3

What are the identifiers of all non-blank nodes that do not end with a UUID?

SELECT ?nIRI
WHERE {
  ?nIRI a ?nType .
  FILTER (
    isIRI(?nIRI) &&
    !regex(
      STR(?nIRI),
      "[0-9A-F]{8}-[0-9A-F]{4}-[0-9A-F]{4}-[0-9A-F]{4}-[0-9A-F]{12}$",
      "i"
    )
  )
}
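The same pattern can be exercised outside a SPARQL engine. This sketch applies the regular expression from the query using Python's re module. The sample IRIs and the helper name iris_without_uuid_suffix are made up for illustration, and Python's regex dialect is assumed to agree with SPARQL's for this particular pattern.

```python
import re
from typing import List

# Same pattern as the SPARQL FILTER, with the "i" flag expressed as re.IGNORECASE.
UUID_SUFFIX = re.compile(
    r"[0-9A-F]{8}-[0-9A-F]{4}-[0-9A-F]{4}-[0-9A-F]{4}-[0-9A-F]{12}$",
    re.IGNORECASE,
)


def iris_without_uuid_suffix(iris: List[str]) -> List[str]:
    """Return the IRIs that do not end with a UUID-shaped suffix."""
    return [iri for iri in iris if UUID_SUFFIX.search(iri) is None]


# Hypothetical sample data.
sample_iris = [
    "http://example.org/kb/file-a0a69ece-da9c-4256-a9a8-5dec82a4ad1f",
    "mailto:test@example.org",
    "http://example.org/kb/x",
]
non_uuid_iris = iris_without_uuid_suffix(sample_iris)
print(non_uuid_iris)
```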

Solution suggestion

Delete core:id.

Coordination

sbarnum commented 2 years ago

The core:id and core:type properties are not extraneous or locally invented concepts in UCO.

An id and a type are necessary for every object in the graph for the graph to cohere and have integrity. This holds true for all serializations of UCO. If any given serialization dropped either of these from any object then its content would not be able to be deserialized or cross-serialized to another serialization with any integrity.

It is true that RDF serializations such as JSON-LD that are inherently graph-based recognize this requirement and enforce it by default. However, the implicit mapping of the @type to the owl:Class of the object is a JSON-LD binding rule rather than anything explicit in the ontology itself. And there is no mapping whatsoever of the @id to anything specific in the ontology. It is simply required by RDF serializations. A producer must explicitly assert it or in the case of "blank nodes" the rdf processor will autogenerate something locally (but not globally) unique. Again, this is all part of the rdf serialization side and not the ontology side.

For any non-rdf based serializations we cannot presume such binding rules apply. Serializing to YAML, for example, has no requirements for id or type on objects and if the ontology did not provide the core:id and core:type properties it would be unclear and impractical to recognize that the objects could/should be adorned with them.

For JSON-LD serialization bindings, the core:id property is serialized as @id and the core:type property is serialized as @type, so you do not need @id and @type properties alongside duplicate core:id and core:type properties within the object. This is why core:id currently has maxCount=1 but neither property has a minCount: it enables non-duplicative use in serializations like JSON-LD.

Net-Net is that we cannot presume that just because one serialization form (JSON-LD), even if it is our default form, handles id & type implicitly that other serializations don't require an explicit codification of these properties in the ontology. core:id and core:type are relevant and necessary.

ajnelson-nist commented 2 years ago

@sbarnum Please provide a technology demonstration of what you mean. I continue to believe core:id and core:type are superfluous re-inventions, and are further incompatible with JSON-LD on an RDF level.

sbarnum commented 2 years ago

I am unsure what you mean by a technology demonstration but here are three example serializations of the same simple file with a single hash that represent the same UCO content.

JSON-LD serialization:

{
  "@id": "kb:file-a0a69ece-da9c-4256-a9a8-5dec82a4ad1f",
  "@type": "uco-observable:File",
  "uco-core:hasFacet": [
    {
      "@id": "kb:ContentDataFacet-1e54fa5e-1399-476c-8aa7-00781b8c12db",
      "@type": "uco-observable:ContentDataFacet",
      "uco-observable:hash": [
        {
          "@id": "kb:hash-87c24a7f-a0d2-41a3-a726-0521a5c7bc8c",
          "@type": "uco-types:Hash",
          "uco-types:hashMethod": {
            "@type": "uco-vocabulary:HashNameVocab",
            "@value": "SHA256"
          },
          "uco-types:hashValue": {
            "@type": "xsd:hexBinary",
            "@value": "e5ca3be56f66200a1bb2262e948ac08dbc672bc8033c1ada743787b0c667dea6"
          }
        }
      ]
    }
  ]
}

Simple JSON representation (any producer or consumer could take this and apply the official json-ld context to convert it to json-ld but they may also want to just leave it as is):

{
  "uco-core:id": "kb:file-a0a69ece-da9c-4256-a9a8-5dec82a4ad1f",
  "uco-core:type": "uco-observable:File",
  "uco-core:hasFacet": [
    {
      "uco-core:id": "kb:ContentDataFacet-1e54fa5e-1399-476c-8aa7-00781b8c12db",
      "uco-core:type": "uco-observable:ContentDataFacet",
      "uco-observable:hash": [
        {
          "uco-core:id": "kb:hash-87c24a7f-a0d2-41a3-a726-0521a5c7bc8c",
          "uco-core:type": "uco-types:Hash",
          "uco-types:hashMethod": {
            "uco-core:type": "uco-vocabulary:HashNameVocab",
            "uco-core:value": "SHA256"
          },
          "uco-types:hashValue": {
            "uco-core:type": "xsd:hexBinary",
            "uco-core:value": "e5ca3be56f66200a1bb2262e948ac08dbc672bc8033c1ada743787b0c667dea6"
          }
        }
      ]
    }
  ]
}

YAML:

---
uco-core:id : kb:file-a0a69ece-da9c-4256-a9a8-5dec82a4ad1f
uco-core:type : uco-observable:File
uco-core:hasFacet :
  - uco-core:id : kb:ContentDataFacet-1e54fa5e-1399-476c-8aa7-00781b8c12db
    uco-core:type : uco-observable:ContentDataFacet
    uco-observable:hash :
      - uco-core:id : kb:hash-87c24a7f-a0d2-41a3-a726-0521a5c7bc8c
        uco-core:type : uco-types:Hash
        uco-types:hashMethod :
          uco-core:type : uco-vocabulary:HashNameVocab
          uco-core:value : SHA256
        uco-types:hashValue :
          uco-core:type : xsd:hexBinary
          uco-core:value : e5ca3be56f66200a1bb2262e948ac08dbc672bc8033c1ada743787b0c667dea6

All of these should be valid UCO and represent the same thing. With the json-ld the producer is required to know each object requires an @id and @type but without an explicit core:id and core:type in the ontology they are doing this implicitly outside of the explicit ontology. With the simple json, the explicit core:id and core:type in the ontology clearly convey how they can represent these necessary properties in the content whether or not they choose to utilize the json-ld context to transform to json-ld. With the yaml, the explicit core:id and core:type in the ontology clearly convey how they can represent these necessary properties in the content.

It is true that our current SHACL shape validation will only work against an RDF serialization such as JSON-LD, however it should be possible to deserialize or cross-serialize any of these serializations to such a form for validation.

Without the explicit core:id and core:type properties in the ontology, there is no explicit construct for the concepts of id and type for serializations to utilize. We cannot presume that just because some serializations (such as JSON-LD) implicitly presume them that we do not need them in general.

core:id and core:type are not incompatible with JSON-LD. As the above examples show, and with the changes I made to the json-ld context CP (mapping "core:id" to "@id" and "core:type" to "@type"), the alignment is explicit and simple.

ajnelson-nist commented 2 years ago

Replace this line in your YAML:

uco-core:type : uco-observable:File

with this line:

rdf:type : uco-observable:File

And you will once again be compatible with RDF.

I don't know YAML well enough to know how it handles the identity of a node. However, if you really wanted YAML to be a supported serialization, you should have integrated YAML-based testing into the ontology's test suite as a demonstration that UCO is capable of supporting it. Nobody has brought forth such a test case, and with 1.0.0 slated for Aug 30, I will be highly surprised if such a case comes up before Monday.

You continue to propose breaking JSON-LD as an RDF serialization, and removing any ability of UCO to interoperate with any other RDF data. rdfs:subClassOf pertains to rdf:type, and has no idea what core:type is. core:type remains a superfluous reinvention of a concept from core RDF with an incompatible property-typing, and core:id supports no serialization language that was on the UCO development roadmap. The 1.0.0 roadmap only includes JSON-LD and JSON.

If you want to restore core:id after 1.0.0, please demonstrate the technological need with a passing CI test using another language. Until such a test arises, core:id and your suggested uses of it would reduce UCO's functional serialization languages from JSON plus all of RDF to only a schema-less JSON---i.e. zero interoperability.

ajnelson-nist commented 2 years ago

@sbarnum, here is a functioning demonstration of why you cannot substitute a UCO, or any, concept for @id and @type.

Test program:

#!/usr/bin/env python3

# This software was developed at the National Institute of Standards
# and Technology by employees of the Federal Government in the course
# of their official duties. Pursuant to title 17 Section 105 of the
# United States Code this software is not subject to copyright
# protection and is in the public domain. NIST assumes no
# responsibility whatsoever for its use by other parties, and makes
# no guarantees, expressed or implied, about its quality,
# reliability, or any other characteristic.
#
# We would appreciate acknowledgement if the software is used.

"""
This script demonstrates issues with using the JSON-LD Context
Dictionary to alias @id and @type to independently-developed concepts.
"""

import rdflib

def main() -> None:
    data_without_context = """{
    "id": "kb:x",
    "type": "Action",
    "result": "kb:y"
}"""

    # Example 1: Using full IRIs, except for kb: prefix, and letting id
    # and type function as @id and @type.
    context_1 = {
        # Namespace prefixes
        "kb": "http://example.org/kb/",
        # Classes
        "Action": "https:///ontology.unifiedcyberontology.org/uco/action/Action",
        # Properties
        "result": {
            "@id": "https:///ontology.unifiedcyberontology.org/uco/action/result",
            "@type": "@id",
        },
        # JSON-LD structures
        "id": "@id",
        "type": "@type",
    }

    graph_1 = rdflib.Graph()
    graph_1.parse(data=data_without_context, format="json-ld", context=context_1)
    graph_1.serialize("out_1.ttl", format="turtle")

    # Example 2: Substituting a UCO concept, by full IRI, for core
    # JSON-LD structural annotations.
    context_2 = {
        # Namespace prefixes
        "kb": "http://example.org/kb/",
        # Classes
        "Action": "https:///ontology.unifiedcyberontology.org/uco/action/Action",
        # Properties
        "result": {
            "@id": "https:///ontology.unifiedcyberontology.org/uco/action/result",
            "@type": "@id",
        },
        # JSON-LD structures
        "https:///ontology.unifiedcyberontology.org/uco/core/id": "@id",
        "https:///ontology.unifiedcyberontology.org/uco/core/type": "@type",
    }

    graph_2 = rdflib.Graph()
    graph_2.parse(data=data_without_context, format="json-ld", context=context_2)
    graph_2.serialize("out_2.ttl", format="turtle")

    # Example 3: As with Example 2, but with UCO namespace-prefixes added.
    context_3 = {
        # Namespace prefixes
        "action": "https:///ontology.unifiedcyberontology.org/uco/action/",
        "core": "https:///ontology.unifiedcyberontology.org/uco/core/",
        "kb": "http://example.org/kb/",
        # Classes
        "Action": "action:Action",
        # Properties
        "result": {"@id": "action:result", "@type": "@id"},
        # JSON-LD structures
        "core:id": "@id",
        "core:type": "@type",
    }

    graph_3 = rdflib.Graph()
    graph_3.parse(data=data_without_context, format="json-ld", context=context_3)
    graph_3.serialize("out_3.ttl", format="turtle")

    # Example 4: As with Example 3, but trying to specify that core:id
    # and core:type should be object properties instead of potentially
    # being interpreted as datatype properties.  Further, in case order
    # of the @id specification causes an influence, define core:type
    # before core:id.
    context_4 = {
        # Namespace prefixes
        "action": "https:///ontology.unifiedcyberontology.org/uco/action/",
        "core": "https:///ontology.unifiedcyberontology.org/uco/core/",
        "kb": "http://example.org/kb/",
        # Classes
        "Action": "action:Action",
        # Properties
        "result": {"@id": "action:result", "@type": "@id"},
        # JSON-LD structures
        "core:type": {"@id": "@type", "@type": "@id"},
        "core:id": {"@id": "@id", "@type": "@id"},
    }

    graph_4 = rdflib.Graph()
    graph_4.parse(data=data_without_context, format="json-ld", context=context_4)
    graph_4.serialize("out_4.ttl", format="turtle")

if __name__ == "__main__":
    main()

This shell transcript of running this program shows only the first form generates output that preserves the node identifier and RDF type. (A virtual environment is active for the purposes of having access to rdflib.)

(venv) $ head out_*.ttl
==> out_1.ttl <==
@prefix kb: <http://example.org/kb/> .
@prefix ns1: <https:///ontology.unifiedcyberontology.org/uco/action/> .

kb:x a ns1:Action ;
    ns1:result kb:y .

==> out_2.ttl <==
@prefix kb: <http://example.org/kb/> .
@prefix ns1: <https:///ontology.unifiedcyberontology.org/uco/action/> .

[] ns1:result kb:y .

==> out_3.ttl <==
@prefix action: <https:///ontology.unifiedcyberontology.org/uco/action/> .
@prefix kb: <http://example.org/kb/> .

[] action:result kb:y .

==> out_4.ttl <==
@prefix action: <https:///ontology.unifiedcyberontology.org/uco/action/> .
@prefix kb: <http://example.org/kb/> .

[] action:result kb:y .

This is because the remaining forms break the RDF-level parsing of JSON-LD: what would have been rdf:type becomes a string (despite my attempt in Example 4 to specify that it should be a node reference), and the node identifier is lost because @id no longer references the subject of the RDF triple once a new property is substituted for it. I admit the preserved action:result triple surprised me, as I'm not sure how that blank-node subject is being generated.
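One possible reading of the transcript, offered as an assumption and not checked against rdflib's internals: the data uses the keys id and type, while contexts 2 through 4 define their aliases under other names (core:id, core:type, or the full IRIs). A JSON-LD processor then finds no term definition for id or type and drops those keys, while result is still defined and survives on a node with no identifier, which is emitted as a blank node. The toy lookup below models only that key-matching step. resolve_keys and the simplified term tables are hypothetical and are not the JSON-LD expansion algorithm.

```python
from typing import Dict, List, Optional

# Keys as they appear in the test program's data_without_context.
data_keys: List[str] = ["id", "type", "result"]

# Simplified term tables: term name -> what the term expands to.
context_1_terms: Dict[str, str] = {
    "id": "@id",
    "type": "@type",
    "result": "action:result",
}
context_3_terms: Dict[str, str] = {
    "core:id": "@id",
    "core:type": "@type",
    "result": "action:result",
}


def resolve_keys(keys: List[str], terms: Dict[str, str]) -> Dict[str, Optional[str]]:
    """Map each data key to its expansion, or None when no term is defined."""
    return {key: terms.get(key) for key in keys}


resolved_1 = resolve_keys(data_keys, context_1_terms)
resolved_3 = resolve_keys(data_keys, context_3_terms)
print(resolved_1)
print(resolved_3)
```

Under this reading, only the first context gives the data's id and type keys any meaning, which is consistent with out_1.ttl being the only output that keeps kb:x and its type.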

After this experiment, my recommendation stands, that core:id and core:type be deleted, especially to prevent temptation to make this node-anonymizing and type-destroying substitution.

ajnelson-nist commented 2 years ago

Solution has been evaluated in PR #458.