ucoProject / UCO

This repository is for development of the Unified Cyber Ontology.
Apache License 2.0

Resolve current bug in UCO that does not require globally unique IDs for all class objects #430

Closed sbarnum closed 1 year ago

sbarnum commented 1 year ago

Background

The following excerpted portion of the UCO Design Document (https://unifiedcyberontology.org/resources/uco_design_document.html) provides a summary overview of the various types of classes in UCO and how they work together.

"In the UCO RDFS/OWL/SHACL ontology, classes are defined for any relevant domain concept as well as for any structured concept characterizing some aspect of a domain concept. These are structured concept classes that specify into UcoObject classes, Facet classes and other classes. UcoObject and Facet classes therefore are structured concept classes, however, UcoObject classes and Facet classes are disjoint from each other. Moreover, Facet classes inhere in UcoObject classes; this implies that for a facet concept to exist, it is dependent on the existence of the UcoObject concept that bears the facet. For example, when destroying a red car, the car as bearer for the red color is removed and with it, its red color disappears. Note that the reverse is not true; UcoObjects are not existentially dependent on facets, and, thus, cannot inhere in them. Note further that, although the example suggests that facets are compulsary for UcoObject concepts, this is not the case. Domain concept classes (e.g., File, Action, Identity, Location, Device, etc.) are defined as subclasses of the UcoObject class. Facet classes characterize a particular pattern of properties that potentially apply for more than one domain class; a color, weight, an address and alike (described in Section 5 below) represent characteristics that apply not only for cars, but also for houses, persons, books and what have you. Domain concept classes represent the things whereas facet classes represent the thing’s characteristics. The disjointness between them follows from the fact that the thing can never be the same as its characteristics. All objects in UCO must specify a globally unique identifier (discussed in Section 4 below) and an assertion of the class type of the object."

The last line of the above excerpt is very important and highlights an overlooked bug in the current and past implementations of UCO. Currently, only UcoObject specifically codifies the core:id and core:type properties providing/requiring a globally unique identifier for each instance of the class. Without such a codification and requirement, subclasses of core:Facet or any other structured classes (core:ExternalReference, marking:GranularMarking, observable:MimePartType, etc) in UCO are simply treated as blank nodes with a locally (NOT globally) defined ID.

From the W3C wiki page (https://www.w3.org/wiki/BlankNodes) on blank nodes:

You can identify BlankNodes locally with a NodeId. that ID can be used to talk about the node inside your particular file/store of information, but you can't use it to ID the node externally.

This means that UCO content within a single file or produced within a single, uniform store of information has the potential to hang together in a coherent fashion but as soon as you attempt to merge or blend graphs from different files or information stores (a critical fundamental purpose for UCO) the graph falls apart as the lack of globally unique IDs on non-UcoObject class objects means that they lose coherence with the UcoObject they are part of. Local NodeIds are typically assigned by RDF processors following similar or identical algorithms for each set of content leading to a certainty of ID conflicts in merged content.

This is a critical bug that needs to be addressed.

Requirements

Requirement 1

Every individual instance of a UCO class must have a globally unique id

Requirement 2

Merged graphs of UCO content from different files, information stores or producers must maintain relational graph integrity where non-UcoObject class objects maintain unique and coherent relation to the UcoObjects they are an inherent part of.

Risk / Benefit analysis

Benefits

Content blended from multiple UCO graphs (a fundamental purpose of UCO) will be possible.

Risks

Increases each non-UcoObject class object by one property. Existing examples will need to be updated.

Competencies demonstrated

Competency 1

Maintain integrity of UCO content in merged graphs from multiple origins

Competency Question 1.1

Query a UcoObject containing inherent embedded class content (e.g. a File observable object containing a FileFacet with property content)

Result 1.1

Return the full UcoObject with all of the embedded (FileFacet) content with accuracy and integrity

Competency Question 1.2

Query a merged graph for multiple UcoObjects (from different origin graphs) containing inherent embedded class content.

Result 1.2

Return the full UcoObjects with all of the embedded (FileFacet) content with accuracy and integrity

Solution suggestion

[]
    a owl:AllDisjointClasses ;
    owl:members (
        array:ArrayOfAction
        tool:BuildConfigurationType
        # ... there are actually quite a lot ...
        core:Facet
        core:UcoObject
        # ...
    ) ;
    .

This proposed solution of utilizing a defined common base class for all UCO classes to specify the required globally unique ID for all classes is cleaner than simply adding core:id and core:type to each of the non-UcoObject classes in UCO. It is also easier to maintain and provides better coherence to the UCO class tree and cleans up much of the current messiness in the class hierarchy.

Examples

This simple example is from the same Section 3 of the UCO Design Document as the excerpt quoted in the Background section above:

{
  "@graph": [
    {
      "@id": "kb:person-952c09ff-5a38-483b-9dcf-6d8f0b27dfac",
      "@type": "identity:Person",
      "core:objectCreatedTime": {
        "@type": "xsd:dateTime",
        "@value": "2017-06-25T12:12:12.12Z"
      },
      "core:name": "John Smith",
      "core:hasFacet": [
        {
          "@id": "kb:5ecfbe78-e7c7-4b23-97fd-5ede9cc32123",
          "@type": "identity:SimpleNameFacet",
          "identity:givenName": "John",
          "identity:familyName": "Smith"
        }
      ]
    },
    {
      "@id": "kb:relationship-cecfbe8c-8357-4105-b448-b491177fedf2",
      "@type": "core:Relationship",
      "core:kindOfRelationship": "located-at",
      "core:source": "kb:person-952c09ff-5a38-483b-9dcf-6d8f0b27dfac",
      "core:target": "kb:location-7044bee0-d5d2-45f3-bb5d-2ced42bfd3f4"
    },
    {
      "@id": "kb:location-7044bee0-d5d2-45f3-bb5d-2ced42bfd3f4",
      "@type": "location:Location",
      "core:hasFacet": [
        {
          "@id": "kb:69e9fe37-f2ee-435b-998f-7b1b0d60a405",
          "@type": "location:SimpleAddressFacet",
          "location:locality": "New York City",
          "location:region": "New York",
          "location:country": "USA",
          "location:street": "5th Ave"
        }
      ]
    }
  ]
}

Coordination

ajnelson-nist commented 1 year ago

I believe this proposal is strategically wrong and will file two proposals correcting underlying issues.

The short version is that core:id and core:type must be deleted due to conflicts with core RDF.

ajnelson-nist commented 1 year ago

Looking again, I now think only the parts of this proposal pertaining to core:id and core:type are wrong, on account of my belief that core:id and core:type are wrong to include in UCO at all. I am drafting those proposals still.

However, there is another piece that I think is missing from your solution suggestion. We allow sh:nodeKind sh:BlankNodeOrIRI on all of our object properties. I think this proposal is supposed to include instead using sh:nodeKind sh:IRI on most, if not all, of the object properties' shapes.

Last, I remember we had discussed this before in Jira, and I had asked you for an example and you might not have gotten a notice of the Jira comment. How would you represent a file that has a hash? I think that is going to be an essential sanity-check.

ajnelson-nist commented 1 year ago

@sbarnum : Also, if the top-most class in UCO would now be core:ClassBase, we should expand the disjoint statement between core:UcoObject and core:Facet to cover the other sibling subclasses of core:ClassBase. E.g., this axiom should now be included in core::

[]
    a owl:AllDisjointClasses ;
    owl:members (
        array:ArrayOfAction
        tool:BuildConfigurationType
        # ... there are actually quite a lot ...
        core:Facet
        core:UcoObject
        # ...
    ) ;
    .

It's actually a bit of a surprise when looking at what Protege displays as subclasses of owl:Thing.

sbarnum commented 1 year ago

@ajnelson-nist Good catch on changing sh:nodeKind sh:BlankNodeOrIRI to sh:nodeKind sh:IRI on ObjectProperty SHACL shapes. I had missed that implication.

Here is an example of a file with a hash:

{
  "@id": "kb:file-a0a69ece-da9c-4256-a9a8-5dec82a4ad1f",
  "@type": "uco-observable:File",
  "uco-core:hasFacet": [
    {
      "@id": "kb:ContentDataFacet-1e54fa5e-1399-476c-8aa7-00781b8c12db",
      "@type": "uco-observable:ContentDataFacet",
      "uco-observable:hash": [
        {
          "@id": "kb:hash-87c24a7f-a0d2-41a3-a726-0521a5c7bc8c",
          "@type": "uco-types:Hash",
          "uco-types:hashMethod": {
            "@type": "uco-vocabulary:HashNameVocab",
            "@value": "SHA256"
          },
          "uco-types:hashValue": {
            "@type": "xsd:hexBinary",
            "@value": "e5ca3be56f66200a1bb2262e948ac08dbc672bc8033c1ada743787b0c667dea6"
          }
        }
      ]
    }
  ]
}

sbarnum commented 1 year ago

I have no objections to expanding the disjoint statement to include all classes that only have owl:Thing as a superclass (i.e. add in all of the classes that are neither subclasses of UcoObject or Facet).

ajnelson-nist commented 1 year ago

FYI, the observable:hash snippet has an error - the literals (@value-bearing) must not have @id.
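
As a rough illustration of that rule together with Requirement 1, here is a minimal Python sketch (standard library only) that walks a JSON-LD-like dictionary and reports node objects that lack an @id or use a blank node label, and value objects that wrongly carry an @id. The function name `find_id_problems`, the path strings, and the sample data are illustrative assumptions, not part of UCO or of any JSON-LD library. It only inspects the compacted form shown in this thread and does not process @context.

```python
def find_id_problems(node, path="$"):
    """Return a list of (path, problem) pairs for a JSON-LD-like structure."""
    problems = []
    if isinstance(node, list):
        for i, item in enumerate(node):
            problems.extend(find_id_problems(item, f"{path}[{i}]"))
    elif isinstance(node, dict):
        if "@value" in node:
            # Value object (a literal): it must not carry an identifier.
            if "@id" in node:
                problems.append((path, "value object has @id"))
            return problems
        # A bare {"@graph": [...]} wrapper is a container, not a node.
        is_wrapper = "@graph" in node and "@type" not in node
        if not is_wrapper:
            node_id = node.get("@id")
            if node_id is None:
                problems.append((path, "node object lacks @id"))
            elif node_id.startswith("_:"):
                problems.append((path, "node object uses a blank node label"))
        for key, value in node.items():
            # Descend into properties and @graph, skip other keywords.
            if not key.startswith("@") or key == "@graph":
                problems.extend(find_id_problems(value, f"{path}.{key}"))
    return problems


# Sample data modeled on the file-with-hash example (literals without @id).
file_with_hash = {
    "@id": "kb:file-a0a69ece-da9c-4256-a9a8-5dec82a4ad1f",
    "@type": "uco-observable:File",
    "uco-core:hasFacet": [
        {
            "@id": "kb:ContentDataFacet-1e54fa5e-1399-476c-8aa7-00781b8c12db",
            "@type": "uco-observable:ContentDataFacet",
            "uco-observable:hash": [
                {
                    "@id": "kb:hash-87c24a7f-a0d2-41a3-a726-0521a5c7bc8c",
                    "@type": "uco-types:Hash",
                    "uco-types:hashMethod": {
                        "@type": "uco-vocabulary:HashNameVocab",
                        "@value": "SHA256",
                    },
                }
            ],
        }
    ],
}

print(find_id_problems(file_with_hash))
```

Plain string values such as the target of core:source are ignored by this sketch, so it says nothing about whether a referenced identifier actually exists in the graph.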

sbarnum commented 1 year ago

I very fundamentally disagree with the assertion to remove core:id and core:type properties. I have added a comment to the related CP explaining why. All of the rationale I have seen to date for removing them is based on a presumption that JSON-LD and other RDF serializations are the only way to serialize UCO. This has not been the case since the beginning of UCO and CASE. JSON-LD is the default serialization but UCO should support any other serialization as well.

sbarnum commented 1 year ago

FYI, the observable:hash snippet has an error - the literals (@value-bearing) must not have @id.

Oops. I got id happy. LOL.

I will fix it. Thanks.

sbarnum commented 1 year ago

I fixed the example to remove my extraneously added ids.

sbarnum commented 1 year ago

I updated the CP to include the changes to the ObjectProperty SHACL shapes' sh:nodeKind and the class disjoint statement.

sbarnum commented 1 year ago

I realized that our JSON-LD context should contain the following:

"core:id": "@id",
"core:type": "@type",

Rather than

"id": "@id",
"type": "@type",

In this way the plain json cleanly aligns to the ontology as expected and the context does the work of mapping those properties to @id and @type.

We can also add any documentation we want to the json-ld context file outside of the "context" definition object that documents details of our json-ld serialization. The processor will simply ignore the extra content.

I am going to make the above change to the json-ld context proposal.

ajnelson-nist commented 1 year ago

"core:id": "@id",
"core:type": "@type",

That breaks JSON-LD if core:id and core:type are owl:DatatypePropertys.

ajnelson-nist commented 1 year ago

All of the rationale I have seen to date for removing them is based on a presumption that JSON-LD and other RDF serializations are the only way to serialize UCO. This has not been the case since the beginning of UCO and CASE. JSON-LD is the default serialization but UCO should support any other serialization as well.

In terms of what UCO has committed to developing technologically for 1.0.0, JSON-LD is in scope, and we are trying very hard for JSON that is not JSON-LD. Other non-RDF syntaxes have not been presented as specific use cases.

ajnelson-nist commented 1 year ago

"core:id": "@id",
"core:type": "@type",

That breaks JSON-LD if core:id and core:type are owl:DatatypePropertys.

Further, @type must always be interpreted as rdf:type, and @id must always be interpreted as a node identifier. I don't think you appreciate that you are proposing completely breaking RDF functionality of JSON-LD with these properties.

ajnelson-nist commented 1 year ago

Re:

        {
          "@id": "kb:hash-87c24a7f-a0d2-41a3-a726-0521a5c7bc8c",
          "@type": "uco-types:Hash",
          "uco-types:hashMethod": {
            "@type": "uco-vocabulary:HashNameVocab",
            "@value": "SHA256"
          },
          "uco-types:hashValue": {
            "@type": "xsd:hexBinary",
            "@value": "e5ca3be56f66200a1bb2262e948ac08dbc672bc8033c1ada743787b0c667dea6"
          }
        }

This @id causes me some stomach pain as a developer. A UUID for every hash algorithm-value pair? I am aware of some systems that do indexing at potentially the level of every JSON @type-bearing (non-@value-bearing) object, so I appreciate that this might be necessary. I'd really hate to make another object that stores that same hash algorithm-value pair, though. The index load would feel pretty gross.

On the brighter side, if types:Hash objects could be shared, we might actually get query-time benefits from letting users use indexing on these types:Hash nodes' identifiers. Requiring UUIDv4s would keep UCO at its current level of being able to compute matching hash values: only by full comparison of the hash string value and method.

As a summary effect: I would like observable:hash to be protected from being a owl:InverseFunctionalProperty, perhaps with something like this update, changing the comment from:

Hash values of the data.

To:

A hash value of the data. As part of UCO OWL modeling, this property is intentionally neither an owl:FunctionalProperty, nor an owl:InverseFunctionalProperty.

May we expand the scope of this proposal to include this revision to observable:hash?

sbarnum commented 1 year ago

I think I may have discovered the root of our disconnect.

I just noticed that types:Identifier is currently only defined as a generic rdfs:Datatype with no further detail. This was never the intention. It was always intended to be a Datatype constraining the value of xsd:string with a regex for our agreed form of IRI value for an object identifier. We discussed this at length a few years back and I could have sworn we added it in to the definition of types:Identifier but it is obviously not there now. I don't know if we never finished that work or if it got put in and then pulled out at some point. At a minimum the defined constraint on string should be a regex for an IRI. More specifically it should constrain it to the UCO identifier pattern we developed that ensured global uniqueness and simply supported linked-data. It was "<type>-<UUID>" (this is the pattern we use in examples) where the UUID was at least v4 but eventually we would like to support v5 for autogeneration based on semantically relevant content of the object (this v5 approach would handle the hash reuse issue you describe above).
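
To make the v5 idea concrete, here is a minimal Python sketch of a deterministic identifier for a hash method-value pair, using the standard library's `uuid.uuid5`. The namespace value, the `kb:` prefix default, the name format, and the case normalization are all assumptions made for illustration. None of them are defined by UCO.

```python
import uuid

# Hypothetical namespace UUID for hash identifiers. A real deployment would
# need to agree on and publish its own namespace value.
HASH_NAMESPACE = uuid.uuid5(uuid.NAMESPACE_URL, "http://example.org/kb/hash")


def hash_node_id(method: str, value: str, prefix: str = "kb:") -> str:
    """Derive a repeatable identifier from a hash method and hash value."""
    # Assumed normalization: method names and hex digits are case-insensitive.
    name = f"{method.upper()}:{value.lower()}"
    return f"{prefix}hash-{uuid.uuid5(HASH_NAMESPACE, name)}"


sha256 = "e5ca3be56f66200a1bb2262e948ac08dbc672bc8033c1ada743787b0c667dea6"
print(hash_node_id("SHA256", sha256))
# Two producers computing the identifier independently get the same result.
print(hash_node_id("sha256", sha256.upper()) == hash_node_id("SHA256", sha256))
```

With identifiers derived this way, two graphs that describe the same hash would refer to the same node after a merge instead of holding two UUIDv4-named copies of it.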

I think the issue is we need to complete the definition of types:Identifier as described above. Once that is done, I believe the rest of this CP should work unless I am completely missing something.

At that point types:Identifier is a string with particular value constraints. core:id has range of types:Identifier so is a string with particular value constraints (which ensure it is a valid IRI identifier). core:type already has a range of xsd:string and is defined as "The explicitly-defined type of characterization of a concept."

"core:id": "@id", "core:type": "@type",

in the json-ld context simply changes

"uco-core:id": "kb:hash-87c24a7f-a0d2-41a3-a726-0521a5c7bc8c",
"uco-core:type": "uco-types:Hash",

to

"@id": "kb:hash-87c24a7f-a0d2-41a3-a726-0521a5c7bc8c",
"@type": "uco-types:Hash",

The value strings do not change at all. They are valid values of the core:id and core:type properties, including core:id having a range of types:Identifier. They are also valid in json-ld as the value of @id is a valid node identifier and the value of @type is a valid rdf:type string identifier.

Am I missing some other dimension to this or was the root of our disconnect the fact that types:Identifier is currently incompletely defined?

ajnelson-nist commented 1 year ago

You are still not understanding that trying to use this will break JSON-LD:

"core:id": "@id",
"core:type": "@type",

Please test that.

sbarnum commented 1 year ago

You are correct that

"core:id": "@id",
"core:type": "@type",

are invalid. I forgot that left-side keys in the context cannot be prefixed entities. I would not categorize this as breaking json-ld but it is definitely invalid syntax for the context and would throw errors in the json-ld processor. I changed the json-ld context CP back to the way it was.

I still have not seen any convincing argumentation/evidence that the presence of core:id and core:type "break" anything. What I have seen is an assertion that they may be confusing in regards to the bindings for these concepts to RDF which I would agree with.

The remaining challenge I see is how we express the requirements for these concepts/properties if we remove them. It is true that for RDF serializations the required rdf:type (@type in json-ld) is implicitly linked to the class for which the object is an individual of and that the subject of RDF triples are inherently IRI identifiers. And if we modify the sh:nodeKind for all ObjectProperties in UCO to be sh:IRI then we implicitly require object IDs to be IRIs and not blank nodes.

For other serializations these requirements and linkages are not implicit and we need a way to convey them. Further, without any explicit representation for an id property in the ontology how do we express the desired IRI formatting constraint for UCO identifiers?

While RDF/JSON-LD are the specific minimally targeted and fully supported serializations for 1.0.0 there is a significant difference between the intention to fully support other serializations and simply to not make decisions that block them. It has always been a fundamental principle of UCO that our serialization support is inclusive not exclusive. For 1.0.0 we are not going to fully flesh out serialization support beyond RDF/JSON-LD but we need to make sure we do not presume that these will be the only serializations for UCO and make design/implementation decisions that prevent other serializations from being practical.

If we can identify how we can do the following without the core:id and core:type properties then I am okay with removing them for 1.0.0:

  • assert that all objects in any serialization must have a globally unique identifier and an explicit assertion of type tied to a UCO class
  • assert the desired IRI formatting for UCO identifiers

ajnelson-nist commented 1 year ago

Re:

assert that all objects in any serialization must have a globally unique identifier and an explicit assertion of type tied to a UCO class

I feel this is an impossible requirement to satisfy a priori. I know of no enumeration of serialization formats broken out by whether they have an elementary structure of a node identifier or not. XML outside of RDF doesn't. YAML...I don't know.

For the targeted support serializations, which are based on RDF, core:id is a hindrance. It is a repetition of the "Subject" position of a triple. I admit RDF seems to have danced around not using the term "ID", and instead using "The subject of a triple." But in the RDF serialization, usage of core:id seems it should be actively discouraged, as it can only repeat, as a string serialization, the RDF-structural node identifier.

Re:

assert the desired IRI formatting for UCO identifiers

You can say "Desired," but it would be a complete information siloing act to say "Required." If you require a format for node identifiers, UCO is incompatible with every application that predates UCO, where rdf:Resources serve as ideas. How would you, for instance, say that this IRI (which has a label familiar to this community) is also a UCO identity:Organization?

<http://www.wikidata.org/entity/Q2464882>
    rdfs:label "Netherlands Forensic Institute"@en .

(Edit: I'd initially copied the URL instead of the concept IRI. Now fixed here and below.)

If this next block of Turtle is invalid UCO because of that yet-unspecified types:Identifier, then UCO is an information silo and fails semantic web interoperability.

<http://www.wikidata.org/entity/Q2464882>
    a uco-identity:Organization ;
    rdfs:label "Netherlands Forensic Institute"@en .

I do not think it would be helpful for UCO to attempt prescribing any type of format for concept IRIs. I'd omitted removing the types:Identifier datatype in the core:id proposal, but if it is a more-harm-than-good concept to retain, I would also suggest deleting it.

Last, re: core:type - I believe this does not sufficiently differ from rdf:type to merit retaining. It also demonstrates, to me, a UCO willingness to invent, and re-invent, rather than adopt, which looks particularly fragmentative when what's being re-invented is a part of a specification already adopted as a foundational technology (RDF). Further, your YAML illustration made it seem likely to me other serializations of UCO would also need to support namespacing because of UCO's use of namespaces to house concepts with matching basenames (aka fragments, e.g. startTime in both core: and action:). If so, the rdf: prefix is just as available for use as UCO's several, so core:type appears, again, moot and incompatibly typed versus core RDF.

For RDF-based applications, I think this proposal's requirements on nodes bearing non-blank identifiers can be satisfied with sh:nodeKind sh:IRI being used in place of sh:nodeKind sh:BlankNodeOrIRI.

ajnelson-nist commented 1 year ago

@sbarnum , something else you should be aware of: Some JSON-LD serializers are likely to make every node that has an @id key into a "Top"-level (that is, not nested) JSON object in the @graph array. So, this proposal has an additional risk, that JSON-LD examples that are programmatically generated (such as some CASE examples) may be significantly more difficult to read by eye, due to Facets being at potentially far-flung regions of the file compared to their housing UcoObject.

I'm not actually 100% sure whether there is a technical solution to this yet, or if the problem has non-standard workarounds, but there is a specification that tries to say when some objects, even with @ids, should nest in one another. That standard is JSON-LD Framing, but it is currently only an Editor's Draft.
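
A minimal Python sketch of that hoisting behavior is below, to show why the readability risk grows once every Facet has an @id. The `flatten` function is an illustrative assumption and not the algorithm of any particular serializer. Unlike real JSON-LD flattening, it does not merge repeated nodes, label blank nodes, or process @context.

```python
def flatten(doc):
    """Hoist every nested node object that has an @id into a flat @graph list."""
    top_level = []

    def visit(value):
        if isinstance(value, list):
            return [visit(item) for item in value]
        if isinstance(value, dict) and "@value" not in value:
            node = {}
            if "@id" in value:
                # Reserve the slot first so a parent precedes its children.
                top_level.append(node)
            for key, child in value.items():
                node[key] = child if key.startswith("@") else visit(child)
            # An identified node is replaced in place by a bare reference.
            return {"@id": value["@id"]} if "@id" in value else node
        return value  # plain strings, numbers, and value objects pass through

    for item in doc.get("@graph", [doc]):
        result = visit(item)
        if "@id" not in result:
            top_level.append(result)  # keep unidentified top-level nodes
    return {"@graph": top_level}


# Shortened, hypothetical identifiers for readability.
nested = {
    "@id": "kb:file-1",
    "@type": "uco-observable:File",
    "uco-core:hasFacet": [
        {
            "@id": "kb:facet-1",
            "@type": "uco-observable:ContentDataFacet",
            "uco-observable:hash": [
                {
                    "@id": "kb:hash-1",
                    "@type": "uco-types:Hash",
                    "uco-types:hashValue": {"@type": "xsd:hexBinary", "@value": "e5ca"},
                }
            ],
        }
    ],
}

for node in flatten(nested)["@graph"]:
    print(node)
```

In the output the File, its ContentDataFacet, and the Hash are three sibling objects joined only by @id references, which is the layout a reader would have to follow by eye in a large generated file.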

ajnelson-nist commented 1 year ago

Also, there is a slight error in some of the motivation for this proposal:

Local NodeIds are typically assigned by RDF processors following similar or identical algorithms for each set of content leading to a certainty of ID conflicts in merged content.

This is incorrect if remaining in the context of RDF processors sending data between one another. If a blank node is loaded, the RDF processor must generate a process-local identifier on reading. These two files would not cause a conflict if loaded into the same graph instance:

_:x rdfs:comment "I am node x." .
_:x rdfs:comment "I am node x." .

Yes, they are the same content to the eye, but the engine will assign a new (typically skolemized) random-ish identifier in place of _:x. The length of the total graph will be two distinct triples, not one repeated.

I believe there is next to no risk of ID conflicts when merging content. That said, there are other detractors to using blank nodes, because even when you see their name serialized like _:x in a file, you can't write code within an RDF engine to say "Describe _:x", so there is still good reason to require non-blank identifiers.
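
The loading behavior described above can be modeled in a few lines of Python. This is only a sketch of the scoping rule, not an RDF parser: triples are plain tuples of strings, and the `load` function and its counter-based relabeling are assumptions made for illustration.

```python
import itertools

_fresh = itertools.count()  # source of process-local blank node labels


def load(triples, graph):
    """Add one file's triples to graph, scoping blank node labels to that file."""
    local = {}  # label as written in this file -> fresh process-local label

    def term(t):
        if t.startswith("_:"):
            if t not in local:
                local[t] = f"_:b{next(_fresh)}"
            return local[t]
        return t

    for s, p, o in triples:
        graph.add((term(s), p, term(o)))


file_a = [("_:x", "rdfs:comment", '"I am node x."')]
file_b = [("_:x", "rdfs:comment", '"I am node x."')]

graph = set()
load(file_a, graph)
load(file_b, graph)
print(len(graph))  # two triples, each with its own subject
```

Because the label map is rebuilt for every call, the `_:x` in one file can never be confused with the `_:x` in another. The same property also means no later query can address either node by the name `_:x`.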

ajnelson-nist commented 1 year ago

@sbarnum - while reviewing UCO's Jira backlog, I came across OC-200 that runs through a whole list of things (many in the observable namespace) that have no parent class.

Rather than enumerate those classes here, I believe the solution of this proposal needs to incorporate the following SPARQL query into CI, failing CI if there are any finds other than your proposed top-level class.

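PREFIX owl: <http://www.w3.org/2002/07/owl#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

# Find every class that declares no superclass.
SELECT ?nClass
WHERE {
    ?nClass a owl:Class .
    FILTER NOT EXISTS {
        ?nClass rdfs:subClassOf ?nOtherClass .
    }
}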

That query should be run against the monolithic build of UCO (a temporary artifact of the CI workflow under /tests), after deleting (from an in-memory copy) all triples of the form x rdfs:subClassOf owl:Thing .
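
For readers who want to see the logic of that check outside a SPARQL engine, here is a minimal Python sketch over triples held as plain string tuples. The function name `parentless_classes`, the prefixed-name strings, and the sample triples are illustrative assumptions. The actual CI test would run the SPARQL query against the monolithic build.

```python
RDF_TYPE = "rdf:type"
OWL_CLASS = "owl:Class"
OWL_THING = "owl:Thing"
SUBCLASS_OF = "rdfs:subClassOf"


def parentless_classes(triples, allowed=frozenset()):
    """List classes with no rdfs:subClassOf statement other than to owl:Thing."""
    classes = {s for s, p, o in triples if p == RDF_TYPE and o == OWL_CLASS}
    # Ignoring subClassOf owl:Thing mirrors deleting those triples first.
    with_parent = {s for s, p, o in triples if p == SUBCLASS_OF and o != OWL_THING}
    return sorted(classes - with_parent - set(allowed))


sample = [
    ("core:UcoThing", RDF_TYPE, OWL_CLASS),
    ("core:UcoObject", RDF_TYPE, OWL_CLASS),
    ("core:UcoObject", SUBCLASS_OF, "core:UcoThing"),
    ("observable:MimePartType", RDF_TYPE, OWL_CLASS),
    ("observable:MimePartType", SUBCLASS_OF, OWL_THING),
]

# CI would fail if this list is non-empty.
print(parentless_classes(sample, allowed={"core:UcoThing"}))
```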

ajnelson-nist commented 1 year ago

Also, a style matter, more artistic opinion than technical issue:

core:ClassBase feels like a heavily object oriented programming oriented term, and awkward as a top-level class vs. core:UcoObject. May we borrow a name pattern from OWL, and call UCO's top-level class core:UcoThing instead of ClassBase?

ajnelson-nist commented 1 year ago

As a further argument for core:UcoThing over core:ClassBase: Verbalizing.

"Here in my graph, I have X, a UCO types hash, which is also a UCO core class base, which is also an OWL thing."

Versus:

"Here in my graph, I have X, a UCO types hash, which is also a UCO core UCO thing, which is also an OWL thing."

sbarnum commented 1 year ago

I agree on having a CI SPARQL check to ensure all classes have defined superclasses.

I also do not object to core:UcoThing.

sbarnum commented 1 year ago

I state with the certainty of experience that blank nodes WILL cause integrity issues when merged into a graph store.

Unique IRIs are required for all objects.

ajnelson-nist commented 1 year ago

@sbarnum : you made a few claims in yesterday's meeting, about blank node behaviors, that did not agree with my understanding of some specification---I assume RDF's---and how blank nodes behave when consumed by multiple tools. That is one of the key motivators for this proposal, and your citation chain currently stops at "[your] experience."

Part of the solution for this proposal will be implementing this query as part of a SHACL-SPARQL constraint:

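PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX uco-core: <https://ontology.unifiedcyberontology.org/uco/core/>

# Find UcoObject instances whose IRI does not end with a UUID.
SELECT ?nThing
WHERE {
        ?nThing a/rdfs:subClassOf* uco-core:UcoObject .
        FILTER (
                ! REGEX (
                        STR(?nThing),
                        "[0-9a-f]{8}-[0-9a-f]{4}-[0-5][0-9a-f]{3}-[0-9a-f]{4}-[0-9a-f]{12}$",
                        "i"
                )
        )
}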

(That will be adapted to use uco-core:UcoThing. I gave the query above to @gwebb-case for his assistance with our examples' UUID review.)
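
The same pattern can be tried quickly in Python before it goes into a SHACL-SPARQL constraint. This is a minimal sketch using the standard `re` module with the regular expression copied from the query above. The function name `lacks_uuid_suffix` and the example.org IRIs are assumptions for illustration.

```python
import re

# Same pattern as the SPARQL REGEX call, with the "i" flag as re.IGNORECASE.
UUID_SUFFIX = re.compile(
    r"[0-9a-f]{8}-[0-9a-f]{4}-[0-5][0-9a-f]{3}-[0-9a-f]{4}-[0-9a-f]{12}$",
    re.IGNORECASE,
)


def lacks_uuid_suffix(iris):
    """Return the IRIs that the constraint would report."""
    # SPARQL REGEX matches anywhere in the string, so use search, not match.
    return [iri for iri in iris if not UUID_SUFFIX.search(iri)]


iris = [
    "http://example.org/kb/person-952c09ff-5a38-483b-9dcf-6d8f0b27dfac",
    "http://www.wikidata.org/entity/Q2464882",
]
print(lacks_uuid_suffix(iris))
```

The Wikidata IRI from earlier in this thread is reported by the check, which shows the review cost of the rule for any data that reuses identifiers minted outside UCO.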

I believe this is a pretty significantly CPU-expensive query to compute, and person-expensive query to review when a use case justifies using an IRI form that does not end with UUIDs. I would strongly prefer its usage be justified by more than "Your experience."

Can you please provide, for the understanding of users downstream who come to UCO complaining about the runtime or log-volume of this review rule:

  1. The section of the RDF or RDFS spec that you've seen tools use to collide blank nodes.
  2. If possible, a technology demonstration of some tool that collides blank node identifiers, using these two graph files:
_:x <http://www.w3.org/2000/01/rdf-schema#comment> "I am anonymous-node x." .
_:x <http://www.w3.org/2000/01/rdf-schema#comment> "I am ANOTHER anonymous-node x." .

I had expected any RDF 1.1-conformant tool that loads those two files would have two independent subjects with one comment each, not one subject with two comments. I haven't seen rdflib or rdf-toolkit do this.

sbarnum commented 1 year ago

I do not have cycles to provide a technology demonstration of some tool as I am on family vacation from early morning Saturday through Thursday and am frantically working to finish 100 things before heading out. Already in hot water for having to dial in to the Thursday meeting when supposed to be doing last day in NY with family. I don't think it should be necessary though, as the blank node sections of the RDF 1.1 spec are fairly clear on the matter. The NOTE in section "3.4 Blank Nodes" of the spec states:

Blank node identifiers are local identifiers that are used in some concrete RDF syntaxes or RDF store implementations. They are always locally scoped to the file or RDF store, and are not persistent or portable identifiers for blank nodes. Blank node identifiers are not part of the RDF abstract syntax, but are entirely dependent on the concrete syntax or implementation. The syntactic restrictions on blank node identifiers, if any, therefore also depend on the concrete RDF syntax or implementation. Implementations that handle blank node identifiers in concrete syntaxes need to be careful not to create the same blank node from multiple occurrences of the same blank node identifier except in situations where this is supported by the syntax.

I bolded a few portions above. Blank nodes are "always locally scoped to the file or RDF store, and are not persistent or portable identifiers for blank nodes". While some tooling may attempt to help in such situations the spec specifically calls out that this is not a requirement. Many tooling implementations do not handle this either because they never tried or because they found that the solutions attempted by others are actually incorrect or problematic in different ways.

In section "3.5 Replacing Blank Nodes with IRIs" it states:

Blank nodes do not have identifiers in the RDF abstract syntax. The blank node identifiers introduced by some concrete syntaxes have only local scope and are purely an artifact of the serialization.

The spec does not dictate how to handle blank nodes outside of local scopes. Some concrete syntax implementations handle them as serialization-dependent, locally scoped things, while others simply do not. The spec explicitly warns not to assume they have integrity outside of a local scope. Section 3.5 then explicitly provides a suggested approach to overcoming this limitation for situations like ours where stronger identification is needed:

In situations where stronger identification is needed, systems may systematically replace some or all of the blank nodes in an RDF graph with IRIs. Systems wishing to do this should mint a new, globally unique IRI (a Skolem IRI) for each blank node so replaced.

This is what we are proposing to do.
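
A minimal Python sketch of that replacement step follows. The `/.well-known/genid/` path is the form suggested in section 3.5 of the RDF 1.1 Concepts specification. The `skolemize` function, the tuple representation of triples, and the example.org authority are assumptions made for illustration.

```python
import uuid


def skolemize(triples, authority="http://example.org"):
    """Replace each blank node label with a newly minted Skolem IRI."""
    minted = {}  # blank node label -> Skolem IRI, valid for this graph only

    def term(t):
        if t.startswith("_:"):
            if t not in minted:
                minted[t] = f"{authority}/.well-known/genid/{uuid.uuid4()}"
            return minted[t]
        return t

    return [(term(s), p, term(o)) for s, p, o in triples]


graph = [
    ("kb:file-1", "uco-core:hasFacet", "_:facet"),
    ("_:facet", "rdf:type", "uco-observable:ContentDataFacet"),
]
for triple in skolemize(graph):
    print(triple)
```

Within one call, every occurrence of a label maps to the same IRI, so the link between the File and its Facet survives. Separate calls mint different IRIs, so nodes from different source graphs stay distinct after a merge.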

ajnelson-nist commented 1 year ago

@sbarnum : This patch introduces core:UcoThing and starts an adjustment to core:UcoObject. Can you please provide definition rdfs:comments that let these two classes work?

Also, there are around 40 different classes that currently don't have a parent and will go under core:UcoThing. It might be erroneous to call them all disjoint classes. Can you please provide a class name for them, so the disjoint subclasses of UcoThing would be only UcoObject, Facet, and this other class? As best as I can tell from the definition of core:UcoObject, the only thing unifying these is that they are (probably?) not:

  • fundamental concepts
  • inter-relatable

The current parent-less classes, per the unit test query and cutting away Facet and UcoThing, are:

?nClass
0 https://ontology.unifiedcyberontology.org/uco/action/ArrayOfAction
1 https://ontology.unifiedcyberontology.org/uco/core/ExternalReference
4 https://ontology.unifiedcyberontology.org/uco/marking/GranularMarking
5 https://ontology.unifiedcyberontology.org/uco/marking/MarkingModel
6 https://ontology.unifiedcyberontology.org/uco/observable/ContactAddress
7 https://ontology.unifiedcyberontology.org/uco/observable/ContactAffiliation
8 https://ontology.unifiedcyberontology.org/uco/observable/ContactEmail
9 https://ontology.unifiedcyberontology.org/uco/observable/ContactMessaging
10 https://ontology.unifiedcyberontology.org/uco/observable/ContactPhone
11 https://ontology.unifiedcyberontology.org/uco/observable/ContactProfile
12 https://ontology.unifiedcyberontology.org/uco/observable/ContactSIP
13 https://ontology.unifiedcyberontology.org/uco/observable/ContactURL
14 https://ontology.unifiedcyberontology.org/uco/observable/EnvironmentVariable
15 https://ontology.unifiedcyberontology.org/uco/observable/ExtractedString
16 https://ontology.unifiedcyberontology.org/uco/observable/GlobalFlagType
17 https://ontology.unifiedcyberontology.org/uco/observable/IComHandlerActionType
18 https://ontology.unifiedcyberontology.org/uco/observable/IExecActionType
19 https://ontology.unifiedcyberontology.org/uco/observable/IShowMessageActionType
20 https://ontology.unifiedcyberontology.org/uco/observable/MimePartType
21 https://ontology.unifiedcyberontology.org/uco/observable/TaskActionType
22 https://ontology.unifiedcyberontology.org/uco/observable/TriggerType
23 https://ontology.unifiedcyberontology.org/uco/observable/URLHistoryEntry
24 https://ontology.unifiedcyberontology.org/uco/observable/WhoisRegistrarInfoType
25 https://ontology.unifiedcyberontology.org/uco/observable/WindowsPEFileHeader
26 https://ontology.unifiedcyberontology.org/uco/observable/WindowsPEOptionalHeader
27 https://ontology.unifiedcyberontology.org/uco/observable/WindowsPESection
28 https://ontology.unifiedcyberontology.org/uco/observable/WindowsRegistryValue
29 https://ontology.unifiedcyberontology.org/uco/pattern/PatternExpression
30 https://ontology.unifiedcyberontology.org/uco/tool/BuildConfigurationType
31 https://ontology.unifiedcyberontology.org/uco/tool/BuildInformationType
32 https://ontology.unifiedcyberontology.org/uco/tool/BuildUtilityType
33 https://ontology.unifiedcyberontology.org/uco/tool/CompilerType
34 https://ontology.unifiedcyberontology.org/uco/tool/ConfigurationSettingType
35 https://ontology.unifiedcyberontology.org/uco/tool/DependencyType
36 https://ontology.unifiedcyberontology.org/uco/tool/LibraryType
37 https://ontology.unifiedcyberontology.org/uco/types/ControlledDictionary
38 https://ontology.unifiedcyberontology.org/uco/types/ControlledDictionaryEntry
39 https://ontology.unifiedcyberontology.org/uco/types/Dictionary
40 https://ontology.unifiedcyberontology.org/uco/types/DictionaryEntry
41 https://ontology.unifiedcyberontology.org/uco/types/Hash

If you need to provide the definitions in just a Github comment, I'm happy to transcribe them as I build and test this patch series.

ajnelson-nist commented 1 year ago

An implementation for this proposal has been posted, except for adjustments that need to be done to all of the JSON-LD examples inlined in UCO. CI will fail on the examples until that is done. It's unfortunately not an entirely mechanical process due to node identifiers being duplicated without semantic intent across some of the examples. UCO users may be interested in this script that I'll be using on the CASE website CASE-Examples repository.

https://github.com/ajnelson-nist/CASE-Examples-QC/blob/main/src/issue_430_conformance_sed.py

Thanks to a certain community member who has already done some of the legwork of IRI conversion for CASE-Examples.

PR 467 has the implementation commits, whose documentation (Git log messages) I strongly suggest the committee review. The one that I think will raise the most eyebrows is b491c21. Another effect of OWL conformance is that references to IRIs that need to be treated as class IRIs (such as in an owl:AllDisjointClasses) need to be designated as classes within the OWL transitive import closure. The UCO Core namespace doesn't import anything, so all of the disjoint classes need to be called classes in the Core namespace.

I personally think it'd be worth including some rdfs:isDefinedBy links to the respective "downstream from Core" ontologies, e.g.:

types:Hash
    a owl:Class ;
    rdfs:isDefinedBy <https://ontology.unifiedcyberontology.org/uco/types>
    .

Feedback is welcome on how to make those free-floating x a owl:Class appear less like afterthoughts to those who don't use git blame to see the lengthy logged description.

I will now be switching attention to other proposals, returning to implement the examples' IRI updates after the agenda for tomorrow is settled.

sbarnum commented 1 year ago

I would strongly propose that we do NOT want those class assertions in the core namespace, as it violates the intended purpose of separating the various namespaces in the first place, which is primarily to support granular adoption of UCO, where a given adopting application domain or user could choose to utilize the portions relevant to them without needing to worry about the rest. It looks like the only reason that this OWL conformance question is relevant is the intent to declare all the classes as disjoint. I would assert that this disjoint declaration is a convenience rather than a requirement, and that the potential effect outlined above makes it a convenience not worth the price. I would propose that we remove the disjoint declaration and all of the extraneous cross-namespace class declarations.

If we absolutely must have some sort of disjoint statement (which I do not believe we need), then rather than the above tainting of the core namespace and UCO granularity, we should simply define a parent class for all of the UcoThings other than UcoObject, make Facet a subclass of this new class, and place it within core. That way the disjoint declaration can be only for UcoObject and the new class, and it would not force tainting of core. If we had to do this, then I would propose something like core:UcoInherentCharacterizationThing with a definition of "A UCO inherent characterization thing is a grouping of characteristics unique to a particular inherent aspect of a UCO domain object." and modify the definition of core:Facet to be something like "A facet is a grouping of characteristics singularly unique to a particular inherent aspect of a UCO domain object." The key differences between Facets and other InherentCharacterizationThings are that: 1) Facets are only ever associated to a UcoObject with the core:hasFacet property, and 2) the core:hasFacet property can have a cardinality >1 but cannot include >1 instance of a particular Facet subclass. Once again, I feel this is an unnecessary complication of the class hierarchy and that a better solution would be to remove the convenience disjoint assertion, but if it must stay, then the above approach appears to be more effective with less downside.
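
To make the second Facet rule above concrete (many core:hasFacet values allowed, but at most one per Facet subclass), here is a small Python sketch. The function name and the flattened input, one class name per core:hasFacet value, are hypothetical simplifications for illustration. It is not how UCO validates this rule.

```python
from collections import Counter


def duplicated_facet_classes(facet_classes):
    """Return the Facet subclasses that occur more than once on one object.

    `facet_classes` is a hypothetical flattened view of a single UcoObject:
    one entry per core:hasFacet value, holding that facet's class name.
    """
    counts = Counter(facet_classes)
    return sorted(name for name, n in counts.items() if n > 1)
```

Under this sketch, an object carrying one observable:FileFacet and one observable:ContentDataFacet yields an empty list, while an object carrying two observable:FileFacet instances yields that class name as a violation.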

ajnelson-nist commented 1 year ago

I think UCO needs to make further use of disjointness statements. Right now (in develop), this is permitted by UCO's encoding, and should be flagged in as "upstream" a manner as possible:

kb:thing
  a
    core:File ,
    identity:Person ,
    types:ControlledDictionary
    .

Let's at least get types:ControlledDictionary out of that mix with this new near-top-level class. core:File not being an identity:Person, we'll leave for a later discussion.
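
For illustration only, here is a minimal Python sketch of the check that a UcoObject / UcoInherentCharacterizationThing disjointness statement would enable for the kb:thing example. The `BRANCH_OF` mapping is hand-written and hypothetical (it is not derived from the ontology), and the function names are made up for this sketch. The second branch uses the class name proposed earlier in this thread.

```python
# Hypothetical, hand-written mapping from a class to its near-top-level
# branch under core:UcoThing. A real check would derive this from the
# ontology's rdfs:subClassOf chain rather than hard-coding it.
BRANCH_OF = {
    "core:File": "core:UcoObject",
    "identity:Person": "core:UcoObject",
    "types:ControlledDictionary": "core:UcoInherentCharacterizationThing",
}


def branches_claimed(types):
    """Return the set of near-top-level branches that a node's types fall under."""
    return {BRANCH_OF[t] for t in types}


def violates_disjointness(types):
    """True if one node is typed into more than one disjoint branch."""
    return len(branches_claimed(types)) > 1
```

With this mapping, the three types asserted on kb:thing span both branches and are flagged, while core:File together with identity:Person is not, which corresponds to the case left for a later discussion.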

Thank you for the class name and definition, @sbarnum , I will mix it in and revert the change that added all of the downstream-namespace classes.

sbarnum commented 1 year ago

ok. Thanks. I am a little unclear what you mean by

Let's at least get types:ControlledDictionary out of that mix with this new near-top-level class.

as all classes (including types:ControlledDictionary) defined in UCO would be UcoThing, and all UcoThing classes other than UcoObject would be UcoInherentCharacterizationThing. In this case it is just a more generally structured "grouping of characteristics unique to a particular inherent aspect of a UCO domain object", such as its use for the observable:exifData property of the observable:EXIFFacet.

ajnelson-nist commented 1 year ago

@sbarnum , I have reviewed the entire comment thread and have not found a definition of UcoThing (née ClassBase) that you said was in here.

The current definition that I wrote is about as non-committal as I'd like it to be:

UcoThing is the top-level class within UCO.