ucoProject / UCO

This repository is for development of the Unified Cyber Ontology.
Apache License 2.0

Allow to circumvent identifying UcoThing's through UUID enforcement for digital resource data #606

Open plbt5 opened 1 month ago

plbt5 commented 1 month ago

(Submitted by @plbt5 and @ajnelson-nist.)

Background

UCO must work with data from many different organisations. To avoid conflicts over data uniqueness, it has been decided that data entering the digital CDO realm must be uniquely identified by an IRI that ends with a UUID. This is enforced by a SHACL rule applied to each instance of UcoThing, which verifies that its IRI ends with UUID syntax.
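To make the current rule concrete, here is a minimal Python sketch of an "IRI ends with a UUID" check. The regular expression and the function name `ends_with_uuid` are assumptions for illustration only; the actual SHACL shape in UCO may use a different pattern. The `kb:` prefix is assumed to expand to `http://example.org/kb/`.

```python
import re

# Hypothetical pattern: the IRI must end with a UUID in the common
# 8-4-4-4-12 hexadecimal form. UCO's real SHACL shape may differ.
UUID_SUFFIX = re.compile(
    r"[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}$",
    re.IGNORECASE,
)


def ends_with_uuid(iri):
    """Return True when the IRI string ends with UUID syntax."""
    return UUID_SUFFIX.search(iri) is not None


# A kb:-style node passes, a plain web URL does not.
kb_node = "http://example.org/kb/WebPage-d40dd5c8-1f68-490d-b6b8-bd90b1251b4b"
web_url = "https://caseontology.org/index.html"
print(ends_with_uuid(kb_node), ends_with_uuid(web_url))
```

Under this sketch, the web page URL is exactly the kind of identifier that the current rule rejects.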

Unfortunately, this rule does not account for data that represent a digital web resource, e.g., <https://caseontology.org/index.html>. For this category of data, adding a UUID to the IRI invalidates the identification of the web resource, whereas leaving the UUID out triggers a validation complaint. Both outcomes impede the data interoperability purpose of CDO/UCO/CASE. Moreover, adding a UUID to data that represent a web resource defeats the point of the rule, since the identifiers of web resources are already required to be unique in order to be resolvable to a web location. Hence this change request.

We recall the distinction, already identified by RDF, between information and non-information resources. This distinction ended up in RFC 9110 (HTTP Semantics). We have depicted their application and distinctions in Figure 1 below:

[Figure: Distinction between URIs/URN/URL and Information Resources versus Non-information Resources]

Figure 1 - Information and non-information resources: their relationship and differences

(Note: For the purposes of this proposal, please consider URI and IRI as synonymous.)

The distinction between an Information Resource (IR) and a Non-Information Resource (NIR) cannot be determined from the URI itself, only from the response that one gets from the server. If the URI concerns a NIR, the server cannot respond with the thing itself, because there does not yet exist something like Elephant-Over-IP or Paul-Over-IP (a.k.a. "Beam me up, Scotty") in the protocols. Instead, the server will respond with an HTTP 303 status, redirecting to a URI that is an Information Resource. Visiting the NIR thus discloses information about the NIR, as opposed to the real thing itself.
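As a rough illustration of deciding from the server response rather than from the URI, the following Python sketch maps an already observed HTTP status code to a tentative classification. The function name `classify_by_status` is hypothetical, and treating any 2xx response as evidence of an IR is a simplifying assumption of this sketch, not something this proposal prescribes.

```python
def classify_by_status(status):
    """Return "IR", "NIR", or None when the status alone does not decide."""
    if 200 <= status <= 299:
        # Assumption: the server returned a representation of the resource.
        return "IR"
    if status == 303:
        # "See Other": the server redirects to a description located elsewhere.
        return "NIR"
    # Other codes (errors, other redirects) are left undecided in this sketch.
    return None


print(classify_by_status(200), classify_by_status(303), classify_by_status(404))
```

The sketch deliberately takes only the status code as input, so the same URI can yield different classifications for different observations.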

This distinction is instrumental for a lot of things that are built with RDF(s) and OWL, and it is something that UCO should at least recognize as current practice.

Requirements

Requirement 1

Allow self-identifying information resources to reside in the graph without any additions to, or changes in, their already unique identification.

Requirement 2

Allow asserting the distinction between NIRs and IRs, or allow a resource to be both.

A resource can be both an IR and a NIR because it can be perceived as an IR or NIR depending on constraints or business rules as implemented by the server, e.g., serving pages in different languages when requested from different geographical locations.

Requirement 3

A single web resource MUST be able to be represented as an IR or an NIR as appropriate at different times, when analysing a cyber incident such as a domain-hijacking.

Solution suggestion

One solution is to apply reification on the digital resources' identifiers as URLs in order to incorporate them into a UCO knowledge graph. And although this might suffice, it seems a bit wasteful to first obfuscate a valid URI for a digital resource into a UUID-ending URI, followed by a pattern to elucidate it again.

We take the distinction between the NIR and IR as the essence of the solution. The implementation would need to introduce the distinction between Non-Information and Information Resources. These would become two additional near-top-level classes, under core:UcoThing. This would be a nod to the concepts really being RDFS concepts, but not defined with RDFS IRIs. We should also avoid entailing the RDFS semantics of rdfs:Resource being the top-level class, because of the tension this would create with OWL and owl:Thing being the top-level class.

Introducing this distinction within UCO makes it possible to enforce the uniqueness-by-UUID demand for non-information resources only, while still allowing information resources to be included by their original URLs. Unfortunately, this does not solve the problem, because UCO cannot assume, as RFC 9110 (HTTP Semantics) does, that core:NonInformationResource and core:InformationResource are always disjoint and remain so. Instead, UCO must follow the reality where an IR can change into an NIR, as explained in Competencies 1 and 2. In these particular cases, the rule "add a UUID for NIRs only" fails.

Therefore, the actual solution is to

  1. invert the rule as follows: "If the NIR is ever known to be an IR somewhere during its lifetime, excuse it from adding a UUID to its identifier.", and
  2. identify the possibility that a resource can never become an IR.

To that end, we suggest introducing the following concepts in UCO:

core:NonInformationResource 
    rdfs:subClassOf core:UcoThing ;
    .
core:InformationResource 
    rdfs:subClassOf core:UcoThing ;
    .
core:NeverInformationResource 
    rdfs:subClassOf core:NonInformationResource ;
    owl:disjointWith core:InformationResource ;
    .
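The inverted rule, combined with the three classes above, could be sketched in Python as follows. This is only an illustration of the intended logic, not the SHACL implementation; the names `SUBCLASS_OF`, `superclasses`, and `requires_uuid` are hypothetical.

```python
# Subclass edges copied from the three class definitions above (child -> parents).
SUBCLASS_OF = {
    "core:InformationResource": {"core:UcoThing"},
    "core:NonInformationResource": {"core:UcoThing"},
    "core:NeverInformationResource": {"core:NonInformationResource"},
}


def superclasses(cls):
    """Return cls plus every class reachable over SUBCLASS_OF."""
    seen, stack = set(), [cls]
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(SUBCLASS_OF.get(current, ()))
    return seen


def requires_uuid(asserted_types):
    """Inverted rule: a node ever typed as an InformationResource is excused."""
    entailed = set()
    for asserted in asserted_types:
        entailed |= superclasses(asserted)
    return "core:InformationResource" not in entailed


# A node known as both NIR and IR is excused; a pure NIR is not.
print(requires_uuid({"core:NonInformationResource", "core:InformationResource"}))
print(requires_uuid({"core:NonInformationResource"}))
```

The check looks only at whether an InformationResource typing is present, so a node's other types never remove the excuse once it is earned.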

We also introduce observable:WebResource as a parent to observable:WebPage, to acknowledge web resources that are not yet known to be an IR or NIR, as well as to show the cyber-domain hijacking event (see Benefits section):

observable:WebResource
    rdfs:subClassOf observable:ObservableObject ;
    .
observable:WebPage
    rdfs:subClassOf observable:WebResource ;
    rdfs:subClassOf core:InformationResource ;
    .

Based on this distinction, the specifications as exemplified in the CQs become syntactically correct and semantically valid. Visually, this renders as follows:

flowchart BT

  core_UcoThing[core:UcoThing]
  core_InformationResource[core:InformationResource]
  core_NonInformationResource[core:NonInformationResource]
  core_NeverInformationResource[core:NeverInformationResource]
  core_UcoObject[core:UcoObject]
  core_Item[core:Item]
  observable_Observable[observable:Observable]
  observable_ObservableObject[observable:ObservableObject]
  observable_WebResource[observable:WebResource]
  observable_WebPage[observable:WebPage]

core_InformationResource -- ⊂ --> core_UcoThing
core_NonInformationResource -- ⊂ --> core_UcoThing
core_NeverInformationResource -- ⊂ --> core_NonInformationResource
core_InformationResource x-- ⋂=∅ --x core_NeverInformationResource

core_UcoObject -- ⊂ --> core_UcoThing
core_Item -- ⊂ --> core_UcoObject
observable_Observable -- ⊂ --> core_UcoObject
observable_ObservableObject -- ⊂ --> core_Item
observable_ObservableObject -- ⊂ --> observable_Observable
observable_WebResource -- ⊂ --> observable_ObservableObject
observable_WebPage -- ⊂ --> core_InformationResource
observable_WebPage -- ⊂ --> observable_WebResource

Apart from the above additions to UCO, we suggest performing an initial alignment. The Risks section should make clear the benefit of such alignment, particularly pertaining to some existing practices (outside of UCO) of designating graph nodes with RDF types analogous to UCO's identity:Person and observable:WebPage.

identity:Organization
    rdfs:subClassOf core:NeverInformationResource ;
    .
identity:Person
    rdfs:subClassOf core:NeverInformationResource ;
    .
observable:Device
    rdfs:subClassOf core:NeverInformationResource ;
    .
types:Dictionary
    rdfs:subClassOf core:NeverInformationResource ;
    .
types:Hash
    rdfs:subClassOf core:NeverInformationResource ;
    .
types:Thread
    rdfs:subClassOf core:NeverInformationResource ;
    .

Competencies demonstrated

Competency 1

Assume data that contain URLs as digital resources, e.g., <https://caseontology.org/index.html>, as well as data that contain non-information resources, e.g., instances of identity:Organization.

<https://caseontology.org/index.html> 
  a observable:WebPage ;
  a core:InformationResource ;   # entailed by the prior line
.

kb:WebPage-d40dd5c8-1f68-490d-b6b8-bd90b1251b4b
    a observable:WebPage ;
    a core:NonInformationResource ;   # not entailed, but is compatible with, the prior line
    core:hasFacet kb:URLFacet-93a3e3c6-77e9-4e0d-99f4-2d0477cd7263 ;
    .

kb:URLFacet-93a3e3c6-77e9-4e0d-99f4-2d0477cd7263
    a observable:URLFacet ;
    observable:fullValue "https://caseontology.org/index.html" ;
    .

Competency Question 1.1

Show that both can be included in the UCO digital realm, where the NIR must carry a UUID in its identification whereas the information resource does not have the same need.

Show all IRIs that identify a webpage:

SELECT DISTINCT ?u
WHERE {
 ?u a observable:WebPage  .
}

Result 1.1

<https://caseontology.org/index.html>
kb:WebPage-d40dd5c8-1f68-490d-b6b8-bd90b1251b4b

Competency Question 1.2

Show the distinction between the NIR that is requested and the IR that is served about the NIR.

For this distinction to be assessed, we introduce an additional relationship that expresses the HTTP 301 return code and allows the following data graph to be constructed:

<http://caseontology.org/>
  a observable:WebPage ;
  .
<https://caseontology.org/index.html>
  a observable:WebPage ;
  .
kb:Relationship-b42de25e-e0b0-4743-8b05-1ed401ea18ed
  a observable:ObservableRelationship ;
  core:isDirectional true ;
  core:kindOfRelationship "Redirects_To_By_HTTP_301" ;
  core:source <http://caseontology.org/> ;
  core:target <https://caseontology.org/index.html> ;
  .

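We formulate the following SPARQL to find all information resources arrived at by redirection, which suggests an entailment of ?sourceObject being a core:NonInformationResource. Note that the data graph only asserts observable:WebPage, so the query relies on rdfs:subClassOf entailment to match observable:WebResource and core:InformationResource: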

SELECT DISTINCT ?sourceObject ?targetIR
WHERE {
  ?relationship a observable:ObservableRelationship ;
    core:kindOfRelationship "Redirects_To_By_HTTP_301" ;
    core:source ?sourceObject ;
    core:target ?targetIR ;
    .
  ?sourceObject a observable:WebResource .
  ?targetIR a core:InformationResource .
}

Result 1.2

?sourceObject ?targetIR
<http://caseontology.org/> <https://caseontology.org/index.html>

By being rdf:type'd as an observable:WebPage, both of these IRIs are excused from the UUID review rule, even though ?sourceObject is in this case behaving as a core:NonInformationResource.

Competency 2

Say the webpage of a multilingual company (MC) is being accessed by two market analysts in a multinational organization, who routinely contribute to a shared knowledge base in the organization. Their offices are in different countries that happen to use languages MC supports, Japan and France. MC's default language is Japanese.

The Japanese analyst visits the home page, https://mc.example.co.jp/, and is served content from that URL. The French analyst visits the home page, https://mc.example.co.jp/, and is 303-redirected to https://mc.example.co.jp/lang-fr/ by server-side client-geolocation rules.

Neither analyst knows the other is trying to access https://mc.example.co.jp/.

Competency Question 2.1

What are the representations of the Japanese analyst and the French analyst, using InformationResource, NonInformationResource, NeverInformationResource, WebResource, and/or WebPage?

The Japanese analyst:

<https://mc.example.co.jp/>
    a observable:WebPage ;
    .

The French analyst:

<https://mc.example.co.jp/>
    a
        core:NonInformationResource ,
        observable:WebResource
        ;
    .
<https://mc.example.co.jp/lang-fr/>
    a observable:WebPage ;
    .

Even if pooled in the shared knowledge base, this total knowledge view remains consistent (i.e. does not raise SHACL validation errors).

<https://mc.example.co.jp/>
    a
        core:NonInformationResource ,
        observable:WebPage
        ;
    .
<https://mc.example.co.jp/lang-fr/>
    a observable:WebPage ;
    .

This provides an example of a web resource that is, by differential service, contingently an InformationResource and/or a NonInformationResource.

Competency Question 2.2

Are the views consistent when pooled into one graph without any notes on time of observation (i.e., does not raise SHACL validation issues)?

Yes. The testing in PR 610 confirms no SHACL violations are raised. The classes, and how this example avoids a class-disjointness issue, are displayed as follows (using "⊂" for subclassing (rdfs:subClassOf), "⋂=∅" for class-disjointness (owl:disjointWith), and "∈" for instantiation (rdf:type)).

flowchart BT

subgraph TBox
  core_UcoThing[core:UcoThing]
  core_InformationResource[core:InformationResource]
  core_NonInformationResource[core:NonInformationResource]
  core_NeverInformationResource[core:NeverInformationResource]
  core_UcoObject[core:UcoObject]
  core_Item[core:Item]
  observable_Observable[observable:Observable]
  observable_ObservableObject[observable:ObservableObject]
  observable_WebResource[observable:WebResource]
  observable_WebPage[observable:WebPage]
end

subgraph ABox
  wp1[https://mc.example.co.jp/]
  wp2[https://mc.example.co.jp/lang-fr]
end

core_InformationResource -- ⊂ --> core_UcoThing
core_NonInformationResource -- ⊂ --> core_UcoThing
core_NeverInformationResource -- ⊂ --> core_NonInformationResource
core_InformationResource x-- ⋂=∅ --x core_NeverInformationResource

core_UcoObject -- ⊂ --> core_UcoThing
core_Item -- ⊂ --> core_UcoObject
observable_Observable -- ⊂ --> core_UcoObject
observable_ObservableObject -- ⊂ --> core_Item
observable_ObservableObject -- ⊂ --> observable_Observable
observable_WebResource -- ⊂ --> observable_ObservableObject
observable_WebPage -- ⊂ --> core_InformationResource
observable_WebPage -- ⊂ --> observable_WebResource

wp1 -- ∈\n(per French analyst) --> core_NonInformationResource
wp1 -- ∈\n(per French analyst) --> observable_WebResource
wp1 -- ∈\n(per Japanese analyst) --> observable_WebPage
wp2 -- ∈\n(per French analyst) --> observable_WebPage
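The reasoning behind the diagram can also be traced with a small Python sketch that computes the entailed types of the pooled node and looks for a disjointness clash. It only mimics the class logic drawn above and is not a substitute for the SHACL testing; the names `entailed_types` and `disjointness_conflicts` are hypothetical, and only the subclass edges relevant to the example are included.

```python
# Subclass edges (child -> parents) and the one disjointness axiom from the TBox.
SUBCLASS_OF = {
    "core:InformationResource": {"core:UcoThing"},
    "core:NonInformationResource": {"core:UcoThing"},
    "core:NeverInformationResource": {"core:NonInformationResource"},
    "observable:WebResource": {"observable:ObservableObject"},
    "observable:WebPage": {"observable:WebResource", "core:InformationResource"},
}
DISJOINT = [("core:InformationResource", "core:NeverInformationResource")]


def entailed_types(asserted_types):
    """Return the asserted types plus all of their superclasses."""
    seen, stack = set(), list(asserted_types)
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(SUBCLASS_OF.get(current, ()))
    return seen


def disjointness_conflicts(asserted_types):
    """Return every disjoint pair whose two classes are both entailed."""
    entailed = entailed_types(asserted_types)
    return [pair for pair in DISJOINT if pair[0] in entailed and pair[1] in entailed]


# Pooled types of <https://mc.example.co.jp/> from both analysts.
pooled = {"core:NonInformationResource", "observable:WebResource", "observable:WebPage"}
print(disjointness_conflicts(pooled))
```

The pooled node is entailed to be both an InformationResource and a NonInformationResource, but never a NeverInformationResource, so the single disjointness axiom is not triggered. Typing the same node as core:NeverInformationResource would produce a conflict.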

Risk / Benefit analysis

Benefits

  1. Because URLs that are available in the digital realm as resolvable resources are already unique, and are now distinguished as such, there is no need to alter their identifying URL.
  2. A modeling benefit is that this opens up modeling of, e.g., differential server behaviors. This also exposes the semi-rigidity of InformationResource as a class, which is a fundamental security concern of the web. E.g., <https://caseontology.org/index.html> is an InformationResource today, because if you visit that URL, that is the document you are served. Suppose something wretched happens and you visit the same URL (.../index.html) tomorrow and encounter this:
kb:Relationship-89d3ae49-ab75-4689-886f-ad018a3fadbb
  a observable:ObservableRelationship ;
  core:isDirectional true ;
  core:kindOfRelationship "Redirects_To_By_HTTP_308" ;
  core:source <https://caseontology.org/index.html> ;
  core:target <https://caseontology.org/something_really_wretched.html> ;
  .

It is likely worthwhile being able to model those two web pages as direct graph individuals, <https://caseontology.org/index.html> and <https://caseontology.org/something_really_wretched.html>, e.g. for describing the non-continuous time intervals in which they resolve.

Risks

The risk associated with this change relates to services that do not enforce a correct interpretation of the distinction between NIRs and IRs. When a resource that is actually a NIR is served by returning information about the resource directly, as opposed to responding with an HTTP 301 status, the service implies that the resource is an IR, ruining the aforementioned distinction.

For instance, suppose a personnel indexing service is deployed that uses home pages as person identifiers for an example organization:

<http://example.org/~bob> a foaf:Person .

Suppose also that http://example.org/~bob, when visited, is served as HTML in a browser.

This service cannot integrate into an environment where information resources and non-information resources are held disjoint. foaf:Person is one of the typical examples of a non-information resource. The home page for Bob is an information resource.

Integration of such a data source would need to split the (generic) resource http://example.org/~bob into independent entities, likely ones that follow the UCO UUID IRI naming scheme.

<http://example.org/~bob>
    a observable:WebResource ;
    rdfs:seeAlso
        kb:Person-a3d3af3d-ea1d-47f6-bc02-ac334ded6549 ,
        kb:WebPage-1c05c378-124e-4d3c-898a-fb5a8d178cf8
        ;
    .
kb:Person-a3d3af3d-ea1d-47f6-bc02-ac334ded6549
    a identity:Person ;
    rdfs:seeAlso <http://example.org/~bob> ;
    core:name "Bob" ;
    .
kb:WebPage-1c05c378-124e-4d3c-898a-fb5a8d178cf8
    a observable:WebPage ;
    rdfs:seeAlso <http://example.org/~bob> ;
    .

Coordination