ucoProject / UCO

This repository is for development of the Unified Cyber Ontology.

UCO's Dictionary class should enforce key uniqueness #602

Open ajnelson-nist opened 1 month ago

ajnelson-nist commented 1 month ago

Background

The definition of types:Dictionary reads (emphasis added):

A dictionary is list of (term/key, value) pairs with each term/key existing no more than once.

UCO's ontology encoding does not currently test this key uniqueness. SHACL provides a mechanism, via SHACL-SPARQL, to encode the uniqueness constraint in the ontology.[^1] UCO adopted SHACL in version 0.7.0, but this definition predates that version.

With a key-uniqueness enforcement mechanism, UCO can serve a role in detecting repeated dictionary keys in data flows. E.g., UCO can assist with detecting some instances of Common Weakness Enumeration 694 (CWE-694), "Use of Multiple Resources with Duplicate Identifier," if scoping "Resource" to "Value within a key-value store represented as a dictionary data structure." (This example is from a non-exhaustive review of the CWE dictionary.)

Requirements

Requirement 1

UCO must enforce the uniqueness of dictionary keys in types:Dictionary, or some subclass of types:Dictionary.

Requirement 2

UCO must clarify whether a dictionary that repeats a key-value pair across two or more entries is considered conformant.

Requirement 3

(Added 2024-06-04.)

UCO must support an explicit validation mechanism to validate dictionary entry-keys' uniqueness, in an opt-in manner.

Requirement 4

(Added 2024-06-04.)

UCO must support a mechanism to report when a dictionary violates the expectation of entry-key uniqueness.

Requirement 5

(Added 2024-06-04.)

UCO must support the ability to report what key in a dictionary was found to be repeated, without also requiring a disclosure of the dictionary's key-values. Note that this entails being able to share an empty dictionary, similar to Issue 599.

Requirement 6

(Added 2024-06-04.)

If a dictionary has a repeated key reported, the dictionary must be reported as a dictionary violating the entry-key uniqueness expectation.

Risk / Benefit analysis

Benefits

Adding uniqueness enforcement to types:Dictionary would enable UCO's SHACL validation to catch data oddities in line with the types:Dictionary class's specification.

Risks

UCO tooling may encounter subject data for ingest into a graph in which a purported unique-key dictionary does not have unique keys. If UCO were to implement a SHACL-SPARQL query confirming key uniqueness on types:Dictionary right now, this would leave a coverage gap for subject data that is, by some specification, a dictionary, but happens not to follow key uniqueness. In some contexts the repetition itself can be significant information, so the implementation of key uniqueness may need some nuance.

One possible solution is specializing the UCO Dictionary with two subclasses: ProperDictionary and ImproperDictionary, borrowing "Proper" from "proper subset" (subset where there exists a non-member of the subset in the superset) and "proper interval" (interval of non-0 length).

Adding ProperDictionary and ImproperDictionary would carry a further risk: existing child classes of Dictionary would not automatically acquire proper-ness or improper-ness. Multi-typing would need to be used instead, e.g.:

{
    "@context": {
        "kb": "http://example.org/kb/",
        "types": "https://ontology.unifiedcyberontology.org/uco/types/"
    },
    "@id": "kb:controlled-dictionary-1",
    "@type": [
        "types:ControlledDictionary",
        "types:ProperDictionary"
    ]
}

Last, it might not always be possible to check with SHACL validation that something asserted to be a types:ImproperDictionary has a repeated key. In partial-data-sharing scenarios, other SHACL constraints in the types: namespace would require disclosure of the entire DictionaryEntry object for at least two of the key-repetitions, and this might not be universally desirable. A new owl:DatatypeProperty types:repeatsKey on types:ImproperDictionary might assist with partial-data sharing issues.

(Added 2024-06-04.)

The proposed types:repeatsKey carries an implication that it is being used on a dictionary that is a types:ImproperDictionary. If used, types:ImproperDictionary should be entailed for the sake of class-based data review mechanisms (i.e., searches in graphs for types:ImproperDictionary). types:repeatsKey should be added with an RDFS domain declaration of types:ImproperDictionary, and this domain assertion should be included in SHACL validation. (Practices to do this are already enacted in the UCO OWL review shapes.)

Competencies demonstrated

Competency 1

A configuration file, deliberately non-conformant to its specification that it provide unique keys, is fed through a content-posting ecosystem, where a security tool tests resources with a "last-read-wins" key-value parser, and the ecosystem's consumers primarily use a consumer tool with a "first-read-wins" key-value parser:

# ...
resource_name: supply_chain_file_1234
retrieval_url: http://example.org/file-1.dat
retrieval_url: http://example.org/file-2.dat
# ...
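
The parser disagreement described above can be sketched in Python. This is a minimal illustration only: `parse_pairs`, `first_read_wins` and `last_read_wins` are hypothetical helpers written for this example, not functions of UCO or of any particular security or consumer tool.

```python
# Hypothetical parsers for illustration; not part of UCO or any named tool.
CONFIG_TEXT = """\
# ...
resource_name: supply_chain_file_1234
retrieval_url: http://example.org/file-1.dat
retrieval_url: http://example.org/file-2.dat
# ...
"""


def parse_pairs(text):
    """Return (key, value) pairs in file order, skipping blanks and comments."""
    pairs = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        # Split on the first colon only, so URL values keep their own colons.
        key, _, value = line.partition(":")
        pairs.append((key.strip(), value.strip()))
    return pairs


def first_read_wins(pairs):
    """Keep the first value seen for each key."""
    result = {}
    for key, value in pairs:
        result.setdefault(key, value)
    return result


def last_read_wins(pairs):
    """Keep the last value seen for each key."""
    result = {}
    for key, value in pairs:
        result[key] = value
    return result


pairs = parse_pairs(CONFIG_TEXT)
scanned = last_read_wins(pairs)["retrieval_url"]   # what the security tool tests
fetched = first_read_wins(pairs)["retrieval_url"]  # what consumers retrieve
print(scanned, fetched)
```

Under these assumptions the security tool inspects one URL while consumers retrieve the other, which is the gap a key-uniqueness check is meant to expose.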

If transcribed into UCO (1.3.0) without checking for key-uniqueness, this would be the resulting graph:

{
    "@context": {
        "kb": "http://example.org/kb/",
        "types": "https://ontology.unifiedcyberontology.org/uco/types/"
    },
    "@id": "kb:Dictionary-7b9a4526-8a61-4d3d-a83f-f188f2a1e3e9",
    "@type": "types:Dictionary",
    "types:entry": [
        {
            "@id": "kb:DictionaryEntry-274fb580-b752-4da9-817b-03297c08b969",
            "@type": "types:DictionaryEntry",
            "types:key": "resource_name",
            "types:value": "supply_chain_file_1234"
        },
        {
            "@id": "kb:DictionaryEntry-c9ebf792-f3a8-4015-b893-da01b73c5184",
            "@type": "types:DictionaryEntry",
            "types:key": "retrieval_url",
            "types:value": "http://example.org/file-2.dat"
        },
        {
            "@id": "kb:DictionaryEntry-e40d21a0-e34c-41fb-becb-3a3a70831749",
            "@type": "types:DictionaryEntry",
            "types:key": "retrieval_url",
            "types:value": "http://example.org/file-1.dat"
        }
    ]
}

For UCO consumers that are JSON-based, and not JSON-LD-based, the DictionaryEntry order from (pseudo-)random UUIDs could affect downstream results.
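
As a rough sketch of that order dependence, assume a naive JSON consumer that collapses the entries into a native Python dict. The `flatten` helper is hypothetical, and the `@id` values are shortened copies of the ones in the graph above.

```python
import json

# Trimmed copy of the graph above; @id values are shortened for readability.
GRAPH_JSON = """
{
    "@id": "kb:Dictionary-7b9a4526",
    "@type": "types:Dictionary",
    "types:entry": [
        {"@id": "kb:DictionaryEntry-274fb580",
         "types:key": "resource_name",
         "types:value": "supply_chain_file_1234"},
        {"@id": "kb:DictionaryEntry-c9ebf792",
         "types:key": "retrieval_url",
         "types:value": "http://example.org/file-2.dat"},
        {"@id": "kb:DictionaryEntry-e40d21a0",
         "types:key": "retrieval_url",
         "types:value": "http://example.org/file-1.dat"}
    ]
}
"""


def flatten(entries):
    """Naive consumer: later entries silently overwrite earlier ones."""
    return {entry["types:key"]: entry["types:value"] for entry in entries}


entries = json.loads(GRAPH_JSON)["types:entry"]
as_serialized = flatten(entries)
reversed_order = flatten(list(reversed(entries)))
print(as_serialized["retrieval_url"], reversed_order["retrieval_url"])
```

The same graph yields a different `retrieval_url` depending only on the serialization order of the entries, which a JSON-LD (graph-based) consumer would treat as insignificant.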

Competency Question 1.1

If a tool reads this into a types:Dictionary, what would happen against the current UCO specification (1.3.0)?

Result 1.1

Per the definition in types:Dictionary, this SHOULD raise some kind of data validation error, but the responsible tester is not designated.

Competency Question 1.2

How would an ingest-to-UCO process represent that the source file had a repeated key, in the UCO graph?

Result 1.2

There is not currently a specification on how to handle this, or whether it would be appropriate to store, say, only the "proper" Dictionary keys.

If the ProperDictionary / ImproperDictionary strategy is selected for implementation, graph-populating programs could start creating Dictionary objects and specialize them after parsing source-data into ProperDictionary or ImproperDictionary as appropriate.
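
A minimal sketch of that parse-then-specialize flow, assuming the proposed class names types:ProperDictionary and types:ImproperDictionary are adopted as written in this issue. `build_dictionary` is a hypothetical helper, and the entries are left without `@id` values (blank nodes) purely to keep the example short.

```python
from collections import Counter

TYPES = "https://ontology.unifiedcyberontology.org/uco/types/"


def build_dictionary(node_id, pairs):
    """Build a JSON-LD dictionary node, specialized after all pairs are parsed."""
    key_counts = Counter(key for key, _ in pairs)
    has_repeat = any(count > 1 for count in key_counts.values())
    # The subclass names are the ones proposed in this issue, not released UCO 1.3.0 classes.
    subtype = "types:ImproperDictionary" if has_repeat else "types:ProperDictionary"
    return {
        "@context": {"kb": "http://example.org/kb/", "types": TYPES},
        "@id": node_id,
        "@type": ["types:Dictionary", subtype],
        "types:entry": [
            {"@type": "types:DictionaryEntry", "types:key": key, "types:value": value}
            for key, value in pairs
        ],
    }


proper = build_dictionary("kb:dictionary-1", [("a", "1"), ("b", "2")])
improper = build_dictionary("kb:dictionary-2", [("a", "1"), ("a", "2")])
print(proper["@type"], improper["@type"])
```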

If the types:repeatsKey property is accepted, that property could be used with the Dictionary (/ ImproperDictionary) object to record the part of the malformed data that is more likely to be acceptable to share.

(Added 2024-06-04.)

The types:repeatsKey property should only be used on types:ImproperDictionary.
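
For the partial-data-sharing case, a sketch of what a graph-populating program might emit: an ImproperDictionary that names its repeated keys but discloses no entries. `redacted_improper_dictionary` is a hypothetical helper, and the property and class names assume the proposal in this issue is accepted.

```python
TYPES = "https://ontology.unifiedcyberontology.org/uco/types/"


def redacted_improper_dictionary(node_id, repeated_keys):
    """Report repeated keys without disclosing any DictionaryEntry objects."""
    return {
        "@context": {"kb": "http://example.org/kb/", "types": TYPES},
        "@id": node_id,
        # Typed explicitly, so no entailment is needed to find the node by class.
        "@type": "types:ImproperDictionary",
        # Deduplicated and sorted so the output is stable across runs.
        "types:repeatsKey": sorted(set(repeated_keys)),
    }


node = redacted_improper_dictionary(
    "kb:dictionary-3", ["retrieval_url", "retrieval_url"]
)
print(node["types:repeatsKey"])
```

Typing the node explicitly, rather than relying on the domain of types:repeatsKey, keeps class-based searches working for consumers that do not run entailment.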

Solution suggestion

This SHACL-SPARQL constraint would test for repeated instances of keys.

[]
    a sh:SPARQLConstraint ;
    sh:message "A key in a dictionary can appear no more than once."@en ;
    sh:select """
        PREFIX types: <https://ontology.unifiedcyberontology.org/uco/types/>
        SELECT $this ?value
        WHERE {
            $this
                types:entry/types:key ?value ;
                .
        }
        GROUP BY ?value
        HAVING (COUNT(?value) > 1)
    """ ;
    .

Note: This would also reject a dictionary where a key-value pair is repeated. Requirement 2 will inform whether this should be adjusted.
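
The distinction that Requirement 2 needs to settle can be pictured with a small Python emulation of the grouping logic. This is not the SHACL-SPARQL implementation: `repeated_keys` and its `ignore_duplicate_pairs` flag are hypothetical names for this sketch.

```python
from collections import defaultdict


def repeated_keys(pairs, ignore_duplicate_pairs=False):
    """Return the sorted keys that occur more than once.

    With ignore_duplicate_pairs=True, identical key-value pairs are counted
    once, so a key is only reported when it maps to two or more distinct values.
    """
    values_by_key = defaultdict(list)
    for key, value in pairs:
        values_by_key[key].append(value)
    repeated = []
    for key, values in values_by_key.items():
        count = len(set(values)) if ignore_duplicate_pairs else len(values)
        if count > 1:
            repeated.append(key)
    return sorted(repeated)


same_pair_twice = [("a", "1"), ("a", "1")]
conflicting = [("a", "1"), ("a", "2")]

strict_same = repeated_keys(same_pair_twice)
lenient_same = repeated_keys(same_pair_twice, ignore_duplicate_pairs=True)
strict_conflict = repeated_keys(conflicting)
lenient_conflict = repeated_keys(conflicting, ignore_duplicate_pairs=True)
print(strict_same, lenient_same, strict_conflict, lenient_conflict)
```

The strict mode mirrors the constraint as written, which counts every entry; the lenient mode is one possible adjustment if repeated key-value pairs are deemed conformant.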

Where this constraint is attached depends on whether the ProperDictionary / ImproperDictionary subclasses strategy is adopted. "Backwards-compatibility" below means data that raise no SHACL sh:Violation-severity results today would only, at worst, raise sh:Warning-severity results until UCO 2.0.0.

(Added 2024-06-04.)

After discussion at the 2024-05-30 meeting, the new, disjoint dictionary subclasses and the repeatsKey property were implemented in a PR superseding the original PR.

On addition of the dictionary subclasses, the SPARQL constraint checking for repeated keys in plain types:Dictionarys ended up appearing to be more permanent than originally anticipated. There is no reason based on backwards-compatibility to remove the shape that merely warns of repeated keys. Unfortunately, there is not a way to specify in SHACL that the constraint should only run on types:Dictionarys that are not also types:ProperDictionarys (which run their own SHACL constraint of higher severity), unless OWL entailment is set as an operational requirement. General entailment requirements on users are purposefully left out of scope of this proposal, save for one special-purpose detail on repeatsKey.

It is fair to discuss whether UCO should always review all dictionaries for key uniqueness. The shape performing this review is given its own IRI, so it is possible for users to use sh:deactivated to deactivate the shape when their operations are otherwise prepared to address key repetitions (such as through some process that always assigns the proper or improper dictionary type).

repeatsKey induced Requirement 5, enabling "empty" dictionaries for partial data-sharing scenarios. It also induced a data safety review mechanism, adding an rdfs:domain declaration with an accompanying test that the domain is satisfied, whether through explicit typing (i.e. hard-coded assignment of types:ImproperDictionary), or through entailment (whether RDFS entailment or OWL entailment).

types:repeatsKey
    a owl:DatatypeProperty ;
    rdfs:label "repeatsKey"@en ;
    rdfs:comment "A key found to be repeated in multiple dictionary entries within one dictionary."@en ;
    rdfs:domain types:ImproperDictionary ;
    rdfs:range xsd:string ;
    .

types:repeatsKey-subjects-shape
    a sh:NodeShape ;
    sh:class types:ImproperDictionary ;
    sh:targetSubjectsOf types:repeatsKey ;
    .

This is a deviation from UCO generally avoiding usage of rdfs:domain. repeatsKey is offered as a property with sufficient gravity that its presence should ensure review mechanisms for handling improper dictionaries are triggered.
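
What the subjects shape above checks can be approximated in Python over JSON-LD-style nodes. This sketch looks at explicit typing only and performs no RDFS or OWL entailment; `repeats_key_violations` is a hypothetical helper, not part of UCO tooling.

```python
def repeats_key_violations(nodes):
    """Return the @id of each node using types:repeatsKey without the ImproperDictionary type."""
    violations = []
    for node in nodes:
        if "types:repeatsKey" not in node:
            continue
        node_types = node.get("@type", [])
        # JSON-LD permits @type to be a single string or a list of strings.
        if isinstance(node_types, str):
            node_types = [node_types]
        if "types:ImproperDictionary" not in node_types:
            violations.append(node["@id"])
    return violations


nodes = [
    {"@id": "kb:d1", "@type": "types:ImproperDictionary", "types:repeatsKey": "k"},
    {"@id": "kb:d2", "@type": ["types:Dictionary"], "types:repeatsKey": "k"},
    {"@id": "kb:d3", "@type": "types:Dictionary"},
]
violations = repeats_key_violations(nodes)
print(violations)
```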

Coordination