The definition of types:Dictionary reads (emphasis added):
A dictionary is list of (term/key, value) pairs with each term/key existing no more than once.
UCO does not currently test this key-uniqueness within the encoding in the ontology. SHACL provides a mechanism, via SHACL-SPARQL, to encode this uniqueness constraint in the ontology.[^1] UCO adopted SHACL in version 0.7.0, but this definition predates version 0.7.0.
With a key-uniqueness enforcement mechanism, UCO can serve a role in detecting repeated dictionary keys in data flows. E.g., UCO can assist with detecting some instances of Common Weakness Enumeration 694 (CWE-694), "Use of Multiple Resources with Duplicate Identifier," if scoping "Resource" to "Value within a key-value store represented as a dictionary data structure." (This example is from a non-exhaustive review of the CWE dictionary.)
Requirements
Requirement 1
UCO must enforce the uniqueness of dictionary keys in types:Dictionary, or some subclass of types:Dictionary.
Requirement 2
UCO must clarify whether a dictionary that repeats a key-value pair across two or more entries is considered conformant.
Requirement 3
(Added 2024-06-04.)
UCO must support an explicit validation mechanism to validate dictionary entry-keys' uniqueness, in an opt-in manner.
Requirement 4
(Added 2024-06-04.)
UCO must support a mechanism to report when a dictionary violates the expectation of entry-key uniqueness.
Requirement 5
(Added 2024-06-04.)
UCO must support the ability to report what key in a dictionary was found to be repeated, without also requiring a disclosure of the dictionary's key-values. Note that this entails being able to share an empty dictionary, similar to Issue 599.
Requirement 6
(Added 2024-06-04.)
If a dictionary has a repeated key reported, the dictionary must be reported as a dictionary violating the entry-key uniqueness expectation.
Risk / Benefit analysis
Benefits
Adding uniqueness enforcement to types:Dictionary would enable UCO's SHACL validation to catch data oddities in line with the types:Dictionary class's specification.
Risks
It is possible that UCO tooling will encounter subject data for ingest into a graph where a purportedly unique-key dictionary does not have unique keys. If UCO were to implement a SHACL-SPARQL query confirming key uniqueness right now on types:Dictionary, this would leave a coverage gap for subject data that, by some specification, is a dictionary, but happens not to follow key uniqueness. In some contexts this can be significant information, so the implementation may need some nuance around key-uniqueness.
One possible solution is specializing the UCO Dictionary with two subclasses: ProperDictionary and ImproperDictionary, borrowing "Proper" from "proper subset" (subset where there exists a non-member of the subset in the superset) and "proper interval" (interval of non-0 length).
A proper dictionary would be known to, and/or required to, have unique keys;
an improper dictionary would be known to have some repeated key;
a types:Dictionary not further subclassed would be left to the UCO consumer to ultimately test.
Adding ProperDictionary and ImproperDictionary would carry a further risk, that existing child classes of Dictionary would not necessarily automatically acquire proper-ness or improper-ness. Multi-typing would need to be used instead, e.g.:
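A minimal sketch of such multi-typing follows. It assumes a hypothetical pre-existing subclass, ex:ConfigurationDictionary, and illustrative ex: and kb: namespaces. None of these names come from UCO.

```turtle
@prefix ex: <http://example.org/ontology/> .
@prefix kb: <http://example.org/kb/> .
@prefix owl: <http://www.w3.org/2002/07/owl#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix types: <https://ontology.unifiedcyberontology.org/uco/types/> .

# Hypothetical child class of types:Dictionary (an assumption for illustration).
ex:ConfigurationDictionary
    a owl:Class ;
    rdfs:subClassOf types:Dictionary ;
    .

# The node carries both the child class and the proposed subclass,
# because the child class does not imply proper-ness on its own.
kb:dictionary-1
    a
        ex:ConfigurationDictionary ,
        types:ProperDictionary
        ;
    .
```

Under this approach the child class definition stays untouched, and the proper or improper assertion is made per dictionary instance.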
Last, it might not always be possible to check with SHACL validation that something asserted to be a types:ImproperDictionary has a repeated key. In partial-data-sharing scenarios, other SHACL constraints in the types: namespace would require disclosure of the entire DictionaryEntry object for at least two of the key-repetitions, and this might not be universally desirable.
A new owl:DatatypeProperty, types:repeatsKey, on types:ImproperDictionary might assist with partial-data-sharing issues.
(Added 2024-06-04.)
The proposed types:repeatsKey carries an implication that it is being used on a dictionary that is a types:ImproperDictionary. If used, types:ImproperDictionary should be entailed for the sake of class-based data review mechanisms (i.e., searches in graphs for types:ImproperDictionary). types:repeatsKey should be added with an RDFS domain declaration of types:ImproperDictionary, and this domain assertion should be included in SHACL validation. (Practices to do this are already enacted in the UCO OWL review shapes.)
Competencies demonstrated
Competency 1
A configuration file, deliberately non-conformant to its specification that it provide unique keys, is fed through a content-posting ecosystem, where a security tool tests resources with a "last-read-wins" key-value parser, and the ecosystem's consumers primarily use a consumer tool with a "first-read-wins" key-value parser:
If transcribed into UCO (1.3.0) without checking for key-uniqueness, this would be the resulting graph:
For UCO consumers that are JSON-based, and not JSON-LD-based, the DictionaryEntry order from (pseudo-)random UUIDs could affect downstream results.
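To make the scenario concrete, the sketch below shows what a transcription with a repeated key could look like in Turtle. The key name, the two values, and the kb: namespace are invented for illustration and are not taken from any real configuration file.

```turtle
@prefix kb: <http://example.org/kb/> .
@prefix types: <https://ontology.unifiedcyberontology.org/uco/types/> .

kb:dictionary-2
    a types:Dictionary ;
    types:entry
        kb:dictionary-entry-1 ,
        kb:dictionary-entry-2
        ;
    .

# Both entries use the same key with different values (invented data).
kb:dictionary-entry-1
    a types:DictionaryEntry ;
    types:key "update_url" ;
    types:value "https://example.org/updates" ;
    .

kb:dictionary-entry-2
    a types:DictionaryEntry ;
    types:key "update_url" ;
    types:value "https://example.net/updates" ;
    .
```

An RDF graph carries no ordering between the two entries, so a "first-read-wins" consumer and a "last-read-wins" consumer of a serialization of this graph could each settle on a different value.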
Competency Question 1.1
If a tool reads this into a types:Dictionary, what would happen against the current UCO specification (1.3.0)?
Result 1.1
Per the definition in types:Dictionary, this SHOULD raise some kind of data validation error, but the responsible tester is not designated.
Competency Question 1.2
How would an ingest-to-UCO process represent that the source file had a repeated key, in the UCO graph?
Result 1.2
There is not currently a specification on how to handle this, or whether it would be appropriate to store, say, only the "proper" Dictionary keys.
If the ProperDictionary / ImproperDictionary strategy is selected for implementation, graph-populating programs could start creating Dictionary objects and specialize them after parsing source-data into ProperDictionary or ImproperDictionary as appropriate.
If the types:repeatsKey property is accepted, that property could be used with the Dictionary (/ ImproperDictionary) object to record the part of the malformed data that is more likely to be acceptable to share.
(Added 2024-06-04.)
The types:repeatsKey property should only be used on types:ImproperDictionary.
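As a sketch of the partial-data-sharing case, a dictionary could be shared with no entries at all, recording only which key was repeated. The kb: namespace and the key string are assumptions for illustration.

```turtle
@prefix kb: <http://example.org/kb/> .
@prefix types: <https://ontology.unifiedcyberontology.org/uco/types/> .

# No types:entry statements are disclosed. Only the repeated key is reported.
kb:dictionary-3
    a types:ImproperDictionary ;
    types:repeatsKey "update_url" ;
    .
```

Typing the node explicitly as types:ImproperDictionary means a consumer searching by class finds it without needing RDFS or OWL entailment.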
Solution suggestion
This SHACL-SPARQL constraint would test for repeated instances of keys.
[]
    a sh:SPARQLConstraint ;
    sh:message "A key in a dictionary can appear no more than once."@en ;
    sh:select """
        PREFIX types: <https://ontology.unifiedcyberontology.org/uco/types/>
        SELECT $this ?value
        WHERE {
            $this
                types:entry/types:key ?value ;
                .
        }
        GROUP BY ?value
        HAVING (COUNT(?value) > 1)
    """ ;
    .
Note: This would also reject a dictionary where a key-value pair is repeated. Requirement 2 will inform whether this should be adjusted.
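If Requirement 2 were resolved so that an exactly repeated key-value pair is conformant, one possible adjustment is to count distinct values per key rather than occurrences of the key. This is only a sketch of that option, not a decided design. The variable name ?entryValue and the message text are assumptions.

```turtle
@prefix sh: <http://www.w3.org/ns/shacl#> .

[]
    a sh:SPARQLConstraint ;
    sh:message "A key in a dictionary can map to no more than one distinct value."@en ;
    sh:select """
        PREFIX types: <https://ontology.unifiedcyberontology.org/uco/types/>
        # ?value carries the offending key, so it is reported as the result value.
        SELECT $this ?value
        WHERE {
            $this types:entry ?entry .
            ?entry
                types:key ?value ;
                types:value ?entryValue ;
                .
        }
        # $this is listed so every projected variable is a grouping key.
        GROUP BY $this ?value
        HAVING (COUNT(DISTINCT ?entryValue) > 1)
    """ ;
    .
```

With this variant, two entries that both pair the same key with the same value would pass, while the same key paired with two different values would still be reported.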
Where this constraint is attached depends on whether the ProperDictionary / ImproperDictionary subclasses strategy is adopted. "Backwards-compatibility" below means data that raise no SHACL sh:Violation-severity results today would only, at worst, raise sh:Warning-severity results until UCO 2.0.0.
If not adopting the new subclasses: For conformance checking of UCO's current definition, the SPARQL constraint would be added to types:Dictionary.
For backwards-compatibility matters, the constraint would raise sh:Warning-severity validation results for UCO < 2.0.0, and sh:Violation-severity results for UCO 2.0.0.
The (English) definition text for types:Dictionary would not change.
If the new subclasses are adopted, the constraint would go onto types:ProperDictionary.
The (English) definition text of types:Dictionary would need to change to suggest use of the subclasses in order to confirm conformance with key-uniqueness.
Again for backwards-compatibility matters, some thought needs to be given on whether the SPARQL constraint should be repeated in types:Dictionary with a sh:Warning severity, in order to alert about data that is contrary to the definition text's set expectations.
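A sketch of how a warning-severity copy of the constraint could be attached to types:Dictionary follows. The shape IRI in the ex: namespace is a placeholder rather than a UCO name, and the query is the one proposed in this section.

```turtle
@prefix ex: <http://example.org/shapes/> .
@prefix sh: <http://www.w3.org/ns/shacl#> .
@prefix types: <https://ontology.unifiedcyberontology.org/uco/types/> .

# Placeholder shape IRI (assumption). sh:severity lowers results to warnings.
ex:Dictionary-repeated-keys-shape
    a sh:NodeShape ;
    sh:severity sh:Warning ;
    sh:sparql [
        a sh:SPARQLConstraint ;
        sh:message "A key in a dictionary can appear no more than once."@en ;
        sh:select """
            PREFIX types: <https://ontology.unifiedcyberontology.org/uco/types/>
            SELECT $this ?value
            WHERE {
                $this
                    types:entry/types:key ?value ;
                    .
            }
            GROUP BY ?value
            HAVING (COUNT(?value) > 1)
        """ ;
    ] ;
    sh:targetClass types:Dictionary ;
    .
```

Because sh:targetClass also selects instances of subclasses when the subclass statements are visible to the validator, a node typed types:ProperDictionary could receive both this warning and the stricter result from its own shape.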
(Added 2024-06-04.)
After discussion from the 2024-05-30 meeting, the new, disjoint dictionary subclasses and the repeatsKey property were implemented in a PR superseding the original PR.
On addition of the dictionary subclasses, the SPARQL constraint checking for repeated keys in plain types:Dictionarys ended up appearing to be more permanent than originally anticipated. There is no reason based on backwards-compatibility to remove the shape that merely warns of repeated keys. Unfortunately, there is not a way to specify in SHACL that the constraint should only run on types:Dictionarys that are not also types:ProperDictionarys (which runs its own SHACL constraint of higher severity), unless OWL entailment is set as an operational requirement. General entailment requirements on users are purposefully left out of scope of this proposal, save for one special-purpose detail on repeatsKey.
It is fair to discuss whether UCO should always review all dictionaries for key uniqueness. The shape performing this review is given its own IRI, so it is possible for users to use sh:deactivated to deactivate the shape when their operations are otherwise prepared to address key repetitions (such as through some process that always assigns the proper or improper dictionary type).
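Deactivation could look like the following sketch. The shape IRI shown is a placeholder, since the IRI assigned by UCO is not stated here. The statement has to be part of the shapes graph that the validator actually loads.

```turtle
@prefix ex: <http://example.org/shapes/> .
@prefix sh: <http://www.w3.org/ns/shacl#> .

# ex:Dictionary-repeated-keys-shape stands in for the IRI of the shape
# that warns about repeated keys (placeholder, an assumption).
ex:Dictionary-repeated-keys-shape
    sh:deactivated true ;
    .
```

A deactivated shape produces no validation results, so this suits users whose pipelines already assign the proper or improper dictionary type to every dictionary.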
repeatsKey induced Requirement 5, enabling "empty" dictionaries for partial data-sharing scenarios. It also induced a data safety review mechanism, adding an rdfs:domain declaration with an accompanying test that the domain is satisfied, whether through explicit typing (i.e. hard-coded assignment of types:ImproperDictionary), or through entailment (whether RDFS entailment or OWL entailment).
types:repeatsKey
    a owl:DatatypeProperty ;
    rdfs:label "repeatsKey"@en ;
    rdfs:comment "A key found to be repeated in multiple dictionary entries within one dictionary."@en ;
    rdfs:domain types:ImproperDictionary ;
    rdfs:range xsd:string ;
    .

types:repeatsKey-subjects-shape
    a sh:NodeShape ;
    sh:class types:ImproperDictionary ;
    sh:targetSubjectsOf types:repeatsKey ;
    .
This is a deviation from UCO generally avoiding usage of rdfs:domain. repeatsKey is offered as a property with sufficient gravity that its presence should ensure review mechanisms for handling improper dictionaries are triggered.
Coordination