Open yousefamar opened 7 years ago
Who will do this labelling? I'm not sure I can see how these measures will function if it isn't the system that applies and maintains them. Having some means to use such risk measures to make access-control decisions seems sensible, though.
This is dependent on documenting required/optional Hypercat rels (#8, #102) and implementing the embedding of schema into manifests/catalogues (#7).
Proposed Enhancement
In addition to knowing the data schema (e.g. the hierarchy and primitive types of JSON objects, or the primitive types of CSV columns, or however else it's expressed), it would be especially useful to attach arbitrary metadata to individual columns. This metadata can be expressed straightforwardly as Hypercat rels with our existing system.
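Purely as a sketch (not a proposal for the final vocabulary), per-column metadata could sit alongside a catalogue item's existing item-metadata rels. The `urn:X-databox:rels:column:*` URIs below are invented for illustration:

```python
# Sketch only: the per-column rel URIs are hypothetical, not part of any
# existing Hypercat/Databox vocabulary; they just show the shape of the idea.
catalogue_item = {
    "href": "https://example-store/cats/sensor-1",
    "item-metadata": [
        {"rel": "urn:X-hypercat:rels:hasDescription:en", "val": "Step-count time series"},
        # Hypothetical "<column>:<property>" encoding for per-column metadata
        {"rel": "urn:X-databox:rels:column:timestamp:type",  "val": "iso8601"},
        {"rel": "urn:X-databox:rels:column:timestamp:class", "val": "quasi-identifier"},
        {"rel": "urn:X-databox:rels:column:steps:type",      "val": "integer"},
        {"rel": "urn:X-databox:rels:column:steps:class",     "val": "sensitive"},
    ],
}
```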
Use Case
One use case I have in mind is using this metadata as input to information-theoretic risk metrics. Specifically, using the language of the literature [1][3][5], labelling a column as one of:
- identifier
- quasi-identifier
- sensitive attribute
- non-sensitive attribute
Of course in theory the metadata could be anything, including subjective context hints.
Benefits
This would give us enough context to implement a range of content-independent privacy/risk measures, as well as risk-reduction components (as standalone apps, or within stores). The most basic of these is k-anonymity [5], along with its extensions that account for specific types of attacks, l-diversity [6] and t-closeness [7]. These are well established and already covered by existing privacy risk analysis tools such as the open-source ARX, but there are potentially plenty more we could consider [1].
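For concreteness, here is a minimal sketch of the first two measures over rows of labelled columns; the column names and data are invented for illustration and this isn't meant as production code:

```python
from collections import Counter, defaultdict

def k_anonymity(rows, quasi_identifiers):
    """Smallest equivalence-class size over the quasi-identifier columns."""
    classes = Counter(tuple(row[c] for c in quasi_identifiers) for row in rows)
    return min(classes.values()) if classes else 0

def l_diversity(rows, quasi_identifiers, sensitive):
    """Smallest number of distinct sensitive values in any equivalence class."""
    classes = defaultdict(set)
    for row in rows:
        classes[tuple(row[c] for c in quasi_identifiers)].add(row[sensitive])
    return min(len(v) for v in classes.values()) if classes else 0

rows = [
    {"age_band": "20-29", "postcode": "N1", "condition": "flu"},
    {"age_band": "20-29", "postcode": "N1", "condition": "asthma"},
    {"age_band": "30-39", "postcode": "E2", "condition": "flu"},
    {"age_band": "30-39", "postcode": "E2", "condition": "flu"},
]
print(k_anonymity(rows, ["age_band", "postcode"]))               # 2
print(l_diversity(rows, ["age_band", "postcode"], "condition"))  # 1
```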
This can be combined with our access control model to say, e.g., "this token only grants access to data that is at least 3-diverse", and provided the store (or whatever "middlebox" component) keeps track of the risk stats of its data, it can automatically block access if the data becomes too risky.
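A rough sketch of what that check might look like, assuming the store (or middlebox) already maintains the relevant risk stats; the token and stats field names here are hypothetical:

```python
def allow_access(token, store_stats):
    """Grant access only if the store's current risk stats satisfy the
    thresholds embedded in the token (field names are hypothetical)."""
    return (store_stats["k_anonymity"] >= token.get("min_k", 1)
            and store_stats["l_diversity"] >= token.get("min_l", 1))

token = {"min_k": 5, "min_l": 3}                   # "at least 3-diverse data"
store_stats = {"k_anonymity": 7, "l_diversity": 2}
print(allow_access(token, store_stats))            # False: data is only 2-diverse
```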
Similarly, somebody could then, e.g., make an app that only does some level of masking and can accept data from any store, outputting a less risky equivalent with otherwise the same base metadata. Apps that consume this data could then automatically ask only for the output store with the optimal level of detail/risk they require, with the acceptable risk level embedded in tokens as an extra precaution.
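A toy sketch of such a masking step, with invented column names and generalisation rules, to illustrate emitting a less risky equivalent with the same columns:

```python
def generalise(rows, generalisers):
    """Apply per-column generalisation functions (e.g. truncate a postcode,
    round a count) and return a less risky copy with the same columns."""
    identity = lambda v: v
    return [{col: generalisers.get(col, identity)(val) for col, val in row.items()}
            for row in rows]

rows = [{"postcode": "N1 7AA", "steps": 9120},
        {"postcode": "N1 8BB", "steps": 10431}]
# Keep only the outward postcode and round step counts to the nearest 1000.
masked = generalise(rows, {"postcode": lambda p: p.split()[0],
                           "steps":    lambda s: round(s, -3)})
print(masked)  # [{'postcode': 'N1', 'steps': 9000}, {'postcode': 'N1', 'steps': 10000}]
```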
Continuous data is effectively discretised by, e.g., a k-anonymisation step, so other metrics that operate on distributions (mutual information, KL-divergence, surprisal) no longer need the data split into bins for their results to be useful, and the results will be more consistent (since you no longer need to decide how many bins to use).
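For example, a small sketch of surprisal and KL-divergence computed directly over already-generalised (hence discrete) values, with no separate binning step; the data is invented for illustration:

```python
from collections import Counter
from math import log2

def distribution(values):
    """Empirical distribution over discrete values."""
    counts = Counter(values)
    total = sum(counts.values())
    return {v: c / total for v, c in counts.items()}

def surprisal(value, dist):
    """Self-information (bits) of observing `value` under `dist`."""
    return -log2(dist[value])

def kl_divergence(p, q):
    """KL divergence D(p || q) in bits; assumes q covers p's support."""
    return sum(p[v] * log2(p[v] / q[v]) for v in p)

# Already-generalised age bands need no extra binning step.
released = ["20-29", "20-29", "30-39", "30-39", "30-39", "40-49"]
baseline = ["20-29", "30-39", "40-49", "20-29", "30-39", "40-49"]
p, q = distribution(released), distribution(baseline)
print(surprisal("40-49", p))   # rarer value -> higher surprisal
print(kl_divergence(p, q))     # how far the release drifts from the baseline
```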
NB: These particular metrics are normally used for datasets where one row corresponds to one individual, but the same concepts translate directly to granularity in IoT time-series data about a single individual. Further information-theoretic metrics (mutual information, surprisal) can additionally be used to account for other risks (see some experimental Python scripts in my weekly reports that do this).
I'd appreciate any thoughts or comments.