me-box / databox

Databox container manager and dashboard server
MIT License

Proposal: "Column" metadata (longer term) #135

Open yousefamar opened 7 years ago

yousefamar commented 7 years ago

This is dependent on documenting required/optional Hypercat rels (#8, #102) and implementing the embedding of schema into manifests/catalogues (#7).

Proposed Enhancement

In addition to knowing the data schema (e.g. the hierarchy and primitive types of JSON objects, or the primitive types of CSV columns), it would be especially useful to attach arbitrary metadata to individual columns. This metadata can easily be expressed as Hypercat rels with our existing system.
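As a rough illustration of how per-column metadata might be carried as rels, here is a minimal Python sketch. It assumes the Hypercat 3 style layout of an item (`href` plus an `item-metadata` list of `rel`/`val` pairs). The `urn:X-databox:rels:columnLabel` rel, the `column_rels` and `make_item` helpers, the example URL, and the label values are all hypothetical placeholders, not existing Databox or Hypercat names.

```python
import json

# Hypothetical rel prefix for column-level labels; not defined by Hypercat or Databox.
COLUMN_LABEL_REL = "urn:X-databox:rels:columnLabel"


def column_rels(labels):
    """Turn a {column name: label} mapping into Hypercat-style rel/val pairs."""
    return [
        {"rel": f"{COLUMN_LABEL_REL}:{column}", "val": label}
        for column, label in sorted(labels.items())
    ]


def make_item(href, description, labels):
    """Build a catalogue item whose metadata includes one rel per labelled column."""
    return {
        "href": href,
        "item-metadata": [
            {"rel": "urn:X-hypercat:rels:hasDescription:en", "val": description},
            *column_rels(labels),
        ],
    }


item = make_item(
    "https://example.invalid/datasource/1",  # placeholder URL
    "Example time-series data source",
    {"timestamp": "example-label-a", "reading": "example-label-b"},  # placeholder labels
)
print(json.dumps(item, indent=2))
```

Encoding the column name in the rel keeps each label a flat rel/val pair, so a consumer that does not know about the labels can simply ignore them. Whether the column name belongs in the rel or in the value is a design choice this sketch does not settle.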

Use Case

One use case I have in mind is using this metadata as input to information-theoretic risk metrics. Specifically, using the language in the literature [1, 3, 5], labelling a column as one of:

Of course in theory the metadata could be anything, including subjective context hints.

Benefits

NB: These particular metrics are normally used for datasets where one row corresponds to one individual, but the same concepts translate directly to granularity in IoT time-series data of a single individual. Further information-theoretic metrics (mutual entropy, surprisal) can additionally be used to account for other risks (see some experimental Python scripts in my weekly reports that do this).
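To make the surprisal idea concrete, here is a minimal sketch (not the experimental scripts mentioned above) that computes the surprisal of each distinct value in a single column, along with the column's Shannon entropy, from empirical frequencies. The function names and the toy column are assumptions chosen for illustration.

```python
import math
from collections import Counter


def surprisal(values):
    """Map each distinct value to its surprisal, -log2(p), in bits."""
    counts = Counter(values)
    total = len(values)
    return {value: -math.log2(count / total) for value, count in counts.items()}


def entropy(values):
    """Shannon entropy of the column in bits (the expected surprisal)."""
    counts = Counter(values)
    total = len(values)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


# Toy column: "a" is common, "b" and "c" are rare and so more revealing.
column = ["a", "a", "b", "c"]
print(surprisal(column))
print(entropy(column))
```

Rare values carry higher surprisal, which is one way a single reading in a time series could be flagged as more identifying than the rest.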

I'd appreciate any thoughts or comments.

mor1 commented 6 years ago

Who will do this labelling? Not sure I can see how they will function if it isn't the system that applies and maintains the measures. Having some means to use such risk-measures to make access-control decisions seems sensible though.