openspending / fiscal-data-package

MOVED TO https://github.com/frictionlessdata/specs/issues?q=is%3Aopen+is%3Aissue+label%3A%22Fiscal+Data+Package%22

How to indicate one attribute is a label (for another attribute) #99

Closed danfowler closed 8 years ago

danfowler commented 8 years ago

An ongoing point of concern for implementers is that it is unclear how to specify a label for a given element of a dimension. For example, given a set of columns in the source data...

amount  year  level_1_code  level_1_title
1000    2014  01            General Public Services

...it is not made explicit in the spec how level_1_title relates to level_1_code. Only one of these is required to reference the given concept ("level_1") for the purposes of, say, aggregation. However, the spec does not say how client code should use the information present in the data to generate labels automatically, for example in a visualization.

Relevant discussion:

pwalsh commented 8 years ago

I've presented two related arguments / solutions for this problem in https://github.com/openspending/fiscal-data-package/issues/96

# data.csv
YEAR,ECON4,ECON4_NAME,ECON5,ECON5_NAME,PAID
2006,100,PERSONAL SERVICES,110,BASIC COMPENSATION,67532
2006,100,PERSONAL SERVICES,120,TEMPORARY COMPENSATION,309

Dimensions as attribute groups

Knowing that something is a label for something else strongly implies that each of these things is a property of a greater thing that contains them. When we acknowledge this, we say that one property represents the thing (the identifier), and then other properties further describe the thing.

"attributes": [
  { "econ4": { "source": "ECON4" }, "econ4_title": { "source": "ECON4_NAME" }, "parent": null },
  { "econ5": { "source": "ECON5" }, "econ5_title": { "source": "ECON5_NAME" }, "parent": 100 }
]

Standardise a minimal number of attribute names

UIs and databases can provide additional features when they can expect a series of related objects to follow some naming conventions for their properties. We do not need to go overboard here, but providing the ability to call something the identifier for an object, and something the title of the object (or label, if you prefer), opens up additional possibilities at no tangible cost, that I can see, to publishers using the spec.

"attributes": [
  { "code": { "source": "ECON4" }, "title": { "source": "ECON4_NAME" }, "parent": null },
  { "code": { "source": "ECON5" }, "title": { "source": "ECON5_NAME" }, "parent": 100 }
]

In the above example, we've said that code is a special thing that identifies the object, and title is another thing that can be used to label the object for human facing interfaces. We could call code id, but I do not want to confuse this suggestion with the general problem of unique identifiers for objects in Fiscal Data Package.
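To illustrate what standardised names could buy a client, here is a minimal Python sketch. It assumes the mapping above has been loaded as a list of dicts (the "parent" key is left out for brevity) and uses rows modelled on the sample data.csv. The `build_labels` helper is hypothetical and not part of any Fiscal Data Package tooling.

```python
import csv
import io

# Rows modelled on the sample data.csv from earlier in this thread.
DATA = """YEAR,ECON4,ECON4_NAME,ECON5,ECON5_NAME,PAID
2006,100,PERSONAL SERVICES,110,BASIC COMPENSATION,67532
2006,100,PERSONAL SERVICES,120,TEMPORARY COMPENSATION,309
"""

# Assumed in-memory form of the mapping, using the standardised
# "code" / "title" names proposed above.
ATTRIBUTE_GROUPS = [
    {"code": {"source": "ECON4"}, "title": {"source": "ECON4_NAME"}},
    {"code": {"source": "ECON5"}, "title": {"source": "ECON5_NAME"}},
]


def build_labels(rows, groups):
    """Return {code column: {code value: title value}} for every group."""
    labels = {}
    for group in groups:
        code_col = group["code"]["source"]
        title_col = group["title"]["source"]
        lookup = labels.setdefault(code_col, {})
        for row in rows:
            lookup[row[code_col]] = row[title_col]
    return labels


rows = list(csv.DictReader(io.StringIO(DATA)))
labels = build_labels(rows, ATTRIBUTE_GROUPS)
print(labels["ECON4"]["100"])
```

Because every group exposes the same two names, the client never has to guess which column is the human-readable one.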

rufuspollock commented 8 years ago

@pwalsh I know, and I'm strongly suggesting going this super simple route rather than inventing "subdimensions" / compound attributes.

In terms of naming I'm +1 on some standardization but I'm not sure it will always work e.g. in the hierarchical case.

I suggest labelfor or similar plus parent in #96 will solve hierarchy issues as we have them now plus labelfor may be valuable in its own right.
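As a rough sketch of that idea, the attributes could stay flat and carry only optional pointers to one another. Everything here is an assumption made for illustration: the exact key spelling `labelfor`, the choice to have `parent` name another attribute rather than a value, and the `label_source` and `ancestors` helpers.

```python
# Hypothetical flat attribute map: no compound attributes, just two optional
# pointers ("labelfor" and "parent") between otherwise ordinary attributes.
ATTRIBUTES = {
    "econ4": {"source": "ECON4"},
    "econ4_title": {"source": "ECON4_NAME", "labelfor": "econ4"},
    "econ5": {"source": "ECON5", "parent": "econ4"},
    "econ5_title": {"source": "ECON5_NAME", "labelfor": "econ5"},
}


def label_source(attributes, name):
    """Return the source column that labels `name`, or None if there is none."""
    for attr in attributes.values():
        if attr.get("labelfor") == name:
            return attr["source"]
    return None


def ancestors(attributes, name):
    """Follow "parent" pointers upwards, nearest ancestor first."""
    chain = []
    parent = attributes[name].get("parent")
    while parent is not None:
        chain.append(parent)
        parent = attributes[parent].get("parent")
    return chain


print(label_source(ATTRIBUTES, "econ5"))
print(ancestors(ATTRIBUTES, "econ5"))
```

A client that ignores both pointers still sees four plain attributes, which is the appeal of keeping the structure this simple.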

Re labels, I would note that I think we are getting ourselves a bit confused in #96 by the fact that we are dealing with a denormalized table. The normalized classification table looks like:

ID        Label  Parent
01.02     XYZ    01
01.02.03  ABC    01.02

Here there is no problem working out what label is. Moreover, here we just have the normal "row" structure of relational databases where we get that given columns relate to other columns (though labelfor could still be useful).
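For illustration, denormalized rows like those in the data.csv sample can be folded into that ID / Label / Parent shape. This is only a sketch: the `LEVELS` list and the `normalize` function are hypothetical, and it assumes the code and label columns of each level are already known, ordered from broadest to narrowest.

```python
import csv
import io

# Rows modelled on the sample data.csv from earlier in this thread.
DATA = """YEAR,ECON4,ECON4_NAME,ECON5,ECON5_NAME,PAID
2006,100,PERSONAL SERVICES,110,BASIC COMPENSATION,67532
2006,100,PERSONAL SERVICES,120,TEMPORARY COMPENSATION,309
"""

# (code column, label column) per level, broadest first.
# Assumed to be supplied by the publisher, not inferred from the data.
LEVELS = [("ECON4", "ECON4_NAME"), ("ECON5", "ECON5_NAME")]


def normalize(rows, levels):
    """Return {id: (label, parent_id)} built from denormalized rows."""
    table = {}
    for row in rows:
        parent = None
        for code_col, label_col in levels:
            table[row[code_col]] = (row[label_col], parent)
            parent = row[code_col]  # the next level hangs off this one
    return table


rows = list(csv.DictReader(io.StringIO(DATA)))
table = normalize(rows, LEVELS)
for ident, (label, parent) in table.items():
    print(ident, label, parent)
```

Note that the function still has to be told which column holds the label for each level, so the normalized shape does not by itself remove the need to declare that relationship.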

I think we should be cautious about using the "denormalized" special case of a classification table to start introducing "subdimensions". If we really want a separate subdimension, I'd recommend making it a separate table, as per the snowflake schema in OLAP.

pwalsh commented 8 years ago

@rgrp in terms of a possible derived database - yes, I'd certainly be aiming for a snowflake schema or similar in any event.

But in terms of representing the raw data in FDP, we can't escape the denorm thing as you know because most published spend data is basically a flattened representation of something that was relational in some database somewhere.

Still, I don't see how it matters: in the relational example above, you also need to be able to "say" that the column called "Label" is the label.

rufuspollock commented 8 years ago

FIXED in https://github.com/openspending/fiscal-data-package/commit/5ea66d251885741e52f9f7e43b472cbcb1e54d52