Closed pwalsh closed 8 years ago
I also just ran into this, so thanks for bringing it up :)
For the sake of concreteness, an example (one line of header, one line of data):
Portfolio,Department/Agency,Outcome,Program,Expense type,Appropriation type,Description,2014-15,2015-16,2016-17,2017-18,2018-19,Source document,Source table,URL
Treasury Portfolio,Australian Accounting Standards Board (AASB),Outcome 1: The formulation and making of accounting standards that are used by Australian entities to prepare financial reports and enable users of these reports to make informed decisions,Programme 1.1: Australian Accounting Standards Board,Departmental expenses,Departmental appropriation & Expenses not requiring appropriation in the budget year,,4706,4631,4528,4542,4555,Australian Accounting Standards Board (AASB) 2015-16 PBS,Table 2.1,https://data.gov.au/dataset/budget-2015-16-tables-and-data/resource/91bfbe9f-714b-4a7b-8579-9cb2aec85209
The first four columns are organisational structure: Portfolio,Department/Agency,Outcome,Program
I think dimensions will be most useful if we are somewhat prescriptive about different types of dimensions. Two very common types of hierarchy (ok, I'm guessing) would be "organisational structure" (departments, programmes, projects) and function (eg, COFOG). Combining that view, your suggestion to remove dimensionType
and the need to express hierarchy, we'd get:
"dimensions": {
"organisation": {
"fields": {
"Portfolio": {
"source": "Portfolio",
"level": 1
},
"Department/Agency": {
"source": "Department/Agency",
"level": 2
},
"Outcome": {
"source": "Outcome",
"level": 3
},
"Programme": {
"source": "Program",
"level": 4
}
}
},
"function": {
"fields": {
"Expense type": {
"source": "Expense type",
"level": 1
},
"Appropriation type": {
"source": "Appropriation type",
"level": 2
},
"Description": {
"source": "Description",
"level": 3
}
}
}
}
Maybe there's a better way to express the hierarchy than the level
attribute - but of course JSON dictionaries are unordered.
Also noting in passing that the Fiscal spec isn't very clear about what it means for a dimension to be made up of more than one field. Is it because there needs to be a primary key, and that key might be made up of multiple fields? With this dataset, I think the primary key might be the first seven fields actually. :/
@stevage @pwalsh I think we are right on needing some kind of level attribute to indicate hierarachy and think using level
in this way seems the obvious way to start.
@stevage dimensions can definitely have multiple fields. Re the PK issue that is common with these datasets as they don't generate a primary key. I'm actually think that in these circumstances one would want to generate a PK (the PK stuff is more essential when you already had normalised source data - if the data is denormalized linking the "dimension" with the fact table is trivial). Given this point is a bit subtle could you open a separate issue re PKs for dimensions and the fact that they may not be a single field ...
Why not an array of hashes, so level is explicit in the sequence, as per the example I linked to: https://discuss.okfn.org/t/open-spending-data-structure-ideas-and-suggestions/300/20
@pwalsh Ah yep, didn't see the problem that was solving. It also does away with the field
property, which does seem a bit redundant? So my data would become:
{
"dimensions": {
"organisation": [
{
"source": "Portfolio",
"id": "Portfolio"
},
{
"source": "Department/Agency",
"id": "Department/Agency"
},
{
"source": "Outcome",
"id": "Outcome"
},
{
"source": "Program",
"id": "Programme"
}
],
"function": [
{
"source": "Expense type",
"id": "Expense type"
},
{
"source": "Appropriation type",
"id": "Appropriation type"
},
{
"source": "Description",
"id": "Description"
}
]
}
}
I didn't notice till now, but one of the fields, Department/Agency
, contains a slash, and is hence problematic for the source
property?
@rgrp:
Given this point is a bit subtle could you open a separate issue re PKs for dimensions and the fact that they may not be a single field ...
I'm not sure I know what the issue with PKs and fields is clearly enough to raise an issue. Hmm.
I've done some work to support levels in the work on #34 in https://github.com/openspending/fiscal-data-package/commit/1190f199504eaedd8cbed61fc3a7e88f93092c16
The basic answer here is that we can handle the denormed structure in the mapping. I should probably do an example for this as well as the basic structure that is there so leaving this open.
@rgrp looks like you combined the two approaches above?
@stevage first suggested fields as a hash of hashes, using a level
property on those nested hashes. I then showed an earlier example I did where fields is an array and thereby level is indicated by the index of the hash in the array.
The example in the commit uses an array and also each hash has a level property.
@pwalsh @rgrp I think the implicit ordering with an array works well when the each object can express multiple mappings at the same level (as in your example: id
-> budget_tree/id-1
, title
-> budget_tree/title-1
). In the current specification, we're ordering fields, some of which could exist at the same classification level. So I would think an object of objects would be preferable because the array can't implicitly express that two elements are at the same classification level.
Given the Madrid Municipal Budgets dataset:
AÑO,CENTRO,DESCRIPCION CENTRO,SECCION,DESCRIPCION SECCION,PROGRAMA,DESCRIPCION PROGRAMA,CAPITULO,DESCRIPCION CAPITULO,ECONOMICO,DESCRIPCION ECONOMICO,IMPORTE
2012,001,AYUNTAMIENTO DE MADRID,002,PRESIDENCIA DEL PLENO,91100,SECRETARÍA GENERAL DEL PLENO,1,GASTOS DE PERSONAL,12000,SUELDOS DEL GRUPO A1,102739
2012,001,AYUNTAMIENTO DE MADRID,002,PRESIDENCIA DEL PLENO,91100,SECRETARÍA GENERAL DEL PLENO,1,GASTOS DE PERSONAL,12001,SUELDOS DEL GRUPO A2,29354
2012,001,AYUNTAMIENTO DE MADRID,002,PRESIDENCIA DEL PLENO,91100,SECRETARÍA GENERAL DEL PLENO,1,GASTOS DE PERSONAL,12003,SUELDOS DEL GRUPO C1,215832
I would express the administrative classification dimension (CENTRO/SECCION: columns 2-5) this way:
{
"name": "administrative",
"dimensionType": "classification",
"classificationType": "administrative"
"fields": [
{
"name": "CENTRO",
"level": 1,
"source": "CENTRO"
},
{
"name": "DESCRIPCION CENTRO",
"level": 1,
"source": "DESCRIPCION CENTRO"
},
{
"name": "SECCION",
"level": 2,
"source": "SECCION"
},
{
"name": "DESCRIPCION SECCION",
"level": 2,
"source": "DESCRIPCION SECCION"
}
]
}
I still like arrays instead of an extra property. Example:
{
"name": "administrative",
"dimensionType": "classification",
"classificationType": "administrative"
"fields": [
[
{
"name": "CENTRO",
"source": "CENTRO"
},
{
"name": "DESCRIPCION CENTRO",
"source": "DESCRIPCION CENTRO"
},
],
[
{
"name": "SECCION",
"source": "SECCION"
},
{
"name": "DESCRIPCION SECCION",
"source": "DESCRIPCION SECCION"
}
]
]
}
edit: Just want to clarify that the reason why I prefer this for your example is that the objects of the same level are now wrapped in a container (the array) which means the data structure is more explicit.
As an implementor, which of these field properties would I use in an API (i.e. as a slug) and which would I show to a user (i.e. in a query editor of some sort)?
Have you considered a set of properties on the dimension to give an indication as to which of these fields should be used for labels (i.e. for the whole dimension) and as a PK? This might have to be per-level...
@pudo good points. Just specifically about about pks, the example I gave originally has id as pk (from earlier version of spec). The latest example I'm just working with Dan's dilemma.
I certainly hope that when things settle on the mapping/dimension stuff it will be clear what is a primary key and what is a label.
@pwalsh arrays of arrays are a straightforward solution for dimensions with hierarchies, but what about dimensions where there is no hierarchy? It would seem to me that an hash of (or array of) hashes handles both cases more cleanly.
@danfowler maybe I miss the subtleties of the various data structures from CSV we need to represent, but I don't see it in the examples we have above, anyway.
@rgrp: @pwalsh flags this as an issue we should try to resolve very soon for fiscal-data-packager work. For my part, I'm still undecided about the best way forward. In the case of nested arrays to represent hierarchical fields
, it would seem the property itself would need to be renamed to reflect that each element of the array is not a "field
".
I'm a bit lost here reading the examples. Why wouldn't we just have two dimensions - why would we merge into one dimension. Let's sync on this quickly.
Hello,
Just to share something in the Malaysian context of budgeting process at the governmental level. There is no hierarchy of budget in Malaysia. The budget process is in Parallel Situation. An example case reported in the SELCAT (Special Select Committee On Competence, https://en.wikipedia.org/wiki/Selangor_State_Legislative_Assembly#Selcat) report, 2 kids died in accidental death in PPR Flat (public housing) in Kota Damansara (which is situated in Selangor state) on 21st June 2013 and 21st Jan 2015. SELCAT did not touch nor investigate on the death of this 2 children that were classified as police case. This tragedy was caused by the lack of efficiency in terms of maintenance and monitoring for all PPR flats in the state of Selangor. The railing maintenance for PPR flat is covered by the Federal gov but not the State gov nor the local council. The road maintenance in the area is covered by the State gov and local councils but not the Federal gov. Nobody held accountable from this tragedy because there is no clear line for the flow of the money being used.
In contrast to the hierarchy of budget in Cameroon (http://cameroon.openspending.org/en/) where I could see from the visualisation in the link shown the Drill Down Situation, budgets starts from Federal (national) then goes to States and then goes to Local Councils.
Can there be flexibility to apply dimensions for a country like Malaysia where there is no hierarchy?
Example of test file for Federal budget for all ministries: https://docs.google.com/spreadsheets/d/124utUicrzqWcBA2UI6VeABJkFYp_A-OCDTwL79RRx88/export?exportFormat=csv
SELCAT committee: http://dewan.selangor.gov.my/jawatan-pilihan-khas-mengenai-keupayaan
SELCAT report (if you can read in Malay, there's no translation version in English): http://dewan.selangor.gov.my/assets/pdf/Penyata/Penyata-Sidang-2015/KERTAS%20BIL.%2043%20TAHUN%202015-SELCAT_PENYELENGGARAAN%20_%20PEMANTAUAN%20PERUMAHAN%20PPR%20DI%20SGOR.pdf
Rough ideas for now for the flow of money in Malaysia 2015 Budget:
I can't confirm this yet until I receive budget documents for Selangor State Government to see if the flow is correct or not.
What I'm thinking for dimensions:
You can see the reference code by department from the 2015 Budget displayed in the Ministry of Finance Website (http://www.treasury.gov.my/index.php?option=com_content&view=article&id=4298:2015-estimated-federal-expenditure&catid=447&Itemid=2473&lang=en):
We don't have expenditures, so that is why we need to do social audit to confirm on how issues covered without having the drill down hierarchy of budget based on the example that I have mentioned previously from the SELCAT report.
@pudo also pointed out some prior art on representing hierarchies: http://mondrian.pentaho.com/documentation/schema.php#Dimensions,_Hierarchies,_Levels
I am still not really clear what tooling this would support. The only real one is automated visualization or cues for visualization editor. Personally, I would say it would be worth waiting a bit until we are developing those systems to finalise this as it can be informed by actual use. For example, right now it is unclear whether we need to support systems in which a given level always has one and only one child or in which there are multiple children.
I do think there's a benefit to conceptual clarity, as well. Right now, we'd have constructs like datasets in which multiple dimensions (say, cofog1, cofog2, cofog3) all refer to the same thing, at a different level of granularity. This would be made explicit through hierarchies and make any sort of UI-building (not just, but also vis) easier.
FWIW, I think having them as explicit as possible, and defining a key and label attribute for each dimension hierarchy level would be really useful.
Automated visualisation is a huge win. Do we need more examples of tooling? the primary use of fiscal data out of FDP and OS is visualisation of that data.
Another clear use case would be cues to create better data structures in other, non-CSV formats when reading data out of a FDP more generally (JSON, RDF, SQL, etc.).
Great! I suggest that @pudo / @pwalsh / anyone else actually building the viz systems drive a bit here.
@pwalsh I'm personally still really quite sceptical of fully automated viz - so far I always see users needing to do something (e.g. choosing the columns to filter or aggregate on).
questions right now:
I agree with @pwalsh that a hierarchy defined through array order is less fuss than having explicit level
keys per field. But I also think that with the way we are understanding dimensions, all fields in a given dimension will not necessarily be positioned at a particular level of a hierarchy (e.g. these common dimensions have fields like "id", "title", and "description" with no hierarchy w/r/t to each other), so relying on that implicit ordering (especially for e.g. automated viz) is problematic. Also, strictly speaking, a dimension can have multiple hierarchies as well, which neither the level
key or arrays of fields can solve properly.
Proposal: let's remove the level
key per field. Instead, let's add a hierarchies
object in which each key points to an array of field name
s whose order defines a named hierarchy within the dimension.
"hierarchies": {
"default": ["centro","seccion"]
}
As far as the normalization problem in the madrid budget example, I think this might deserve a separate issue.
What do you think @pudo @rgrp @pwalsh @stevage?
@danfowler you've lost me, I'm sorry :).
I'm not sure what you mean above with this statement:
all fields in a given dimension will not necessarily be positioned at a particular level of a hierarchy ... so relying on that implicit ordering ... is problematic.
We need to clearly outline the cases with examples.
At the most basic, essential level, I see the following two cases we need to support:
Of course, this is what the spec currently covers.
# data.csv
amount,function_name,function_code
10,Education,0100
20,Arts,0200
# snippet from datapackage.json
{
"name": "function",
"dimensionType": "classification",
"fields": [
{ "name": "title", "source": "function_name" },
{ "name": "id", "source": "function_code" }
]
}
The spec has no treatment for this.
# data.csv
amount,function1_name,function1_code,function2_name,function2_code
10,Education,0100,Syllabus,0111
20,Arts,0200,Materials,0211
# snippet from datapackage.json
{
"name": "function",
"dimensionType": "classification",
"fields": [
[
{ "name": "title", "source": "function1_name" },
{ "name": "id", "source": "function1_code" }
],
[
{ "name": "title", "source": "function2_name" },
{ "name": "id", "source": "function2_code" }
]
]
}
Of course, the above nested arrays look awkward.
At some point, fields became an array where it was previously an object. I don't know why - I think an object is better.
I guess it was because in JSON Table Schema fields is an array, but there it is necessary as JSON Table Schema allows columns with duplicate names, and therefore column order is important to the spec. Here, we don't allow duplicate names, and column order is not important.
Therefore, if fields was back to being an object, I'd write the dimensions like this, which is less verbose and easier for hierarchical representation:
# normal (non-hierarchical)
{
"name": "function",
"dimensionType": "classification",
"fields": {
"id": {
"source": "function1_code"
},
"title": {
"source": "function1_name"
}
}
}
# hierarchical
{
"name": "function",
"dimensionType": "classification",
"fields": [
{
"id": {
"source": "function1_code"
},
"title": {
"source": "function1_name"
}
},
{
"id": {
"source": "function2_code"
},
"title": {
"source": "function2_name"
}
}
]
}
As far as I see, this data structure would be more pleasant to write code around:
if fields is an object, then i have a flat dimension object. if fields is an array, then I have a indexed hierarchy of dimension objects.
Lastly, about multiple hierarchies: that multiple hierarchy dimension example from the mondrian docs does not relate to any examples we've proposed here, as far as I see, so i suggest we simply ignore that data structure right now.
@pwalsh Just to double check I've understood this last example, this is a single hierarchy consisting of two levels, each of which is a two part field? That is, a file like:
description,function1_code,function1_name,function2_code,function2_name,amount
And our mooted vis tool is going to present function1_name
as a first level of grouping (presumably using function1_code
as the actual value), and the user can drill down to look at function2_name
s?
Just curious if this is derived from an actual example - levels of hierarchy where each level is spread across several fields?
Anyway, I don't have much opinion on arrays vs objects, but suspect you're probably right about objects. I'm slightly uneasy about fields that can be of different types (array of objects vs single object). But the alternative (an explicit hierarchies
property naming the dimensions just defined) also is slightly messy.
Question: in your model, would you call one hierarchy "a dimension" (as opposed to "a hierarchy of dimensions")? Is that standard nomenclature?
Anyway, as far as I can tell, Paul's proposal meets our needs and doesn't have anything major wrong with it?
@stevage yes, a single heirarchy of two levels as you describe. Unfortunately, this pattern is extremely common and hence, why we are discussing the support for it for so long.
I've given this example (as well as the trail of others reachable from this thread) as I'm eager that we get this right, but @danfowler is leading the spec at the moment and has so far had the most experience with working actual data into FDP, so I'm following his lead on this for a decision.
@danfowler you were going to summarize our discussion friday on this.
I'm inclined to agree with @pwalsh that, regardless of what we think about hierarchies, it would be useful to group multiple columns of source data that refer to the same thing (commonly, code
and description
for some level of budgetary classification) into a single concept/object in the mapping
. It is a super common structure for the budget datasets I've seen so far, and I can't see how to move forward in an implementation like fiscal data packager without pinning like fields with like. Like @stevage, I'm concerned with the idea of fields
being one of two types, so I'll add the modest change that a dimension always has a fields
array, but that a fields
array where length == 1 would indicate a non-hierarchical dimension.
I think one thing that gets in the way of thinking about this problem clearly is that once we've started grouping columns in this way, we no longer have a good name for the objects in the fields
array that makes up a given dimension. The objects that populate the fields
array of a dimension (being made up of potentially several source columns) are very different from the object in the fields
array of a JSON Table Schema (a single column each). This is not necessarily a problem, obviously, but it would be useful to think if we should rename the fields
array in this scenario.
Can there be flexibility to apply dimensions for a country like Malaysia where there is no hierarchy?
@Hazwany thanks so much for your input into this discussion. I just wanted to say that we are talking very specifically about how to model clearly defined, existing budgetary hierarchies into a budget dataset using the fiscal data package format. You should still be able to model aspects of the Malaysian budgets using dimensions even in the absence of explicit hierarchies.
@rgrp we were talking about pushing for demonstrated impact on code as well how to also support budget datasets with hierarchical levels built into compound codes which is the typical COFOG case. I must confess that I haven't progressed much in how to marry these models with the ones described above.
code | level1 | level2 | level3 |
---|---|---|---|
01.02 | 01 | 02 | |
01.02.03 | 01 | 02 | 03 |
@danfowler i think it would be worth copying the examples i had in skype of the options and also a worked through example of going from a source file to a denormalized structure we might have in a database and finally how an aggregate query would work.
Thanks all for the great discussion and feedback. I am closing this issue as:
level
keyword per dimension field
) for representing hierarchies in a dimension (as we currently understand dimensions) as of this commit.This, of course, does not mean that it is the perfect solution, so I'll open up a new issue to try to bring some clarity on how we represent dimensions more generally.
The classic case is functional classification, but it could apply to any dimension potentially (projects, entities, etc.)
How do we use
mapping.{dimension}
to represent a tree, when the tree is denormalised in a single resource?Some recent discussion around this: