Representing dimensions which are trees, denormalised in a single resource

pwalsh commented 9 years ago

The classic case is functional classification, but it could apply to any dimension potentially (projects, entities, etc.)

How do we use mapping.{dimension} to represent a tree, when the tree is denormalised in a single resource?

Some recent discussion around this:

stevage commented 9 years ago

I also just ran into this, so thanks for bringing it up :)

For the sake of concreteness, an example (one line of header, one line of data):

Portfolio,Department/Agency,Outcome,Program,Expense type,Appropriation type,Description,2014-15,2015-16,2016-17,2017-18,2018-19,Source document,Source table,URL
Treasury Portfolio,Australian Accounting Standards Board (AASB),Outcome 1: The formulation and making of accounting standards that are used by Australian entities to prepare financial reports and enable users of these reports to make informed decisions,Programme 1.1: Australian Accounting Standards Board,Departmental expenses,Departmental appropriation & Expenses not requiring appropriation in the budget year,,4706,4631,4528,4542,4555,Australian Accounting Standards Board (AASB) 2015-16 PBS,Table 2.1,https://data.gov.au/dataset/budget-2015-16-tables-and-data/resource/91bfbe9f-714b-4a7b-8579-9cb2aec85209

The first four columns are organisational structure: Portfolio,Department/Agency,Outcome,Program

I think dimensions will be most useful if we are somewhat prescriptive about different types of dimensions. Two very common types of hierarchy (ok, I'm guessing) would be "organisational structure" (departments, programmes, projects) and function (eg, COFOG). Combining that view, your suggestion to remove dimensionType and the need to express hierarchy, we'd get:

    "dimensions": {
        "organisation": {
            "fields": {
                "Portfolio": {
                    "source": "Portfolio",
                    "level": 1
                },
                "Department/Agency": {
                    "source": "Department/Agency",
                    "level": 2
                },
                "Outcome": {
                    "source": "Outcome",
                    "level": 3
                },
                "Programme": {
                    "source": "Program",
                    "level": 4
                }
            }
        },
        "function": {
            "fields": {
                "Expense type": {
                    "source": "Expense type",
                    "level": 1
                },
                "Appropriation type": {
                    "source": "Appropriation type",
                    "level": 2
                },
                "Description": {
                    "source": "Description",
                    "level": 3
                }
            }
        }
    }

Maybe there's a better way to express the hierarchy than the level attribute - but of course JSON dictionaries are unordered.

Also noting in passing that the Fiscal spec isn't very clear about what it means for a dimension to be made up of more than one field. Is it because there needs to be a primary key, and that key might be made up of multiple fields? With this dataset, I think the primary key might be the first seven fields actually. :/

rufuspollock commented 9 years ago

@stevage @pwalsh I think we are right on needing some kind of level attribute to indicate hierarachy and think using level in this way seems the obvious way to start.

@stevage dimensions can definitely have multiple fields. Re the PK issue that is common with these datasets as they don't generate a primary key. I'm actually think that in these circumstances one would want to generate a PK (the PK stuff is more essential when you already had normalised source data - if the data is denormalized linking the "dimension" with the fact table is trivial). Given this point is a bit subtle could you open a separate issue re PKs for dimensions and the fact that they may not be a single field ...

pwalsh commented 9 years ago

Why not an array of hashes, so level is explicit in the sequence, as per the example I linked to: https://discuss.okfn.org/t/open-spending-data-structure-ideas-and-suggestions/300/20

stevage commented 9 years ago

@pwalsh Ah yep, didn't see the problem that was solving. It also does away with the field property, which does seem a bit redundant? So my data would become:

{
    "dimensions": {
        "organisation": [
            {
                "source": "Portfolio",
                "id": "Portfolio"
            },
            {
                "source": "Department/Agency",
                "id": "Department/Agency"
            },
            {
                "source": "Outcome",
                "id": "Outcome"
            },
            {
                "source": "Program",
                "id": "Programme"
            }
        ],
        "function": [
            {
                "source": "Expense type",
                "id": "Expense type"
            },
            {
                "source": "Appropriation type",
                "id": "Appropriation type"
            },
            {
                "source": "Description",
                "id": "Description"
            }
        ]
    }
}

I didn't notice till now, but one of the fields, Department/Agency, contains a slash, and is hence problematic for the source property?

@rgrp:

Given this point is a bit subtle could you open a separate issue re PKs for dimensions and the fact that they may not be a single field ...

I'm not sure I know what the issue with PKs and fields is clearly enough to raise an issue. Hmm.

rufuspollock commented 9 years ago

I've done some work to support levels in the work on #34 in https://github.com/openspending/fiscal-data-package/commit/1190f199504eaedd8cbed61fc3a7e88f93092c16

The basic answer here is that we can handle the denormed structure in the mapping. I should probably do an example for this as well as the basic structure that is there so leaving this open.

pwalsh commented 9 years ago

@rgrp looks like you combined the two approaches above?

@stevage first suggested fields as a hash of hashes, using a level property on those nested hashes. I then showed an earlier example I did where fields is an array and thereby level is indicated by the index of the hash in the array.

The example in the commit uses an array and also each hash has a level property.

danfowler commented 9 years ago

@pwalsh @rgrp I think the implicit ordering with an array works well when the each object can express multiple mappings at the same level (as in your example: id -> budget_tree/id-1, title -> budget_tree/title-1). In the current specification, we're ordering fields, some of which could exist at the same classification level. So I would think an object of objects would be preferable because the array can't implicitly express that two elements are at the same classification level.

Given the Madrid Municipal Budgets dataset:

AÑO,CENTRO,DESCRIPCION CENTRO,SECCION,DESCRIPCION SECCION,PROGRAMA,DESCRIPCION PROGRAMA,CAPITULO,DESCRIPCION CAPITULO,ECONOMICO,DESCRIPCION ECONOMICO,IMPORTE
2012,001,AYUNTAMIENTO DE MADRID,002,PRESIDENCIA DEL PLENO,91100,SECRETARÍA GENERAL DEL PLENO,1,GASTOS DE PERSONAL,12000,SUELDOS DEL GRUPO A1,102739
2012,001,AYUNTAMIENTO DE MADRID,002,PRESIDENCIA DEL PLENO,91100,SECRETARÍA GENERAL DEL PLENO,1,GASTOS DE PERSONAL,12001,SUELDOS DEL GRUPO A2,29354
2012,001,AYUNTAMIENTO DE MADRID,002,PRESIDENCIA DEL PLENO,91100,SECRETARÍA GENERAL DEL PLENO,1,GASTOS DE PERSONAL,12003,SUELDOS DEL GRUPO C1,215832

I would express the administrative classification dimension (CENTRO/SECCION: columns 2-5) this way:

 {
  "name": "administrative",
  "dimensionType": "classification",
  "classificationType": "administrative"
  "fields": [
    {
      "name": "CENTRO",
      "level": 1,
      "source": "CENTRO"
    },
    {
      "name": "DESCRIPCION CENTRO",
      "level": 1,
      "source": "DESCRIPCION CENTRO"
    },
    {
      "name": "SECCION",
      "level": 2,
      "source": "SECCION"
    },
    {
      "name": "DESCRIPCION SECCION",
      "level": 2,
      "source": "DESCRIPCION SECCION"
    }
  ]
}

pwalsh commented 9 years ago

I still like arrays instead of an extra property. Example:

 {
  "name": "administrative",
  "dimensionType": "classification",
  "classificationType": "administrative"
  "fields": [
    [
      {
        "name": "CENTRO",
        "source": "CENTRO"
      },
      {
        "name": "DESCRIPCION CENTRO",
        "source": "DESCRIPCION CENTRO"
      },
    ],
    [
      {
        "name": "SECCION",
        "source": "SECCION"
      },
      {
        "name": "DESCRIPCION SECCION",
        "source": "DESCRIPCION SECCION"
      }
    ]
  ]
}

edit: Just want to clarify that the reason why I prefer this for your example is that the objects of the same level are now wrapped in a container (the array) which means the data structure is more explicit.

pudo commented 9 years ago

As an implementor, which of these field properties would I use in an API (i.e. as a slug) and which would I show to a user (i.e. in a query editor of some sort)?

Have you considered a set of properties on the dimension to give an indication as to which of these fields should be used for labels (i.e. for the whole dimension) and as a PK? This might have to be per-level...

pwalsh commented 9 years ago

@pudo good points. Just specifically about about pks, the example I gave originally has id as pk (from earlier version of spec). The latest example I'm just working with Dan's dilemma.

I certainly hope that when things settle on the mapping/dimension stuff it will be clear what is a primary key and what is a label.

danfowler commented 9 years ago

@pwalsh arrays of arrays are a straightforward solution for dimensions with hierarchies, but what about dimensions where there is no hierarchy? It would seem to me that an hash of (or array of) hashes handles both cases more cleanly.

pwalsh commented 9 years ago

@danfowler maybe I miss the subtleties of the various data structures from CSV we need to represent, but I don't see it in the examples we have above, anyway.

danfowler commented 9 years ago

@rgrp: @pwalsh flags this as an issue we should try to resolve very soon for fiscal-data-packager work. For my part, I'm still undecided about the best way forward. In the case of nested arrays to represent hierarchical fields, it would seem the property itself would need to be renamed to reflect that each element of the array is not a "field".

rufuspollock commented 9 years ago

I'm a bit lost here reading the examples. Why wouldn't we just have two dimensions - why would we merge into one dimension. Let's sync on this quickly.

Hazwany commented 9 years ago

Hello,

Just to share something in the Malaysian context of budgeting process at the governmental level. There is no hierarchy of budget in Malaysia. The budget process is in Parallel Situation. An example case reported in the SELCAT (Special Select Committee On Competence, https://en.wikipedia.org/wiki/Selangor_State_Legislative_Assembly#Selcat) report, 2 kids died in accidental death in PPR Flat (public housing) in Kota Damansara (which is situated in Selangor state) on 21st June 2013 and 21st Jan 2015. SELCAT did not touch nor investigate on the death of this 2 children that were classified as police case. This tragedy was caused by the lack of efficiency in terms of maintenance and monitoring for all PPR flats in the state of Selangor. The railing maintenance for PPR flat is covered by the Federal gov but not the State gov nor the local council. The road maintenance in the area is covered by the State gov and local councils but not the Federal gov. Nobody held accountable from this tragedy because there is no clear line for the flow of the money being used.

screen shot 2015-11-06 at 4 37 10 pm

In contrast to the hierarchy of budget in Cameroon (http://cameroon.openspending.org/en/) where I could see from the visualisation in the link shown the Drill Down Situation, budgets starts from Federal (national) then goes to States and then goes to Local Councils.

Can there be flexibility to apply dimensions for a country like Malaysia where there is no hierarchy?

Example of test file for Federal budget for all ministries: https://docs.google.com/spreadsheets/d/124utUicrzqWcBA2UI6VeABJkFYp_A-OCDTwL79RRx88/export?exportFormat=csv

SELCAT committee: http://dewan.selangor.gov.my/jawatan-pilihan-khas-mengenai-keupayaan

SELCAT report (if you can read in Malay, there's no translation version in English): http://dewan.selangor.gov.my/assets/pdf/Penyata/Penyata-Sidang-2015/KERTAS%20BIL.%2043%20TAHUN%202015-SELCAT_PENYELENGGARAAN%20_%20PEMANTAUAN%20PERUMAHAN%20PPR%20DI%20SGOR.pdf

Hazwany commented 9 years ago

Rough ideas for now for the flow of money in Malaysia 2015 Budget: screen shot 2015-11-06 at 10 15 18 pm

I can't confirm this yet until I receive budget documents for Selangor State Government to see if the flow is correct or not.

What I'm thinking for dimensions: screen shot 2015-11-06 at 10 29 31 pm

You can see the reference code by department from the 2015 Budget displayed in the Ministry of Finance Website (http://www.treasury.gov.my/index.php?option=com_content&view=article&id=4298:2015-estimated-federal-expenditure&catid=447&Itemid=2473&lang=en): screen shot 2015-11-06 at 10 20 49 pm

screen shot 2015-11-06 at 10 26 53 pm

We don't have expenditures, so that is why we need to do social audit to confirm on how issues covered without having the drill down hierarchy of budget based on the example that I have mentioned previously from the SELCAT report.

pwalsh commented 9 years ago

@pudo also pointed out some prior art on representing hierarchies: http://mondrian.pentaho.com/documentation/schema.php#Dimensions,_Hierarchies,_Levels

rufuspollock commented 9 years ago

I am still not really clear what tooling this would support. The only real one is automated visualization or cues for visualization editor. Personally, I would say it would be worth waiting a bit until we are developing those systems to finalise this as it can be informed by actual use. For example, right now it is unclear whether we need to support systems in which a given level always has one and only one child or in which there are multiple children.

pudo commented 9 years ago

I do think there's a benefit to conceptual clarity, as well. Right now, we'd have constructs like datasets in which multiple dimensions (say, cofog1, cofog2, cofog3) all refer to the same thing, at a different level of granularity. This would be made explicit through hierarchies and make any sort of UI-building (not just, but also vis) easier.

FWIW, I think having them as explicit as possible, and defining a key and label attribute for each dimension hierarchy level would be really useful.

pwalsh commented 9 years ago

Automated visualisation is a huge win. Do we need more examples of tooling? the primary use of fiscal data out of FDP and OS is visualisation of that data.

Another clear use case would be cues to create better data structures in other, non-CSV formats when reading data out of a FDP more generally (JSON, RDF, SQL, etc.).

rufuspollock commented 9 years ago

Great! I suggest that @pudo / @pwalsh / anyone else actually building the viz systems drive a bit here.

@pwalsh I'm personally still really quite sceptical of fully automated viz - so far I always see users needing to do something (e.g. choosing the columns to filter or aggregate on).

questions right now:

Do we just need single direction hierarchy? If so "level" value could work (otherwise parent stuff needed and processing model to be worked out)
How do we deal with normalization "problem" - cf madrid budget example from @danfowler above.

danfowler commented 8 years ago

I agree with @pwalsh that a hierarchy defined through array order is less fuss than having explicit level keys per field. But I also think that with the way we are understanding dimensions, all fields in a given dimension will not necessarily be positioned at a particular level of a hierarchy (e.g. these common dimensions have fields like "id", "title", and "description" with no hierarchy w/r/t to each other), so relying on that implicit ordering (especially for e.g. automated viz) is problematic. Also, strictly speaking, a dimension can have multiple hierarchies as well, which neither the level key or arrays of fields can solve properly.

Proposal: let's remove the level key per field. Instead, let's add a hierarchies object in which each key points to an array of field names whose order defines a named hierarchy within the dimension.

"hierarchies": {
  "default": ["centro","seccion"]
}

As far as the normalization problem in the madrid budget example, I think this might deserve a separate issue.

What do you think @pudo @rgrp @pwalsh @stevage?

pwalsh commented 8 years ago

@danfowler you've lost me, I'm sorry :).

I'm not sure what you mean above with this statement:

all fields in a given dimension will not necessarily be positioned at a particular level of a hierarchy ... so relying on that implicit ordering ... is problematic.

We need to clearly outline the cases with examples.

At the most basic, essential level, I see the following two cases we need to support:

non-hierarchical dimensions
hierarchical dimensions

non-hierarchical dimensions

Of course, this is what the spec currently covers.

# data.csv
amount,function_name,function_code
10,Education,0100
20,Arts,0200

# snippet from datapackage.json
{
  "name": "function",
  "dimensionType": "classification",
  "fields": [
    { "name": "title", "source": "function_name" },
    { "name": "id", "source": "function_code" }
  ]
}

hierarchical dimensions

The spec has no treatment for this.

# data.csv
amount,function1_name,function1_code,function2_name,function2_code
10,Education,0100,Syllabus,0111
20,Arts,0200,Materials,0211

# snippet from datapackage.json
{
  "name": "function",
  "dimensionType": "classification",
  "fields": [
    [
      { "name": "title", "source": "function1_name" },
      { "name": "id", "source": "function1_code" }
    ],
    [
      { "name": "title", "source": "function2_name" },
      { "name": "id", "source": "function2_code" }
    ]
  ]
}

Of course, the above nested arrays look awkward.

At some point, fields became an array where it was previously an object. I don't know why - I think an object is better.

I guess it was because in JSON Table Schema fields is an array, but there it is necessary as JSON Table Schema allows columns with duplicate names, and therefore column order is important to the spec. Here, we don't allow duplicate names, and column order is not important.

Therefore, if fields was back to being an object, I'd write the dimensions like this, which is less verbose and easier for hierarchical representation:

# normal (non-hierarchical)
{
  "name": "function",
  "dimensionType": "classification",
  "fields": {
    "id": {
      "source": "function1_code"
    },
    "title": {
      "source": "function1_name"
    }
  }
}

# hierarchical
{
  "name": "function",
  "dimensionType": "classification",
  "fields": [
    {
      "id": {
        "source": "function1_code"
      },
      "title": {
        "source": "function1_name"
      }
    },
    {
      "id": {
        "source": "function2_code"
      },
      "title": {
        "source": "function2_name"
      }
    }
  ]
}

As far as I see, this data structure would be more pleasant to write code around:

if fields is an object, then i have a flat dimension object. if fields is an array, then I have a indexed hierarchy of dimension objects.

Lastly, about multiple hierarchies: that multiple hierarchy dimension example from the mondrian docs does not relate to any examples we've proposed here, as far as I see, so i suggest we simply ignore that data structure right now.

stevage commented 8 years ago

@pwalsh Just to double check I've understood this last example, this is a single hierarchy consisting of two levels, each of which is a two part field? That is, a file like:

description,function1_code,function1_name,function2_code,function2_name,amount

And our mooted vis tool is going to present function1_name as a first level of grouping (presumably using function1_code as the actual value), and the user can drill down to look at function2_names?

Just curious if this is derived from an actual example - levels of hierarchy where each level is spread across several fields?

Anyway, I don't have much opinion on arrays vs objects, but suspect you're probably right about objects. I'm slightly uneasy about fields that can be of different types (array of objects vs single object). But the alternative (an explicit hierarchies property naming the dimensions just defined) also is slightly messy.

Question: in your model, would you call one hierarchy "a dimension" (as opposed to "a hierarchy of dimensions")? Is that standard nomenclature?

Anyway, as far as I can tell, Paul's proposal meets our needs and doesn't have anything major wrong with it?

pwalsh commented 8 years ago

@stevage yes, a single heirarchy of two levels as you describe. Unfortunately, this pattern is extremely common and hence, why we are discussing the support for it for so long.

I've given this example (as well as the trail of others reachable from this thread) as I'm eager that we get this right, but @danfowler is leading the spec at the moment and has so far had the most experience with working actual data into FDP, so I'm following his lead on this for a decision.

rufuspollock commented 8 years ago

@danfowler you were going to summarize our discussion friday on this.

danfowler commented 8 years ago

I'm inclined to agree with @pwalsh that, regardless of what we think about hierarchies, it would be useful to group multiple columns of source data that refer to the same thing (commonly, code and description for some level of budgetary classification) into a single concept/object in the mapping. It is a super common structure for the budget datasets I've seen so far, and I can't see how to move forward in an implementation like fiscal data packager without pinning like fields with like. Like @stevage, I'm concerned with the idea of fields being one of two types, so I'll add the modest change that a dimension always has a fields array, but that a fields array where length == 1 would indicate a non-hierarchical dimension.

I think one thing that gets in the way of thinking about this problem clearly is that once we've started grouping columns in this way, we no longer have a good name for the objects in the fields array that makes up a given dimension. The objects that populate the fields array of a dimension (being made up of potentially several source columns) are very different from the object in the fields array of a JSON Table Schema (a single column each). This is not necessarily a problem, obviously, but it would be useful to think if we should rename the fields array in this scenario.

danfowler commented 8 years ago

Can there be flexibility to apply dimensions for a country like Malaysia where there is no hierarchy?

@Hazwany thanks so much for your input into this discussion. I just wanted to say that we are talking very specifically about how to model clearly defined, existing budgetary hierarchies into a budget dataset using the fiscal data package format. You should still be able to model aspects of the Malaysian budgets using dimensions even in the absence of explicit hierarchies.

danfowler commented 8 years ago

@rgrp we were talking about pushing for demonstrated impact on code as well how to also support budget datasets with hierarchical levels built into compound codes which is the typical COFOG case. I must confess that I haven't progressed much in how to marry these models with the ones described above.

code	level1	level2	level3
01.02	01	02
01.02.03	01	02	03

rufuspollock commented 8 years ago

@danfowler i think it would be worth copying the examples i had in skype of the options and also a worked through example of going from a source file to a denormalized structure we might have in a database and finally how an aggregate query would work.

danfowler commented 8 years ago

Thanks all for the great discussion and feedback. I am closing this issue as:

We actually do have an existing treatment in the spec (level keyword per dimension field) for representing hierarchies in a dimension (as we currently understand dimensions) as of this commit.
Various points were raised in the discussion (e.g. how to represent labels and ids, what exactly is dimension field) that, although obviously interrelated, do merit their own issue and discussion.

This, of course, does not mean that it is the perfect solution, so I'll open up a new issue to try to bring some clarity on how we represent dimensions more generally.

openspending / fiscal-data-package

Representing dimensions which are trees, denormalised in a single resource #48

non-hierarchical dimensions

hierarchical dimensions