Datasets with no date column

danfowler commented 9 years ago

Many CSVs don't have explicit date columns (e.g. Madrid municipal budgets) but rather an implicit date at the file level. One could specify fiscalPeriod at the top-level of the datapackage, but (if I understand correctly) those are two different concepts (e.g. the file could be a month's spending data while the fiscalPeriod would be the quarter/year in which that spending occurred).

If our view into the data is through the logical model, how to expose the "Date / Time" information about the data via the logical model without, say, explicitly modifying the physical model (CSV) by adding a date column and mapping it through?

pudo commented 9 years ago

Just to emphasise this once more: there is no metadata/data distinction on this stuff :)

https://discuss.okfn.org/t/a-thought-about-fiscal-data-standardisation/949

rufuspollock commented 9 years ago

There are two possible cases here:

The dataset has one consistent year. In this case it does not really matter that this is then in the "data" and this may be specified as fiscalPeriod and that is it.
The source data has multiple files corresponding to different years or dates. In this case I'd recommend transforming the data when you data package it and adding a column for the date.

In either case I don't think we need to change the metadata here.

pwalsh commented 9 years ago

@rgrp if specified as fiscalPeriod, then as far as I see, we have no way to map this into the mapping as a date dimension.

mapping is currently grounded in the concept of physical > logical model, and explicitly sees the physical model as "the files in the resources array".

Because there is, at least implicitly, a distinction here between metadata and data, it makes it hard for me to see how we could, in the first scenario you highlight above, use fiscalPeriod to inform a datetime dimension in the mapping object.

rufuspollock commented 9 years ago

@pwalsh the point is you would not map this into the date dimension as it wouldn't matter - the data only has one period which is the same throughout so you could leave it out. That's the suggestion there.

stevage commented 9 years ago

Dan and I chatted about this. I first wanted to check whether we intend to support these three ways of expressing dates through CSV files:

1.Year as a dimension

amount,year
1000,2014
1200,2015

2.Year described in datapackage.json - one year for whole file

amount
1000

3. Multiple years side by side (example)

amount_2014, amount_2015
1000,1200

If we do intend to support all of these (and afaik, they're all relatively common), then we have some questions:

Can a datapackage.json contain multiple fiscal years?
If so, how would that be expressed for side-by-side format (3)? Something like this perhaps:

"measures": {
      "amount": {
        "source": "2015-16",
        "currency": "AUD",
        "factor": 1000,
        "year": 2016
      },
      "amount2014": {
        "source": "2014-15",
        "currency": "AUD",
        "factor": 1000,
        "year": 2014
      }
    }

If some of these formats are not going to be supported, perhaps we should have clear statements about what those are. That is, what the limitations of the physical->logical mapping are.

pwalsh commented 9 years ago

@stevage i understand the general gist of what you are saying, but the example confuses me as I can't imagine a single budget line could have two measures where each measure represents a different year? It seems at the edge of edge cases to me.

stevage commented 9 years ago

You're talking about format 3? I linked to an example...which is the Australian federal budget format. Source:

http://data.gov.au/dataset/5b54386d-4b46-4736-87c5-28bd5ee38bcc/resource/365051dd-9335-4c2b-8c29-331718e079eb/download/201505181230budget1516.csv

http://datapipes.okfnlabs.org/csv/head%C2%A010/html?url=http://data.gov.au/dataset/5b54386d-4b46-4736-87c5-28bd5ee38bcc/resource/365051dd-9335-4c2b-8c29-331718e079eb/download/201505181230budget1516.csv

Portfolio,Department/Agency,Outcome,Program,Expense type,Appropriation type,Description,2014-15,2015-16,2016-17,2017-18,2018-19,Source document,Source table,URL
Treasury Portfolio,Australian Accounting Standards Board (AASB),Outcome 1: The formulation and making of accounting standards that are used by Australian entities to prepare financial reports and enable users of these reports to make informed decisions,Programme 1.1: Australian Accounting Standards Board,Departmental expenses,Departmental appropriation & Expenses not requiring appropriation in the budget year,,4706,4631,4528,4542,4555,Australian Accounting Standards Board (AASB) 2015-16 PBS,Table 2.1,https://data.gov.au/dataset/budget-2015-16-tables-and-data/resource/91bfbe9f-714b-4a7b-8579-9cb2aec85209

I'm not sure if that's a "single budget line" in your terminology, but it's a single line in a CSV.

The difference between format 1 and 3 is really just the difference between "long format" and "wide format" (also known as normalized vs denormalized).

pwalsh commented 9 years ago

@stevage yeah ok, I remember this pattern from the Australian data. I'm not sure what I think of it really.... but, great example.

pwalsh commented 9 years ago

Dumping notes for later:

date/source: "budget/date" # physical mapping
date/source: 2015 # virtual
date/source: "fiscalPeriod" # descriptor mapping

pwalsh commented 9 years ago

So just a small elaboration on my note above.

The original problem @danfowler demonstrates above is just an instance of something more generic to solve in the spec.

As I noted above, the mapping object of the spec was born out of mapping a physical model to a logical one, with the further explicit assumption that the physical model is what is in the CSV resources.

But, here we have an example where an important dimension (date) is not declared in the physical model: it could be any other dimension really but this scenario for date is quite common in budgetary data.

If we accept, as I'd argue, that our mapping should be the single point of truth for our data access, then, we need to accept that the mapping must map not only to columns of a CSV resource, but also to one or both of:

a raw value
another top-level property of the Fiscal Data Package descriptor (which itself represents a raw value)

By example, my suggestion is that the source value of any dimension can be used in each of the three following ways:

# current use
{
  ...
  "fiscalPeriod": 2015,
  "resources": [
    {
      "name": "budget",
      "title": "Budget",
      "path": "budget.csv",
      "schema": {
        "fields": [
          {
            "name": "expenditure",
            "type": "number",
            "format": "currency"
          },
          {
            "name": "year",
            "type": "date"
          }
        ]
      }
    }
  ],
  "mapping": {
    "measures": [
      {
        "name": "amount",
        "source": "expenditure",
        "currency": "USD",
        "factor": 1
      }
    ],
    "dimensions": [
      {
        "name": "date",
        "fields": [
          {
            "name": "year",
            "source": "year"
          }
        ]
      }
    ]
  }
}

# map directly to a value. take this value for each budget line
{
  ...
  "fiscalPeriod": 2015,
  "resources": [
    {
      "name": "budget",
      "title": "Budget",
      "path": "budget.csv",
      "schema": {
        "fields": [
          {
            "name": "expenditure",
            "type": "number",
            "format": "currency"
          }
        ]
      }
    }
  ],
  "mapping": {
    "measures": [
      {
        "name": "amount",
        "source": "expenditure",
        "currency": "USD",
        "factor": 1
      }
    ],
    "dimensions": [
      {
        "name": "date",
        "fields": [
          {
            "name": "year",
            "source": 2015
          }
        ]
      }
    ]
  }
}

# map to a top-level property of the descriptor. take this value for each budget line.
{
  ...
  "fiscalPeriod": 2015,
  "resources": [
    {
      "name": "budget",
      "title": "Budget",
      "path": "budget.csv",
      "schema": {
        "fields": [
          {
            "name": "expenditure",
            "type": "number",
            "format": "currency"
          }
        ]
      }
    }
  ],
  "mapping": {
    "measures": [
      {
        "name": "amount",
        "source": "expenditure",
        "currency": "USD",
        "factor": 1
      }
    ],
    "dimensions": [
      {
        "name": "date",
        "fields": [
          {
            "name": "year",
            "source": "fiscalPeriod"
          }
        ]
      }
    ]
  }
}

rufuspollock commented 9 years ago

@pwalsh i think at least the middle option seems really interesting though we would need to think about it e.g. i think overloading "source" is probably a bad idea. I'm a -0 i think on the third option of referencing fiscalPeriod as I think it adds cognitive complexity and indirection.

@danfowler @pwalsh @stevage if we added this approach to the spec would it work for people?

rufuspollock commented 9 years ago

There are various cases:

A dataset with one year / time period but not specified in the CSV
A dataset with multiple periods "denormed" in the dataset - like the one @stevage showed
A dataset with multiple periods one in each CSV - whether we even allow split data is a question for #75

My sense is that both items 2+3 introduce a huge amount of complexity for consumers. Basically you have to create the date dimension and then correctly create values for that column. In addition defining them a dimension is clearly not right and trying to create FKs to date dimension is tricky. My feeling in this case is we leave it to data provider to do this normalization rather than attempting to describe how to do it in the spec. That's my inclination but not a certainty so comments are welcome.

On point 1 I'm still in two minds. I just don't know how common it is and how often it will end up merging into 2 or 3.

pwalsh commented 9 years ago

I think point 1 is incredibly common.

As per my examples, and because of this common scenario, I think we need a pattern in dimensions that allows setting a value, and not just a pointer to a column.

It just seems so incredibly useful to me right now that I'm looking for someone to make a strong counter argument to show it is wrong :).

@rgrp about points 2 and 3 I def agree that we'd require transforms on the source data.

rufuspollock commented 9 years ago

@pwalsh ok, let's go for solving point 1 right now and continue to consider point 2 and 3 (maybe, we'll find it isn't too tough to support these in client apps ...)

stevage commented 9 years ago

[1] date/source: "budget/date" # physical mapping [2] date/source: 2015 # virtual [3] date/source: "fiscalPeriod" # descriptor mapping

Thanks so much for starting this discussion. A slightly different way of looking at this is saying we're allowing expressions [3], including literals [2], instead of only allowing field references [1].

In general, I think allowing literals in a data package is probably useful (and may solve other problems beyond this one), but there should be a clean way of expressing them. Like Rufus, I'm uneasy about relying on the difference in type (ie, 2015 instead of "2015") to distinguish between a field reference and a literal.

Maybe an approach here would be to merge [1], [2] and [3] using something like JSON schema syntax:

// property on the resources object in the data package. To define a literal, then just make sure
// there is a literal somewhere that we can point to
{
  source: { $ref: "#/resources/foo/fiscalPeriod" }
}

// referring to a column 'year' in a resource called 'foo'
{
  source: { $ref: "foo/year" }
}

I'm not really proposing literally that syntax, but something along those lines?

pwalsh commented 9 years ago

@stevage ok, so you are essentially collapsing [2] and [3] in the same implementation, where, we prefer an expression rather than a literal, due to the additional overhead that allowing literals could introduce. I think I agree. It still gives the big win of a value that is not described in the physical model.

Slightly adapting the example from JSON Schema, and simplifying it here, we'd have to special-case #, to indicate a reference to the data package itself, and do:

{
  source: "#fiscalPeriod"
}

{
  source: "foo/year"
}

rufuspollock commented 9 years ago

Hmmm, we're starting to get quite complex and I'm always wary of premature generalization (because I done it a lot ;-) - and regretted it later).

I do agree with @stevage that we can't overload the source field and probably need to use a different field or have something special. Suggestions welcome there :-)

danfowler commented 9 years ago

@rgrp I agree that this is starting to get quite complex.

I don't think we should support the case of referring to a top-level descriptor from a measure because I think we're already coming to some agreement as to what belongs at the top-level as metadata and what belongs on the measure. See: https://github.com/openspending/fiscal-data-package/issues/35

As far as referring to raw vales, the Data Package format, of which Fiscal Data Package is a superset, already has a method of specifying inline data in a resource in the datapackage.json: http://dataprotocols.org/data-packages/#inline-data . In the case of filling a missing year, perhaps a pattern we can use here is to add an additional resource like so:

{
  "name": "year",
  "format": "json",
  "data": [
     { "year": 2015 }
  ]
}

And map to that as usual from within the mapping object. What do you think of this @pwalsh @stevage?

pwalsh commented 9 years ago

Following the inline data pattern from the Data Package spec is much more complex :).

It is a significantly pumped up version of what I was suggesting, which was just to map to a single value that applies for all budget lines, and it opens up new space for error, for example, a data array of a different length to the array of values in the primary resource being mapped against.

danfowler commented 9 years ago

@pwalsh my thinking with this proposal was that, in the implementation, single row, "virtual" tables like this could be mapped to all budget lines: e.g.

date

year
2005

budget

name	amount
Thing 1	1000
Thing 2	2000

SELECT * FROM budget, date would yield:

name	amount	year
Thing 1	1000	2005
Thing 2	2000	2005

stevage commented 8 years ago

So, summarising the dilemma:

if we're trying to embed (a tiny piece of) data in a data package, it makes sense to use use the inline-data mechanism
but that's a very powerful mechanism, which creates complexity
so can we support only a very narrow subset of it?

How about this:

{
  "name": "constants",
  "format": "json",
  "data": [
     { "year": 2015 }
  ]
}

By restricting the name to constants, I think we're making clear that we're only allowing it for a very narrow use case, we don't want arrays, and it also solves the problem implicitly raised that we have two things to name (the whole resource, and the data attribute itself).

stevage commented 8 years ago

Of course, what this doesn't solve is if we have several data files (resources) and they each want their own constant.

A simpler approach?

Maybe the saner approach would just be to define constants the resource: a number of virtual columns with constant values:

{
  "resources": [
    {
      "path": "data.csv",
      "schema": {
        "fields": [
...
          {
            "name": "year",
            "type": "number",
            "constant": 2015
          }
        ]
      }
    }
  ]
}

What do you think? It seems pretty simple and is hopefully is easy to implement. You just end up with one extra column in the physical model, and you can do what you like with it in the logical model - no need for special syntax etc.

danfowler commented 8 years ago

Thanks so much for the input @stevage! I think we've had a pretty fruitful discussion here, and I'd like to finally come to some resolution.

So! Given that:

This is a really common pattern for datasets
There is value in exposing repeated data like this through a common source (e.g. mapping) per @pwalsh
Overloading "source" seems like a bad idea per @rgrp

I would like to propose that we follow @stevage's suggestion of adding a constant key. The amendment I make to @stevage suggestion is that it be located in a mapping:dimension:field object. While there is some conceptual clarity behind keeping all source data for the mapping in the resources object (which is why I made the suggestion for the inline-data approach), I do want to keep the definition and behaviour of resources as it is described in the Data Package and Tabular Data Package specifications.

So, concretely, the suggestion is that a field object MUST contain name and either a source or constant key. So, to take the earlier example from @pwalsh:

# map directly to a value. take this value for each budget line
{
  ...
  "fiscalPeriod": 2015,
  "resources": [
    {
      "name": "budget",
      "title": "Budget",
      "path": "budget.csv",
      "schema": {
        "fields": [
          {
            "name": "expenditure",
            "type": "number",
            "format": "currency"
          }
        ]
      }
    }
  ],
  "mapping": {
    "measures": [
      {
        "name": "amount",
        "source": "expenditure",
        "currency": "USD",
        "factor": 1
      }
    ],
    "dimensions": [
      {
        "name": "date",
        "fields": [
          {
            "name": "year",
            "constant": "2015"
          }
        ]
      }
    ]
  }
}

WDYT? @pwalsh @rgrp @pudo @stevage

pwalsh commented 8 years ago

+1 but "constant": 2015 in this case.

stevage commented 8 years ago

It's fine by me. I still slightly prefer putting it in resources but:

, I do want to keep the definition and behaviour of resources as it is described in the Data Package and Tabular Data Package specifications.

seems fair enough. I can't see a specific rule that this would violate, but it does seem to be an unstated assumption in JSON table schema that the schema describes actual fields in the table. If we started getting creative here, who knows what might break.

danfowler commented 8 years ago

:+1:

stevage commented 8 years ago

Hmm, I'm just coming back to this again while attempting to actually write an FDP for a real dataset.

Am I misunderstanding, or does the final proposal here actually limit us to constants that apply to the entire dataset? And if so, it only solves this case (#2 from way back in this issue):

amount
1000

and doesn't solve this case at all (#3):

amount_2014, amount_2015
1000,1200

Am I getting this right? If so, I'm sorry that I didn't notice earlier.

I guess, now that I think about it more deeply, for FDP to support this "multiple years of data in different columns" format, it would have to:

Allow each measure to specify a year; and
Allow an "implicit dimension" to somehow be specified.

That is, to say "There is a dimension called year, but it's not represented as an actual column of data. Instead, you can find its values by looking for the field year on all the measures." In essence, we're trying to normalise the data on the fly, by stacking a series of resource columns into a single measure, and create an associated dimension with as many values as there were resource columns.

I'm pretty tempted at this point to suggest that either:

We stop trying to solve general cases, and explicitly just handle years as the only dimension of this type that we care about; or
We don't try to support this format of data at all. (Or possibly look at handling data of this type as a conversion process during import).

I'm still holding out against 2, so, could I suggest something simpler like:

"mapping": {
    "measures": [
      {
        "name": "amount",
        "sources": [ 
          "amount_2014": { "year": 2014 },
          "amount_2015": { "year": 2015 }
        ],
        // do we need an explicit reference to the dimension `date` here?
      },
    ],
    "dimensions": [
      {
        "name": "Budget year",
        "dimensionType": "year", // a new type
        "values": [2014, 2015]
      }
    ]

Hopefully this satisfies @rgrp's desire to avoid premature generalisation, because it handles this one use case very well, and doesn't really try to handle anything else.

Sorry for opening the can of worms back up!

rufuspollock commented 8 years ago

@stevage I hear you but my sense as per earlier comment is to draw the line at the denormed year case - i.e. where each year as a column.

Instead, we require that the users normalize a bit before FDP'ing.

To solve denormed years in FDP would require getting quite complex and additional complexity in the consuming software. As such, I'd vote this is one where the data has to be reworked to be described properly by FDP.

Open to discussion on this one but I emphasize the general philosophy that we aren't just about describing arbitrary data but striking a balance between simplicity for publishers and simplicity for consuming tools. This is one where, ultimately, I'd suggest the publisher do a bit of transformation.

danfowler commented 8 years ago

@stevage here's an example of the kind of required normalization @rgrp is talking about: http://fiscal.dataprotocols.org/examples/transform-needed/

stevage commented 8 years ago

Sure. I feel strangely tempted to come up with a way in JSON to describe those transformations, and then embed those in the data package :) But then you end up with "first-class" and "second-class" FDP citizens. The first-class are normalised and easy for anyone to work with, while the second-class ones require a transformation step to turn into a first-class FDP.

If FDP doesn't support these denormalised datasets (and just to be clear, in my very limited experience working with financial datasets, I have come across these from two independent Australian government sources: the federal treasury, and the ABS), then I'm going to have to look at implementing that transformation anyway. And if I'm going to do that, I'm going to script it, which means that script could be part of whatever the new data import process is anyway... :)

openspending / fiscal-data-package

Datasets with no date column #59

A simpler approach?