Documentation of output format and API

rufuspollock commented 8 years ago

How does this library work and what is the structure of the output files?

pudo commented 8 years ago

The idea is to have two input files, a CSV (or other tabular) file and a YAML document that would contain a specification for a) the logical model (i.e. what columns belong together, which ones are measures), b) a database config (if the result isn't supposed to be in-memory) and c) information on which aggregates to generate from the data. The latter is done by specifying a set of filters (e.g. to separate files per year, or for expenditure/revenue), and a set of hierarchies (basically: "break down by cofog1, followed by cofog2, followed by cofog3").

An example of this spec is here. Running the script with a skeleton spec file will auto-generate the rest and print it.

$ cubepress test.yaml

I don't think I ever got around to implementing the output format properly, but it does spit out JSON files which are named after a hashed version of the query used to generate the data. I think that a metadata JSON file and a set of CSV data files might be a better option.

Finally, I really think that using babbage would allow this to run basically without any SQL in this package. I'd encourage you to at least have a quick look at the cube API. The spec format for the cubepress and babbageOLAP models is pretty much identical.

There's also a matured version of the table parsing and loading code here.

rufuspollock commented 8 years ago

@pudo really useful. I agree that output as CSV rather than (or in addition to) JSON is probably better.

babbage looks really interesting and, again, spot on in terms of the various options for computing aggregates I've been thinking about:

flat-file plus scripting (or even map reduce)
load to SQL (OLAP style) and do queries
load to Redshift (sort of SQL) and do queries
load to BigQuery and do queries
....

As i understand cubepress is basically option 1 and babbage is option 2.

Right now I think one useful thing - and the purpose of the post - was to agree an output format for flat-file aggregates that could be consumed by the viz layer (or views generally) - that was the logic of @danfowler's post https://discuss.okfn.org/t/openspending-next-visualization-experiment/886. This spec could then be common to all of the computation methods. On this I think CSV + (optionally) some kind of JSON descriptor seems a great way to go.

pudo commented 8 years ago

Agreed, metadata is everything; the value of having a common spec would be enormous. Without common-format metadata, it will also be hard for any visualisation to transfer across aggregates, much less datasets. At the moment, the only visible effect of the lack of metadata is that the treemap shows codes instead of labels, but this would get worse much more quickly if you wanted to add multiple levels, customisable filters etc.

As for the computing options, I fail to see much value in solving (1) independently of (2): it can easily be expressed as an in-memory SQLite query, which saves you the hassle of having to implement bad, ad-hoc aggregation code. Even (3) can be done as (2) quite nicely.

So what I'm advocating is using cubepress basically as a query generator for babbage or cubes, which freezes the results and generates extra result metadata. The advantage is that you implement the core aggregation logic exactly once, and your models will apply perfectly for both scenarios at the same time.

rufuspollock commented 8 years ago

Agreed re just doing sql on this - it makes a lot of sense.

What is the current output format of cubes or babbage? That's important because that is what visualization would run off and so if we have that specced we can develop viz work and aggregation in tandem with the two kept separate via the API defined in the aggregation output.

pudo commented 8 years ago

babbage output is very much oriented after cubes, calling the Cube.aggregate library function will return something that is equivalent to a cubes response and vaguely resembles an OpenSpending APIv2 aggregate. It's also very easy to convert the included cells to CSV.

rufuspollock commented 8 years ago

@pudo can you link the cubes output documentation.

I'm also inlining that output for ref:

{
aggregates: [
  "_count",
  "suma.sum"
],
attributes: [
  "tipul.label"
],
page_size: 10000,
summary: {
  _count: 138477,
  suma.sum: 39809696414.5195
},
order: [ ],
cells: [
  {
    _count: 45,
   suma.sum: null,
   tipul.label: null
  },
  {
    _count: 108,
    suma.sum: 0,
    tipul.label: "Anulat din lipsa finanţării"
  },
],
  cell: [ ],
  total_cell_count: 14,
  status: "ok",
  page: 1
}

pudo commented 8 years ago

There you go: https://cubes.readthedocs.org/en/latest/server.html#aggregation-and-browsing

rufuspollock commented 8 years ago

@pudo thank-you (i'd poked around but wanted to be sure). Does babbage and cubepress follow this exactly - they look to (which is great).

openspending / cubepress

Documentation of output format and API #3