cubepress and OpenSpending Aggregations

rufuspollock commented 8 years ago

Proposal is to make this the primary aggregator library for OpenSpending "next". Basic structure:

[ ] OpenSpending / Tabular Data Package + CSV => SQL
[ ] Spec for defining queries (SQL) + how those are "frozen" to disk (data freeze style)
[ ] Running of those queries
More Detail
1. CSV -> model -> OpenSpending (Tabular) Data Package
  - For aggregator work we can take this as a given
  - We will practice off BOOST data
2. OS DP + CSV -> SQL db (e.g. sqlite or postgres)
  - JTS-SQL comes in
  - What’s missing in TDP support + e.g. foreign keys
  - tabular data stream - csv to stream with casting
  - load that data
3. Define aggregates, compute, and “freeze” to disk
  - compute aggregates - run SQL
  - freeze -> json or sql
  - can use datafreeze for this all ...
    Example code

compute_aggregates(datapackage.json, aggregates.json)
    engine = sqlalchemy.engine(...)
    make_sql(engine, datapackage.json)
    load_data(engine, datapackage.json)
    for aggregates in aggregates.json:
        run_aggregate(...)
        save_results(...)  # output is aggregates/by-department.json or .csv

example data input layout on disk:

boost-armenia
    /aggregates/
    /aggregates.json -> specifies the aggregates to compute
    /viz/
        /treemap/index.html

Code snippets

# Loading a Data Package and getting a stream
package = TabularDataPackage(‘datapackage.json”)
for row in package.resources[0].stream:
   # rows are pre-cast to their correct types etc

make_sql:
    jts-kit.make_database

load_data():
    get_stream(datapackage.resources[0])
    for row in stream:
        engine.insert_row(...)

Example command line:

cubepress datapackage.json aggregate.json dbname # default to sqlite as database type

=> 
csv file:
amount, year, department

datapackage.json
“name”: “mydata”
    [
        {
                    “name”: “amount”
                    “type”: “number”
                }
]   

aggregates.json:
sql: SELECT SUM(amount), department FROM “mydata” WHERE year = 2013
file: by-department
format: csv

=>

/aggregates/by-department.csv

Initial steps

Write tests for compute_aggregates
Copy across BOOST data
Some links

https://github.com/okfn/dptools/blob/master/bin/load-postgresql.py

pudo commented 8 years ago

For reference (following discussion on IRC):

The input format for logical models in babbage is this, an example is here.
To generate queries, this is augmented using a spec like this. The filters section in this basicaly gives run-time variables which can be toggled by the user (see here) - but they could also be applied statically. The query spec isn't read by babbage, but by a client app (i.e. offenerhaushalt or cubepress).

rufuspollock commented 8 years ago

One question: what is the benefit of having JSON structured queries as opposed to raw SQL. I guess my assumption is one could have both and that JSON structured queries become SQL and are useful because you can create them from a visualization/aggregation editor.

pudo commented 8 years ago

Generating a query is one (important) aspect of that JSON, but not the only one. You also need metadata for the visualisation to know which data to pick, and which fields to apply to which visual aspect in the visualisation.

Beyond that, the filter are also used to generate distinct value sets so that the resulting web site can have drop-downs.

You can hard-code all of that, but not having a clean model just ends up causing you more work in each specific instance.

Just for fun, check out the site-config generated from the view spec in OffenerHaushalt. It's a JSON blob in the frozen HTML.

rufuspollock commented 8 years ago

DataFreeze docs: https://dataset.readthedocs.org/en/latest/freezefile.html

openspending / cubepress

cubepress and OpenSpending Aggregations #4

More Detail

Example code

Code snippets

Initial steps

Some links