openspending / babbage

A light-weight, OLAP-style analytical engine for Postgres (focused on OpenSpending)
MIT License

Support multiple tables / star schema #3

Open pudo opened 8 years ago

pudo commented 8 years ago

Right now, only a single fact table is supported, and aggregations of dimensions will assume that all columns for a given dimension are perfectly denormalised. The application should support JOINs for a star schema instead.
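
Roughly, the difference in query terms (a sketch only; table and column names below are invented for illustration):

```sql
-- Today: one denormalised fact table, with dimension attributes
-- repeated on every row.
SELECT functional_class_label, SUM(amount)
FROM fact_table
GROUP BY functional_class_label;

-- With a star schema, the same aggregation would JOIN a dimension
-- table instead of reading a denormalised column:
SELECT d.label, SUM(f.amount)
FROM fact_table f
JOIN dim_functional_class d ON d.id = f.functional_class_id
GROUP BY d.label;
```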

pwalsh commented 8 years ago

@akariv first we need to understand if we need this. Assigning to you to investigate as part of general deep dive on babbage.

pudo commented 8 years ago

It might be useful to distinguish between being able to load from multiple resources and keeping multiple tables for dimensions internally. The two are almost certainly a "false friend" equality (i.e. they look like the same thing but in practice turn out not to be).

pwalsh commented 8 years ago

@pudo totally agree. It seems to me that single resources, in FDP-speak, could (should?) load into multiple tables (star schema).

I definitely do not want babbage to change to reflect the various modelling strategies in an FDP mapping: rather, the loader should work with the mapping to write to a consistent (normalised?) babbage model.

I guess you are currently in the best position to advise on a way forward around that.

pudo commented 8 years ago

There was an interesting thought from @stiivi (also relevant to #4): he proposed loading the source data facts into a JSON column on the fact table (i.e. as-is), and then generating the actual columns and star schema through PostgreSQL commands (i.e. inside the DB) as the model is defined.

Otherwise you end up in that spot where each time someone changes the label for an attribute, you have to re-load 900 MB of source CSV.
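
A rough sketch of what that could look like in Postgres (hypothetical, all names invented):

```sql
-- Keep the raw facts in a JSONB column, loaded once, as-is.
CREATE TABLE fact_table (
    id     bigserial PRIMARY KEY,
    source jsonb NOT NULL  -- the original CSV row
);

-- When the model maps an attribute, materialise it as a real column
-- inside the DB:
ALTER TABLE fact_table ADD COLUMN amount numeric;
UPDATE fact_table SET amount = (source->>'amount')::numeric;

-- Changing the mapping later only touches the DB, not the source CSV:
ALTER TABLE fact_table DROP COLUMN amount;
ALTER TABLE fact_table ADD COLUMN amount numeric;
UPDATE fact_table SET amount = (source->>'adjusted')::numeric;
```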

pwalsh commented 8 years ago

I'm going to backlog this now. We don't need this now, and while there are great ideas in here, let's focus on our core tasks first.

larjohn commented 8 years ago

I am not sure if now is the time to pop this out of the backlog.

We are interested in using such a feature in openbudgets.eu. Other than the backend implementation, on which I can't offer any opinion yet, what is the suggested way to support this API-wise?

Would it be a call to the aggregates/facts etc endpoints without a cube path part?

akariv commented 8 years ago

I would say that from an API point of view, nothing needs to change. The API should present the user with a denormalized data table, while using normalized tables in the backend.
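
For example (just a sketch of the idea, not babbage's actual implementation; names invented), the backend could expose a denormalised view over the normalized tables, so the query layer keeps seeing a single flat table:

```sql
CREATE VIEW fact_denormalised AS
SELECT f.amount,
       d.label AS functional_class_label
FROM fact_table f
JOIN dim_functional_class d ON d.id = f.functional_class_id;
```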

larjohn commented 8 years ago

Yes, but how do you imagine a call to this service? This call:

http://next.openspending.org/api/3/cubes/405f5e1681d4b6673fbfc0ceb196fe47:boost-armenia-2006-2014/aggregate?drilldown=activity.program&pagesize=30&cut=classification.econ_func%3A%22Economic%22&order=adjusted.sum%3Adesc

already contains the cube name. Will "cube-less" queries be supported somehow?

pudo commented 8 years ago

Just out of interest: what are the semantics of a cube-less call? I imagine it would be quite hard to properly aggregate budgetary data from even slightly different origins. If we're talking RDF here: wouldn't it be less dangerous for a data integration tool to actually materialise merged cubes with well-defined semantics? (e.g. "here's the merged cube of all cohesion fund data sub-cubes")
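
Something along these lines, purely illustrative (cube and column names made up):

```sql
-- A materialised "merged cube" with explicit, well-defined semantics:
-- union the facts of compatible sub-cubes and record their origin.
CREATE MATERIALIZED VIEW cohesion_fund_merged AS
SELECT 'cube_a' AS source_cube, amount, functional_class_id
FROM fact_cube_a
UNION ALL
SELECT 'cube_b' AS source_cube, amount, functional_class_id
FROM fact_cube_b;
```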

larjohn commented 8 years ago

Well, a cube-less call is a fictional, ideal call, intended to force us to explore alternative solutions.

We are indeed working with RDF at openbudgets.eu, but we do not want to suggest any direction for openspending based on our expectations from RDF.

This means we don't even care whether openspending's loader materialises the merged cubes. In rudolf, we have to tackle the same issues: determine which dimensions are similar, determine which values of those dimensions are similar (in order to support common cuts), and find a way to get normalized values (mostly currency-related normalization).

Regarding the dimensions: we can either cross our fingers and rely on dimensions inheriting from common parents (with values linked by some kind of similarity predicate), or go with CONSTRUCT queries, which do materialize the datasets into a new cube. Both approaches are implementable on our side; I was just wondering whether, API-wise, there would be some actual super-cube to query in the same way as the ordinary cubes.