rufuspollock / ideas

Ideas for (tech) stuff to research, build or work on.
https://rufuspollock.com/

One big table for all datasets #93

Open ppKrauss opened 6 years ago

ppKrauss commented 6 years ago

Today PostgreSQL 9.5+ offers the JSONB datatype and a complete toolkit for JSON manipulation... So we can take all the datasets of an "ecosystem of datasets", like the Data Packaged Core Datasets, and put them all together.

Expressing with SQL:

CREATE SCHEMA dataset;

CREATE TABLE dataset.meta (
    id serial PRIMARY KEY,
    locname text,      -- local name of the dataset
    kx_fields text[],  -- if needed, a cache of the field names...
    kx_types text[],   -- ...and their datatypes
    info JSONb         -- all pieces of datapackage.json go here!
);

CREATE TABLE dataset.big (
  id bigserial NOT NULL PRIMARY KEY,
  source int NOT NULL REFERENCES dataset.meta(id) ON DELETE CASCADE, -- ref. to metadata
  key text,  -- the dataset's primary key (cast to text); optional (can also be a hash)
  c JSONb CHECK (jsonb_array_length(c) > 0), -- a CSV line cloned as a JSON array
  UNIQUE (source, key)
);

The dataset.meta.info column holds the Frictionless Data descriptor for tabular data (datapackage.json), and each CSV line goes into dataset.big.c as a JSON array, the "row arrays" format.
PS: all CSV lines are stored there in a fast and compact JSON representation, with no datatype translation (JSON strings are CSV strings, JSON numbers are CSV numbers, etc.).
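
A minimal sketch of how the two tables work together (the dataset name, fields, and values below are made up for illustration):

-- Register a hypothetical "country-codes" dataset: its datapackage.json goes in meta...
INSERT INTO dataset.meta (locname, kx_fields, kx_types, info) VALUES (
  'country-codes',
  ARRAY['name','code'],
  ARRAY['string','string'],
  '{"name":"country-codes","resources":[{"path":"data/country-codes.csv"}]}'::jsonb
);

-- ...and each CSV row goes into big as a JSON array, keyed by the dataset's primary key.
INSERT INTO dataset.big (source, key, c)
  SELECT m.id, v.key, v.c
    FROM dataset.meta m,
         (VALUES ('BR', '["Brazil","BR"]'::jsonb),
                 ('DE', '["Germany","DE"]'::jsonb)) AS v(key, c)
   WHERE m.locname = 'country-codes';

-- Individual fields come back by array position:
SELECT c->>0 AS name, c->>1 AS code FROM dataset.big;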

rufuspollock commented 6 years ago

@ppKrauss good suggestion. Can you detail a bit more what this would allow someone to do - what the user story would be, as it were?

ppKrauss commented 6 years ago

Hi @rufuspollock, thanks. I am trying it out at https://github.com/datasets-br/sql-unifier

PS: PostgreSQL 9.3+ FOREIGN TABLEs have good performance with "big data", so this will be fast and reliable with any large CSV.
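
A sketch of that ingestion path using the stock file_fdw wrapper; the server name, file path, and column list are assumptions for illustration:

CREATE EXTENSION file_fdw;
CREATE SERVER csv_files FOREIGN DATA WRAPPER file_fdw;

-- Expose the raw CSV as a queryable table, without copying the file:
CREATE FOREIGN TABLE country_codes_src (name text, code text)
  SERVER csv_files
  OPTIONS (filename '/tmp/country-codes.csv', format 'csv', header 'true');

-- Bulk-load it into the unified table as JSON row arrays:
INSERT INTO dataset.big (source, key, c)
  SELECT m.id, s.code, jsonb_build_array(s.name, s.code)
    FROM country_codes_src s, dataset.meta m
   WHERE m.locname = 'country-codes';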

rufuspollock commented 6 years ago

@ppKrauss got it - can you provide a user story like As a X I want to Y so that Z?

ppKrauss commented 6 years ago

Hi @rufuspollock, thank you for the invite... I have some difficulty with English. Please check whether this is what you want:

Title: SQL dataset unifier.

Abstract: As datasets are scattered across isolated repositories,
I want to put them all together into one big SQL table, modeling all the CSVs as JSON arrays and also offering them as the usual SQL VIEWs,
so that I can preserve the datasets (intact) in PostgreSQL and do JOINs, filters, and all the usual SQL operations.
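
A minimal sketch of what "offering them as the usual SQL VIEWs" could look like (the view and column names are illustrative):

CREATE VIEW vw_country_codes AS
  SELECT b.key, b.c->>0 AS name, b.c->>1 AS code
    FROM dataset.big b
    JOIN dataset.meta m ON m.id = b.source
   WHERE m.locname = 'country-codes';

-- Plain SQL filters and JOINs now apply as usual:
SELECT * FROM vw_country_codes WHERE code = 'BR';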


PS: another way to summarize the motivation is to cite csvkit, section 3:

Sometimes (almost always), the command-line isn’t enough. It would be crazy to try to do all your analysis using command-line tools. Often times, the correct tool for data analysis is SQL.


Perhaps a WHAT/WHY list will offer better clues about the idea and its context... But should it be a wish list, or a list from the implemented proof of concept? Here are some more real-life examples.