ppKrauss opened this issue 7 years ago
@ppKrauss good suggestion. Can you detail a bit more what this would allow someone to do - what the user story would be as it were.
Hi @rufuspollock, thanks. I am working on this at https://github.com/datasets-br/sql-unifier
PS: the PostgreSQL 9.3+ FOREIGN TABLE has good performance with "big data", so it will be fast and reliable with any large CSV.
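A minimal sketch with the `file_fdw` extension (the file path, table name, and columns here are hypothetical, just for illustration):

```sql
-- Minimal sketch: expose a CSV file as a FOREIGN TABLE via file_fdw.
-- /tmp/cities.csv and its columns are hypothetical.
CREATE EXTENSION IF NOT EXISTS file_fdw;
CREATE SERVER csv_files FOREIGN DATA WRAPPER file_fdw;

CREATE FOREIGN TABLE cities_csv (
  name       text,
  country    text,
  population bigint
) SERVER csv_files
  OPTIONS (filename '/tmp/cities.csv', format 'csv', header 'true');

-- Query the CSV in place, with no import step:
SELECT name, population FROM cities_csv WHERE population > 1000000;
```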
@ppKrauss got it - can you provide a user story like As a X I want to Y so that Z?
Hi @rufuspollock, thank you for the invitation... I have some difficulty with English; please check whether this is what you want:
Title: SQL dataset unifier.
Abstract: As datasets are scattered across isolated repositories,
I want to put them all together into one big SQL table, modeling all CSVs as JSON arrays and also offering them as the usual SQL VIEWs,
so that I can preserve the datasets intact in PostgreSQL and do JOINs, filters, and all the usual SQL operations (see the SQL sketches further below).
PS: another way to summarize the motivation is to cite csvkit, sec. 3:

> Sometimes (almost always), the command-line isn’t enough. It would be crazy to try to do all your analysis using command-line tools. Often times, the correct tool for data analysis is SQL.
Perhaps a WHAT/WHY list would offer better clues about the idea and its context... But should it be a wish list, or a list drawn from the implemented proof of concept? Here are some more real-life examples.
Today PostgreSQL 9.5+ offers a JSON datatype and a complete "tool kit" for JSON manipulation... So we can take all the datasets of an "ecosystem of datasets", like the Data Packaged Core Datasets, and put them all together.
Expressing with SQL:
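(A minimal sketch of the proof-of-concept schema; the exact DDL in sql-unifier may differ.)

```sql
-- Sketch of the proof-of-concept schema; exact DDL in sql-unifier may differ.
CREATE SCHEMA dataset;

-- One row per dataset: its Frictionless Data package descriptor.
CREATE TABLE dataset.meta (
  id   serial PRIMARY KEY,
  info jsonb NOT NULL   -- the datapackage.json descriptor
);

-- One big table holding every CSV line of every dataset.
CREATE TABLE dataset.big (
  id     bigserial PRIMARY KEY,
  source int NOT NULL REFERENCES dataset.meta(id),  -- owning dataset
  c      jsonb NOT NULL  -- one CSV line as a JSON array (Row Arrays format)
);
```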
The `dataset.meta.info` column holds the Frictionless Data standard descriptor for the tabular data, and each CSV line goes into `dataset.big.c` as one row in Row Arrays format. PS: all CSV lines are there in a fast and compact JSON representation, with no datatype translation (JSON strings are CSV strings, JSON numbers are CSV numbers, etc.).
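To show the "usual SQL VIEWs" side, a hedged sketch of how one dataset could be exposed back as a typed view (the dataset name 'country-codes' and the column positions are hypothetical):

```sql
-- Sketch: expose one dataset from dataset.big as an ordinary typed view.
-- The dataset name 'country-codes' and the column positions are hypothetical.
CREATE VIEW vw_country_codes AS
  SELECT b.c->>0             AS name,       -- array element 0, as text
         b.c->>1             AS iso_code,   -- array element 1, as text
         (b.c->>2)::numeric  AS population  -- element 2, cast back to a number
  FROM dataset.big b
  JOIN dataset.meta m ON m.id = b.source
  WHERE m.info->>'name' = 'country-codes';

-- JOINs and filters now work as with any normal table:
SELECT name, population FROM vw_country_codes WHERE population > 1000000;
```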