secretGeek / csvz

The hot new standard in open databases
Creative Commons Zero v1.0 Universal

csvz

csvz is the hot new open database standard that is taking the entire technological world by storm.

A csvz file is literally just a bunch of csv files, in a zip file, that has been renamed to have a ".csvz" file extension.


Are you using csvz? Why not? csvz is the brave technology that unites the worlds of data science, sql and no-sql. Is it no-sql's answer to the rdbms? Or is it the rdbms's answer to no-sql? You decide.




The csvz specification

The csvz specification is broken into meaningful fragments.

Files can call themselves csvz-compliant if they only comply with the first fragment of the specification, csvz-0.

They can also indicate other fragments of the specification that they have implemented, such as csvz-meta-tables, csvz-meta-relations etc.


csvz-0 A csvz file is literally just a bunch of csv files, in a zip file with a file name that ends with ".csvz"

A csvz file is compliant with csvz-0 if it is literally just a bunch of csv files, in a zip file, that has been renamed to have a ".csvz" file extension.
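As a rough illustration of csvz-0, the sketch below packs a folder of csv files into a zip archive whose name ends with ".csvz", using only Python's standard library. The function name `pack_csvz` and the choice to walk the folder recursively are assumptions made for this example; the spec does not prescribe any particular tool or layout for producing the archive.

```python
import zipfile
from pathlib import Path


def pack_csvz(csv_dir: str, csvz_path: str) -> list[str]:
    """Zip every .csv file under csv_dir into csvz_path; return the member names.

    Hypothetical helper for illustration, not part of the csvz spec.
    """
    if not csvz_path.endswith(".csvz"):
        raise ValueError("a csvz file name must end with '.csvz'")
    root = Path(csv_dir)
    members = []
    with zipfile.ZipFile(csvz_path, "w", compression=zipfile.ZIP_DEFLATED) as archive:
        for path in sorted(root.rglob("*.csv")):
            # Store each file under its path relative to the root, with forward slashes.
            name = path.relative_to(root).as_posix()
            archive.write(path, arcname=name)
            members.append(name)
    return members
```

Files that do not end in ".csv" are simply left out, so the result stays "just a bunch of csv files, in a zip file".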

(Note that each fragment has a fragment identifier written at the beginning of the fragment. For example this is csvz-0 and the next fragment is csvz-meta-tables. Fragments are optional, but it is good to know which fragments you do or do not comply with.)

The csv files themselves should be parseable with most csv reading software.

(Anywhere that this spec refers to "a csv file" it means a file that complies with RFC 4180 or a compatible dialect as described by the CSV on the Web Working Group, unless a stricter definition is explicitly given.)

(Anywhere that the csvz specification refers to "this spec" it means the csvz specification.)
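Because a csvz-0 file is only a zip of csv files, ordinary zip and csv libraries are enough to read one. The following is a minimal sketch under two assumptions that the spec does not state: the members are UTF-8 encoded, and Python's default csv dialect can parse them. The name `read_csvz` is hypothetical.

```python
import csv
import io
import zipfile


def read_csvz(csvz_path: str) -> dict[str, list[list[str]]]:
    """Return {member name: rows} for every .csv member of the archive."""
    tables = {}
    with zipfile.ZipFile(csvz_path) as archive:
        for name in archive.namelist():
            if not name.lower().endswith(".csv"):
                continue  # skip directory entries and non-csv members
            with archive.open(name) as raw:
                # Assumption: UTF-8 text. newline="" leaves line-ending
                # handling to the csv module, as its documentation advises.
                text = io.TextIOWrapper(raw, encoding="utf-8", newline="")
                tables[name] = list(csv.reader(text))
    return tables
```

Every value comes back as a string, since csvz-0 says nothing about data types.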


csvz-meta-tables A csvz file can contain a file called tables.csv describing the contents of the file

Metadata about the contents of the csvz file is contained in a directory called "_meta". The file tables.csv, if present, is inside this directory.

(Assume that the csvz specification reserves the right to create other .csv files under the _meta folder, and to create more folders under it. Details appear in subsequent spec fragments.)

The file tables.csv contains metadata about all of the csv files included in the csvz file.

(The file tables.csv is a csv file.)

(Anywhere that this spec refers to a file with a name that ends with ".csv" it means the file is a "csv file", as described in csvz-0.)
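A reader that supports csvz-meta-tables might start by looking for "_meta/tables.csv" inside the archive, as in the sketch below. It assumes UTF-8 text and that the first row of tables.csv is a header row; it deliberately does not assume any particular column names. `read_tables_meta` is a hypothetical name.

```python
import csv
import io
import zipfile
from typing import Optional

TABLES_META = "_meta/tables.csv"  # location described by csvz-meta-tables


def read_tables_meta(csvz_path: str) -> Optional[list[dict[str, str]]]:
    """Return the rows of _meta/tables.csv as dicts, or None if it is absent."""
    with zipfile.ZipFile(csvz_path) as archive:
        if TABLES_META not in archive.namelist():
            return None  # the file is optional
        with archive.open(TABLES_META) as raw:
            text = io.TextIOWrapper(raw, encoding="utf-8", newline="")
            # Assumption: the first row is a header. Keys are taken from it;
            # no column names are hard-coded here.
            return list(csv.DictReader(text))
```

Returning None rather than an empty list keeps "no tables.csv" distinguishable from "a tables.csv with no rows".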

The file tables.csv meets the following description:

(The word "must" is used for parts of the specification that are required for a file or tool to claim compliance with the standards described in this spec. The word "may" is used for parts which are not required. Optional parts may be covered in more detail, as required elements, in a subsequent fragment of this spec.)

(Whenever suggestions are provided, they are not required for conformance with the current spec fragment. These suggestions may be described more fully in later spec fragments, in which they may be required.)

(Expectations around the encoding of true/false values, and other fundamental data-types, are not currently defined.)


csvz-meta-columns A csvz file can contain a file called columns.csv

Metadata about the contents of the csvz file is contained in a directory called "_meta". The file columns.csv, if present, is inside this directory.

The file columns.csv contains metadata about all of the columns in all of the csv files included in the csvz file.

The file columns.csv meets the following description:

(The word "should" is used for parts of the specification that are not required, but which will lead to difficulty for users of the data or the tools if they are not complied with.)


csvz-meta-relations A csvz file can contain a file called relations.csv

Metadata about the contents of the csvz file is contained in a directory called "_meta". The file relations.csv, if present, is inside this directory.

The file relations.csv contains metadata about all of the relationships between any of the columns in any of the files in the csvz file.

The file relations.csv meets the following description:


csvz-meta-csv A csvz file can contain a file called csv.csv

(todo: this section is still very much a draft)

Metadata about the rules of the csvz file is contained in a directory called "_meta". The file csv.csv, if present, is inside this directory.

The file csv.csv contains metadata about how the csv files in this csvz file are formatted, from a general csv standards point of view.

(Later spec fragments will give exact definitions for the expected columns and supported columns, their possible values and the meanings of those values.)

But to comply with csvz-meta-csv the file csv.csv must:

(todo: See also csvw dialect descriptions)


csvz-meta-per-file The ability to include individual meta-files per csv file

This fragment extends all other csvz-meta-* fragments.

Consider an example where a single csv file, people.csv inside the csvz follows different standards to the other files.

Its csv conventions could be described in a file, _meta/csv/people.csv, and those would be taken to override the conventions in _meta/csv.csv

Similarly, a file can have its own _meta/tables/{filename}.csv file, _meta/columns/{filename}.csv and _meta/relations/{filename}.csv.

This method can be assumed to extend to all other _meta/*.csv files.

A per-file meta file is assumed to have higher precedence than the files directly contained in _meta/*.csv.

For example: if _meta/columns.csv described the columns of states.csv in one way, but _meta/columns/states.csv described those columns in another way, all details for states.csv in _meta/columns.csv should be ignored, and those in _meta/columns/states.csv used instead (i.e. they are not combined).

(Note - combining might be more interesting, useful. Would let you build up/inherit attributes. But would also need a way to "erase" a rule, and I can't think of a way to do that so let's stick with "no combining")

(Suggestion for authors of tooling that reads these files: they may want to provide optional debug information that describes where metadata was sourced from, highlighting situations where precedence rules needed to be applied.)

You can also mix and match _meta/*.csv with per-file meta information, without loss of meaning.

For example, the table states.csv may be described in _meta/tables.csv while its columns may be described in _meta/columns/states.csv
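The precedence rule in this fragment can be sketched as a small lookup: given the list of names in the archive, pick the per-file meta file if it exists, otherwise fall back to the shared one, and never merge the two. The function name `resolve_meta_source` is hypothetical, and the sketch assumes `filename` is the name of a csv file at the top level of the archive.

```python
from typing import Iterable, Optional


def resolve_meta_source(member_names: Iterable[str], kind: str, filename: str) -> Optional[str]:
    """Return the archive path whose metadata applies to `filename`, or None.

    kind is the meta file's base name, e.g. "tables", "columns",
    "relations" or "csv".
    """
    per_file = f"_meta/{kind}/{filename}"
    shared = f"_meta/{kind}.csv"
    names = set(member_names)
    if per_file in names:
        return per_file  # per-file meta wins outright; nothing is combined
    if shared in names:
        return shared
    return None


names = [
    "states.csv",
    "_meta/tables.csv",
    "_meta/columns.csv",
    "_meta/columns/states.csv",
]
print(resolve_meta_source(names, "columns", "states.csv"))    # _meta/columns/states.csv
print(resolve_meta_source(names, "tables", "states.csv"))     # _meta/tables.csv
print(resolve_meta_source(names, "relations", "states.csv"))  # None
```

Returning the path, rather than the parsed rows, also gives tooling an easy way to report where each piece of metadata came from.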


Unwritten meta fragments

More meta-* spec fragments may be needed to describe other meta files.

For example:

A list of csvz-compliant Tools and Libraries

The following tools and libraries are able to read, write or process .csvz files.

| Tool | Actions | Compliance | Description |
|------|---------|------------|-------------|
| Sylvan.Data.CsvZip | Create / Read | csvz-0, csvz-meta-tables, csvz-meta-columns | Library for programmatically creating and reading .csvz files |
| Sylvan.Tools.CsvZip | Create | csvz-0, csvz-meta-tables, csvz-meta-columns | .NET global tool for creating .csvz files from the command line |

Packs a set of csv files into a new csvz file, and generates a tables.csv and columns.csv
Converts a .csvz file into a .xlsx file, that can be opened by Excel.
Converts a .xlsx file into a .csvz file (note that not all of Excel's features are respected.)
Exports a sqlite database into a new .csvz file
Creates a new sqlite database from a .csvz file
Exports a mysql database into a new .csvz file
Creates a new PostgreSQL database from a .csvz file
Exports a PostgreSQL database into a new .csvz file
Save a JSON file as a series of csv files and _meta files (ready for zipping)
Load some or all of an unzipped csvz as a single json object (limited filtering ability)
Validates which spec fragments a csvz file complies with
(More tools...)

If you know of a csvz compliant tool, or you have created one (hint hint), a pull request is welcome.

Suggestion: You can use existing csvz or csv libraries to build a new type of connection (e.g. A tool to create/read csvz files from an Oracle database, using existing libraries, would take some Oracle knowledge, and not much else.)
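To show how little is needed for such a connector, here is a minimal sketch that loads a csvz file into a sqlite database using only Python's standard library. It is an illustration under stated assumptions, not a compliant tool: it assumes UTF-8 text, treats the first row of each csv file as a header, creates every column as TEXT, ignores the _meta folder entirely, and expects every data row to have as many fields as the header. The names `csvz_to_sqlite` and `quote_ident` are hypothetical.

```python
import csv
import io
import sqlite3
import zipfile
from pathlib import PurePosixPath


def quote_ident(name: str) -> str:
    """Quote an SQLite identifier, doubling any embedded double quotes."""
    return '"' + name.replace('"', '""') + '"'


def csvz_to_sqlite(csvz_path: str, conn: sqlite3.Connection) -> list[str]:
    """Create one all-TEXT table per top-level csv member; return the table names."""
    created = []
    with zipfile.ZipFile(csvz_path) as archive:
        for name in archive.namelist():
            path = PurePosixPath(name)
            # Skip anything inside a folder (such as _meta) and anything that is not csv.
            if len(path.parts) != 1 or path.suffix.lower() != ".csv":
                continue
            with archive.open(name) as raw:
                text = io.TextIOWrapper(raw, encoding="utf-8", newline="")
                rows = list(csv.reader(text))
            if not rows:
                continue  # an empty file has no header to build a table from
            header, data = rows[0], rows[1:]
            table = quote_ident(path.stem)
            columns = ", ".join(f"{quote_ident(col)} TEXT" for col in header)
            conn.execute(f"CREATE TABLE {table} ({columns})")
            placeholders = ", ".join("?" for _ in header)
            # Assumption: every data row has as many fields as the header.
            conn.executemany(f"INSERT INTO {table} VALUES ({placeholders})", data)
            created.append(path.stem)
    conn.commit()
    return created
```

A fuller tool would presumably read _meta/columns.csv and _meta/relations.csv to choose column types and keys, but those details depend on spec fragments not reproduced here.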


Contribute

To experience the fun of contributing, see Contributing

Contributors definitely include people who raise issues. Raising issues is the quickest way to contribute. Also look for issues marked good first issue or help wanted.

A community forum for discussion and ideas among implementors and tool builders is much needed. Follow issue #14 to find out where the community will be built.

License

CC0

To the extent possible under law, Leon Bambrick has waived all copyright and related or neighboring rights to this work.


Some ideas are too smart to live; other ideas are too dumb to die.