jqnatividad / qsv

Blazing-fast Data-Wrangling toolkit
https://qsv.dathere.com
The Unlicense

Create a `.qsv` file format that is an implementation of W3C's CSV on the Web #1982

Open jqnatividad opened 4 months ago

jqnatividad commented 4 months ago

Currently, qsv creates, consumes and validates CSV files hewing closely to the RFC4180 specification as interpreted by the csv crate.

However, it doesn't allow us to save additional metadata - either about the CSV file itself (dialect, delimiter used, comments, DOI, URL, etc.) or about the data the file contains (summary statistics, data dictionary, creator, last updated, hash of the data, etc.).

The request is to create a `.qsv` file format that implements W3C's CSV on the Web specification (following the guidance at https://csvw.org) and stores schemata/metadata/data in the qsv file - not just the schema info, but summary and frequency statistics as well, plus a container for DCAT 3/CKAN package/resource metadata, etc.
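For a concrete sense of the CSVW side, a minimal metadata document for a hypothetical `fruits.csv` might look something like this sketch (the standard keys follow the W3C CSVW vocabulary; the `qsv:stats` block is a made-up extension showing where summary statistics could live, not part of the spec):

```rust
use serde_json::json;

fn main() {
    // Minimal CSVW (CSV on the Web) metadata for a hypothetical fruits.csv.
    // Standard keys follow http://www.w3.org/ns/csvw; "qsv:stats" is a
    // hypothetical extension block for embedded summary statistics.
    let metadata = json!({
        "@context": "http://www.w3.org/ns/csvw",
        "url": "fruits.csv",
        "dc:title": "Fruits",
        "tableSchema": {
            "columns": [
                { "name": "fruit", "titles": "fruit", "datatype": "string" },
                { "name": "price", "titles": "price", "datatype": "decimal" }
            ],
            "primaryKey": "fruit"
        },
        // Hypothetical, non-standard extension:
        "qsv:stats": {
            "rowCount": 3,
            "price": { "min": 1.5, "max": 3.0, "mean": 2.17 }
        }
    });
    println!("{}", serde_json::to_string_pretty(&metadata).unwrap());
}
```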

Doing so will unlock additional capabilities in qsv, qsv pro, Datapusher+ and CKAN.

It will also allow us to "clean up" and consolidate the "metadata" files that qsv creates - the stats cache files, the index file, etc. - and package up the CSV and its associated metadata in one container as a signed zip file.

It will also make "harvesting" and federation with CKAN easier and more robust as all the needed data/metadata is in one container.
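A very rough sketch of assembling such a container, assuming the `zip` crate's 0.6-style API (the member file names are placeholders, not a settled layout):

```rust
use std::io::Write;
use zip::write::{FileOptions, ZipWriter};

fn write_qsv_package(
    csv_bytes: &[u8],
    metadata_json: &[u8],
    stats_csv: &[u8],
) -> zip::result::ZipResult<()> {
    // Assemble data + metadata + stats cache into a single .qsv (ZIP) container.
    let file = std::fs::File::create("fruits.qsv")?;
    let mut archive = ZipWriter::new(file);
    let opts = FileOptions::default();

    archive.start_file("data.csv", opts)?; // the CSV itself
    archive.write_all(csv_bytes)?;

    archive.start_file("metadata.json", opts)?; // CSVW / DCAT metadata
    archive.write_all(metadata_json)?;

    archive.start_file("stats.csv", opts)?; // summary statistics cache
    archive.write_all(stats_csv)?;

    archive.finish()?;
    Ok(())
}
```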

jqnatividad commented 4 months ago

Also consider https://digital-preservation.github.io/csv-schema/

rzmk commented 3 months ago

Experimenting with this:

(screenshot attached)

Sample .qsv file in this ZIP: fruits.qsv.zip (can't share .qsv on GitHub).

jqnatividad commented 3 months ago

For comparison, note that several popular file formats are actually compressed "packages" - .docx, .xlsx, .epub and .jar, for example, are all just ZIP archives with a defined internal layout.
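All of these start with the ZIP local-file-header signature, so a quick "is this a package?" check needs only the first four bytes - a std-only sketch:

```rust
use std::fs::File;
use std::io::Read;

/// Returns true if the file starts with the ZIP local file header
/// signature ("PK\x03\x04"), i.e. it is (probably) a ZIP-based package.
fn looks_like_zip(path: &str) -> std::io::Result<bool> {
    let mut magic = [0u8; 4];
    let n = File::open(path)?.read(&mut magic)?;
    Ok(n == 4 && magic == [0x50, 0x4B, 0x03, 0x04])
}

fn main() -> std::io::Result<()> {
    // A .docx, .xlsx, .epub, or a future .qsv should all pass this check.
    println!("{}", looks_like_zip("report.docx")?);
    Ok(())
}
```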

rzmk commented 3 months ago

It may be nice if the .qsv file is verified to be valid, or if there's a flag that can be quickly checked to see whether it is, along with whether an index is available.

jqnatividad commented 3 months ago

Right @rzmk! The .qsv file, once implemented, is guaranteed to be ALWAYS valid, as the associated metadata/cache files will always be consistent with the core DATA stored in the archive. We can further ensure security by zipsigning the file so it cannot be tampered with.
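Even before full zipsigning, recording a content hash in the package metadata would let consumers detect tampering - a sketch using the `sha2` crate (the `dataHash` manifest field name is hypothetical):

```rust
use sha2::{Digest, Sha256};

/// Hex-encoded SHA-256 of the packaged CSV bytes. A signer (e.g. a
/// zipsign-style tool) would then sign the archive containing this hash;
/// the "dataHash" field it would live under in metadata.json is made up.
fn data_hash(csv_bytes: &[u8]) -> String {
    let mut hasher = Sha256::new();
    hasher.update(csv_bytes);
    hasher
        .finalize()
        .iter()
        .map(|b| format!("{b:02x}"))
        .collect()
}

fn main() {
    let csv = b"fruit,price\napple,2.50\nbanana,0.25\n";
    println!("dataHash = {}", data_hash(csv));
}
```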

Further, we can assign a Digital Object Identifier (DOI) to each qsv file so we can track/trace its provenance, and possibly, downstream use.

jqnatividad commented 3 months ago

If done properly, even with all the extra metadata in the .qsv package, a .qsv file will be even smaller than the raw version of the CSV! This is because CSV files tend to have very high compression ratios - typically 80-90% - and all that extra metadata (stats, frequency tables, etc.) is tiny, just a few KBs, even for multi-gigabyte CSV files.

jqnatividad commented 2 months ago

The qsv file will contain the cache file (#2097). It will also carry all the metadata describing the dataset using DCAT 3 (specifically, the DCAT-US v3 spec for the first implementation).
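Roughly, the DCAT side of that metadata could look like the sketch below (plain DCAT/Dublin Core terms only; DCAT-US v3 layers additional required properties on top, and the titles/URLs are placeholders):

```rust
use serde_json::json;

fn main() {
    // Minimal DCAT dataset description in JSON-LD. Values are placeholders;
    // DCAT-US v3 adds further US-specific fields on top of these core terms.
    let dataset = json!({
        "@context": {
            "dcat": "http://www.w3.org/ns/dcat#",
            "dct": "http://purl.org/dc/terms/"
        },
        "@type": "dcat:Dataset",
        "dct:title": "Fruits",
        "dct:modified": "2024-10-01",
        "dcat:distribution": {
            "@type": "dcat:Distribution",
            "dcat:mediaType": "text/csv",
            "dcat:downloadURL": "https://data.example.org/fruits.qsv"
        }
    });
    println!("{}", serde_json::to_string_pretty(&dataset).unwrap());
}
```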

jqnatividad commented 2 months ago

Related to #1705. The profile command will create the .qsv file.

Orcomp commented 1 month ago

Worth experimenting with different compression algorithms. We have found Zstandard to work very well with csv files.

jqnatividad commented 1 month ago

> Worth experimenting with different compression algorithms. We have found Zstandard to work very well with csv files.

Thanks @Orcomp, do you have any benchmarks/metrics you can share for Zstandard and the other compression algorithms you considered?

Orcomp commented 1 month ago

You can check out https://morotti.github.io/lzbench-web

(From my personal experience, zstd has a good balance between compression ratio and compress/decompress speeds. I looked into this 2-3 years ago, so things might have changed a bit since.)
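For a quick local sanity check (not a substitute for the benchmarks above), something like this sketch with the `zstd` crate reports ratio and timing for a CSV buffer:

```rust
use std::time::Instant;

fn main() -> std::io::Result<()> {
    // Tiny, self-contained ratio check with the zstd crate (level 3 is the
    // usual default). Real measurements should use real CSVs and a proper
    // harness like the lzbench results linked above.
    let csv = b"fruit,price\napple,2.50\nbanana,0.25\n".repeat(10_000);

    let start = Instant::now();
    let compressed = zstd::encode_all(&csv[..], 3)?;
    let elapsed = start.elapsed();

    println!(
        "{} -> {} bytes ({:.1}% smaller) in {:?}",
        csv.len(),
        compressed.len(),
        100.0 * (1.0 - compressed.len() as f64 / csv.len() as f64),
        elapsed
    );

    // Round-trip to confirm the compression is lossless.
    assert_eq!(zstd::decode_all(&compressed[..])?, csv);
    Ok(())
}
```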

jqnatividad commented 1 month ago

Instead of just signing the qsv file using conventional techniques, "explore using two emerging standards: the W3C Verifiable Credentials Data Model 2.0 and Decentralized Identifiers (DIDs) v1.0, which leverage NIST's FIPS 186-5 but also align well with the DCAT RDF model, making the result both human and machine readable."

See https://github.com/DOI-DO/dcat-us/issues/132
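As a loose illustration of what that could look like, here is a sketched Verifiable Credential wrapping a dataset hash - the DID, dates, hash and cryptosuite values are placeholders, not a worked-out design:

```rust
use serde_json::json;

fn main() {
    // Loose sketch of a W3C Verifiable Credential (Data Model 2.0) attesting
    // to a .qsv package. The DID, dates, hash and proof values are placeholders;
    // "ecdsa-rdfc-2019" is one Data Integrity cryptosuite built on ECDSA,
    // the signature family covered by FIPS 186-5.
    let credential = json!({
        "@context": ["https://www.w3.org/ns/credentials/v2"],
        "type": ["VerifiableCredential"],
        "issuer": "did:web:data.example.org",
        "validFrom": "2024-10-01T00:00:00Z",
        "credentialSubject": {
            "id": "https://data.example.org/fruits.qsv",
            "dataHash": "sha256-<hex digest of data.csv>"
        },
        "proof": {
            "type": "DataIntegrityProof",
            "cryptosuite": "ecdsa-rdfc-2019",
            "verificationMethod": "did:web:data.example.org#key-1",
            "proofValue": "<base58/base64 signature>"
        }
    });
    println!("{}", serde_json::to_string_pretty(&credential).unwrap());
}
```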