jqnatividad opened 4 months ago
Also consider https://digital-preservation.github.io/csv-schema/
Experimenting with this:
Sample .qsv file in this ZIP: fruits.qsv.zip (can't share .qsv files directly on GitHub).
For comparison, note that several popular file formats (e.g., .docx, .xlsx, .epub, .jar) are actually compressed "packages": they are ZIP archives under the hood.
It may be nice if the .qsv file is verified/validated on creation, or if there's a flag that can be quickly checked to see whether it is valid and whether an index is available.
Right @rzmk! The .qsv file, once implemented, is guaranteed to ALWAYS be valid, as the associated metadata/cache files will always be consistent with the core DATA stored in the archive. We can further ensure security by zipsign-ing the file so it cannot be tampered with.
Further, we can assign a Digital Object Identifier (DOI) to each .qsv file so we can track/trace its provenance and, possibly, downstream use.
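To make the tamper-evidence idea concrete, here is a minimal sketch of signing and verifying the archive bytes with a detached Ed25519 signature (zipsign is built on Ed25519, though its actual CLI/API may differ); the `fruits.qsv` file name is just the sample from above.

```rust
// Minimal sketch: detached Ed25519 signature over the .qsv archive bytes.
// Requires the ed25519-dalek crate (with its "rand_core" feature) and rand.
use ed25519_dalek::{Signature, Signer, SigningKey, Verifier, VerifyingKey};
use rand::rngs::OsRng;

fn main() -> std::io::Result<()> {
    // Hypothetical archive produced by the future `profile` command.
    let archive = std::fs::read("fruits.qsv")?;

    // The publisher signs the archive once, at creation time.
    let signing_key = SigningKey::generate(&mut OsRng);
    let signature: Signature = signing_key.sign(&archive);

    // Consumers verify against the published key before trusting the
    // embedded metadata/cache files.
    let verifying_key: VerifyingKey = signing_key.verifying_key();
    match verifying_key.verify(&archive, &signature) {
        Ok(()) => println!("signature OK; archive has not been tampered with"),
        Err(e) => eprintln!("verification failed: {e}"),
    }
    Ok(())
}
```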
If done properly, even with all the extra metadata in the .qsv package, a .qsv file will be smaller than the raw CSV! This is because CSV files tend to have very high compression ratios - typically 80-90% - and all that extra metadata (stats, frequency tables, etc.) is tiny, just a few KBs, even for multi-gigabyte CSV files.
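As a quick way to sanity-check that ratio, here is a minimal sketch using the zstd crate to compress a CSV in memory and report the savings (the input file name is hypothetical):

```rust
// Minimal sketch: measure the zstd compression ratio of a CSV file.
// Requires the zstd crate.
fn main() -> std::io::Result<()> {
    let raw = std::fs::read("fruits.csv")?; // hypothetical input CSV
    // Level 3 is zstd's default; higher levels trade speed for ratio.
    let compressed = zstd::encode_all(&raw[..], 3)?;
    let ratio = 100.0 * (1.0 - compressed.len() as f64 / raw.len() as f64);
    println!(
        "raw: {} bytes, compressed: {} bytes, saved: {ratio:.1}%",
        raw.len(),
        compressed.len()
    );
    Ok(())
}
```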
The .qsv file will contain the cache file (#2097). It will also have all the metadata describing the dataset using DCAT 3 (particularly, the DCAT-US v3 spec for the first implementation).
Related to #1705.
The profile command will create the .qsv file.
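A rough sketch of what that packaging step could look like, assuming the zip crate's 0.6-era API and purely hypothetical member names (data.csv, stats.csv, metadata.json):

```rust
// Rough sketch: package the CSV plus its metadata/cache files into one ZIP
// container. Member names are hypothetical; requires the zip crate.
use std::io::Write;
use zip::write::FileOptions;

fn main() -> zip::result::ZipResult<()> {
    let file = std::fs::File::create("fruits.qsv")?;
    let mut qsv_pkg = zip::ZipWriter::new(file);
    let options = FileOptions::default().compression_method(zip::CompressionMethod::Deflated);

    // The core data, plus the stats cache and DCAT/CSVW metadata alongside it.
    for (name, path) in [
        ("data.csv", "fruits.csv"),
        ("stats.csv", "fruits.stats.csv"),
        ("metadata.json", "fruits.metadata.json"),
    ] {
        qsv_pkg.start_file(name, options)?;
        qsv_pkg.write_all(&std::fs::read(path)?)?;
    }
    qsv_pkg.finish()?;
    Ok(())
}
```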
Worth experimenting with different compression algorithms. We have found Zstandard to work very well with csv files.
Thanks @Orcomp, do you have any benchmarks/metrics you can share for Zstandard and the other compression algorithms you considered?
You can check out https://morotti.github.io/lzbench-web
(From my personal experience, zstd has a good balance between compression ratio and compress/decompress speeds. I looked into this 2-3 years ago, so things might have changed a bit since.)
Instead of just signing the .qsv file using conventional techniques, "explore using two emerging standards: the W3C Verifiable Credentials Data Model 2.0 and Decentralized Identifiers (DIDs) v1.0 that leverage NIST's FIPS 186-5 but also align well with DCAT RDF model, making both human and machine readable."
Currently, qsv creates, consumes and validates CSV files hewing closely to the RFC 4180 specification as interpreted by the csv crate.
However, it doesn't allow us to save additional metadata - neither about the CSV file itself (dialect, delimiter used, comments, DOI, URL, etc.) nor about the data the file contains (summary statistics, data dictionary, creator, last updated, hash of the data, etc.).
The request is to create a .qsv file format that is an implementation of W3C's CSV on the Web (CSVW) specification, using the guidance on https://csvw.org, and to store schemata/metadata/data in the .qsv file: not just the schema info, but summary and frequency statistics as well; a container for DCAT 3/CKAN package/resource metadata; etc. Doing so will unlock additional capabilities in qsv, qsv pro, Datapusher+ and CKAN.
It will also allow us to "clean up" and consolidate the "metadata" files that qsv creates - the stats cache files, the index file, etc. - and package up the CSV and its associated metadata in one container as a signed zip file. It will also make "harvesting" and federation with CKAN easier and more robust, as all the needed data/metadata is in one container.
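Once everything lives in one container, a quick presence check for the stats cache and the index (as @rzmk asked about above) could be as simple as inspecting the archive's member names. A sketch assuming the zip crate and the same hypothetical member names as in the packaging example:

```rust
// Minimal sketch: check whether a .qsv container carries a stats cache and an
// index by inspecting its member names (names are hypothetical).
fn main() -> zip::result::ZipResult<()> {
    let file = std::fs::File::open("fruits.qsv")?;
    let archive = zip::ZipArchive::new(file)?;

    let names: Vec<&str> = archive.file_names().collect();
    println!("has stats cache: {}", names.contains(&"stats.csv"));
    println!("has index:       {}", names.contains(&"data.csv.idx"));
    Ok(())
}
```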