jqnatividad / qsv

Blazing-fast Data-Wrangling toolkit
https://qsv.dathere.com
The Unlicense
2.52k stars, 71 forks

`stats`: add dataset level stats #2288

Open jqnatividad opened 1 week ago

jqnatividad commented 1 week ago

Currently, `stats` only computes column-level statistics.

Also add dataset-level stats, with the "_qsv_" prefix, like:

jqnatividad commented 4 days ago

Instead of computing the file hash, which may take a long time for large files, just compute the hash of all the stats, including the row count, column count, and file size.

This makes it very likely that the hash is unique to the file in its current state, without having to scan the entire file a second time, so it can serve as a "fingerprint hash."
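A minimal sketch of the fingerprint idea, assuming the column-level stats are already available as serialized records. The `DatasetSummary` struct, the `fingerprint` function, and the choice of FNV-1a are all illustrative assumptions, not what qsv actually uses; a real implementation would more likely use a cryptographic or otherwise well-vetted hash.

```rust
/// Hypothetical container for the inputs named in the comment above:
/// row count, column count, file size, and the computed column stats.
struct DatasetSummary<'a> {
    row_count: u64,
    column_count: u64,
    file_size_bytes: u64,
    /// One already-serialized stats record per column (assumed format).
    column_stats: &'a [String],
}

// Standard 64-bit FNV-1a parameters.
const FNV_OFFSET: u64 = 0xcbf29ce484222325;
const FNV_PRIME: u64 = 0x100000001b3;

/// Fold `bytes` into a running FNV-1a hash state.
fn fnv1a_update(mut hash: u64, bytes: &[u8]) -> u64 {
    for &b in bytes {
        hash ^= u64::from(b);
        hash = hash.wrapping_mul(FNV_PRIME);
    }
    hash
}

/// Hash the summary values instead of the file contents, so the cost
/// depends on the number of columns, not the size of the file.
fn fingerprint(summary: &DatasetSummary<'_>) -> String {
    let mut h = FNV_OFFSET;
    for n in [
        summary.row_count,
        summary.column_count,
        summary.file_size_bytes,
    ] {
        h = fnv1a_update(h, &n.to_le_bytes());
    }
    for record in summary.column_stats {
        h = fnv1a_update(h, record.as_bytes());
        // Separator byte so that record boundaries affect the hash.
        h = fnv1a_update(h, &[0x1e]);
    }
    format!("{h:016x}")
}
```

Because only the summary values are hashed, two different files that happen to produce identical stats, row count, column count, and size would share a fingerprint. That is the trade-off for not rescanning the file.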

jqnatividad commented 2 days ago

#2297 is still WIP, but changed the dataset-level stats to:

Removed the leading underscore because it was tripping up `select` in CI: a bare underscore is the `select` sentinel value for the last column. The prefix is now `qsv__`, with two trailing underscores.
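A small sketch of how a consumer might tell the dataset-level entries apart from ordinary column names by the `qsv__` prefix. The helper names are hypothetical and this is not qsv's own code.

```rust
/// Prefix described in the comment above: "qsv" plus two trailing underscores.
const DATASET_STAT_PREFIX: &str = "qsv__";

/// True if `field` names a dataset-level stat rather than a real column.
fn is_dataset_stat(field: &str) -> bool {
    field.starts_with(DATASET_STAT_PREFIX)
}

/// Split field names into (dataset-level, column-level), preserving order.
fn split_stats<'a>(fields: &[&'a str]) -> (Vec<&'a str>, Vec<&'a str>) {
    fields.iter().copied().partition(|f| is_dataset_stat(f))
}
```

Since the prefix no longer starts with an underscore, names carrying it cannot be confused with the `select` last-column sentinel.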

jqnatividad commented 1 day ago

This is not truly done until the corresponding CI tests succeed. `stats` has hundreds of tests, so this will take a bit of effort.