Open jqnatividad opened 1 week ago
Instead of computing the file hash, which may take a long time for large files, just compute the hash of all the stats, including the rowcount, column count, and filesize.
This pretty much guarantees the hash will be unique for the file in its current state, without having to scan the entire file, serving as a "fingerprint hash." The new dataset-level stat columns are:
qsv__rowcount
qsv__columncount
qsv__filesize_bytes
qsv__fingerprint_hash
with their values stored in the last column of stats as qsv__value
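A minimal sketch of how such a fingerprint hash could be computed from stats that are already in hand (assumptions: the twox-hash 1.x API where XxHash64 implements std::hash::Hasher; the fingerprint_hash helper and its signature are hypothetical, not the actual qsv implementation):

```rust
use std::hash::Hasher;
use twox_hash::XxHash64;

// Hypothetical helper: fold the already-computed column-level stats plus the
// dataset-level stats into one xxHash value, without rereading the file.
fn fingerprint_hash(
    column_stats: &[Vec<String>],
    rowcount: u64,
    columncount: u64,
    filesize_bytes: u64,
) -> u64 {
    let mut hasher = XxHash64::with_seed(0);
    // fold in every column-level stat value
    for record in column_stats {
        for field in record {
            hasher.write(field.as_bytes());
        }
    }
    // fold in the dataset-level stats named above
    hasher.write(&rowcount.to_le_bytes());
    hasher.write(&columncount.to_le_bytes());
    hasher.write(&filesize_bytes.to_le_bytes());
    hasher.finish()
}

fn main() {
    let column_stats = vec![
        vec!["col1".to_string(), "Integer".to_string(), "42".to_string()],
        vec!["col2".to_string(), "String".to_string(), "7".to_string()],
    ];
    // value that would land in the qsv__value column for qsv__fingerprint_hash
    println!("{}", fingerprint_hash(&column_stats, 1_000, 2, 123_456));
}
```

Because only the stats are hashed, the cost is independent of file size; the hedged tradeoff is that two files with identical stats, rowcount, column count, and filesize would produce the same fingerprint, which those inputs make unlikely in practice.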
Removed the leading underscore because it was tripping up select in CI, as underscore is a select sentinel value for the last column. Made the prefix qsv__, with two trailing underscores.
This is not truly done until the corresponding CI tests succeed. stats has hundreds of tests, so this will take a bit of effort.
Currently, stats only computes column-level stats. Also add dataset-level stats, with the "_qsv_" prefix, like:
file hash (_qsv_hash) using the xxHash algorithm (via the twox-hash crate; see the sketch below)
The value for each dataset stat will be stored in a column named _qsv_value.
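For comparison with the fingerprint approach above, here is a minimal sketch of hashing the whole file with xxHash via twox-hash, streamed in chunks so large files are not read into memory (assumptions: the twox-hash 1.x streaming Hasher API; the file_hash helper is hypothetical and not the actual qsv code):

```rust
use std::fs::File;
use std::hash::Hasher;
use std::io::{self, Read};
use twox_hash::XxHash64;

// Stream the file through xxHash in 64 KiB chunks; this full scan of the
// file is exactly the cost the fingerprint hash is meant to avoid.
fn file_hash(path: &str) -> io::Result<u64> {
    let mut file = File::open(path)?;
    let mut hasher = XxHash64::with_seed(0);
    let mut buf = [0u8; 64 * 1024];
    loop {
        let n = file.read(&mut buf)?;
        if n == 0 {
            break;
        }
        hasher.write(&buf[..n]);
    }
    Ok(hasher.finish())
}

fn main() -> io::Result<()> {
    // value that would land in the _qsv_value column for _qsv_hash
    println!("{}", file_hash("data.csv")?);
    Ok(())
}
```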