Open jqnatividad opened 1 week ago
Instead of computing the file hash, which may take a long time for large files, just compute the hash of all the stats, including the rowcount, column count, and filesize.
This pretty much guarantees the hash will be unique for the file in its current state, without having to scan the entire file, serving as a "fingerprint hash." The new dataset-level stat columns are:
qsv__rowcount
qsv__columncount
qsv__filesize_bytes
qsv__fingerprint_hash
with their values stored in the last column of stats as qsv__value
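A minimal sketch of how such a fingerprint hash could be computed from stats that are already in hand (assumptions: the twox-hash 1.x API where XxHash64 implements std::hash::Hasher; the fingerprint_hash helper and its signature are hypothetical, not the actual qsv implementation):

```rust
use std::hash::Hasher;
use twox_hash::XxHash64;

// Hypothetical helper: fold the already-computed column-level stats plus the
// dataset-level stats into one xxHash value, without rereading the file.
fn fingerprint_hash(
    column_stats: &[Vec<String>],
    rowcount: u64,
    columncount: u64,
    filesize_bytes: u64,
) -> u64 {
    let mut hasher = XxHash64::with_seed(0);
    // fold in every column-level stat value
    for record in column_stats {
        for field in record {
            hasher.write(field.as_bytes());
        }
    }
    // fold in the dataset-level stats named above
    hasher.write(&rowcount.to_le_bytes());
    hasher.write(&columncount.to_le_bytes());
    hasher.write(&filesize_bytes.to_le_bytes());
    hasher.finish()
}

fn main() {
    let column_stats = vec![
        vec!["col1".to_string(), "Integer".to_string(), "42".to_string()],
        vec!["col2".to_string(), "String".to_string(), "7".to_string()],
    ];
    // value that would land in the qsv__value column for qsv__fingerprint_hash
    println!("{}", fingerprint_hash(&column_stats, 1_000, 2, 123_456));
}
```

Because only the stats are hashed, the cost is independent of file size; the hedged tradeoff is that two files with identical stats, rowcount, column count, and filesize would produce the same fingerprint, which those inputs make unlikely in practice.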
Removed the leading underscore because it was tripping up select in CI, as underscore is a select sentinel value for the last column. Made the prefix qsv__, with two trailing underscores.
This is not truly done until the corresponding CI tests succeed. stats has hundreds of tests, so this will take a bit of effort.
Currently, stats only computes column-level stats. Also add dataset-level stats, with the "_qsv_" prefix, like:
file hash (_qsv_hash) using the xxHash algorithm (via the twox-hash crate; see the sketch below)
The value for each dataset stat will be stored in a column named _qsv_value.
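For comparison with the fingerprint approach above, here is a minimal sketch of hashing the whole file with xxHash via twox-hash, streamed in chunks so large files are not read into memory (assumptions: the twox-hash 1.x streaming Hasher API; the file_hash helper is hypothetical and not the actual qsv code):

```rust
use std::fs::File;
use std::hash::Hasher;
use std::io::{self, Read};
use twox_hash::XxHash64;

// Stream the file through xxHash in 64 KiB chunks; this full scan of the
// file is exactly the cost the fingerprint hash is meant to avoid.
fn file_hash(path: &str) -> io::Result<u64> {
    let mut file = File::open(path)?;
    let mut hasher = XxHash64::with_seed(0);
    let mut buf = [0u8; 64 * 1024];
    loop {
        let n = file.read(&mut buf)?;
        if n == 0 {
            break;
        }
        hasher.write(&buf[..n]);
    }
    Ok(hasher.finish())
}

fn main() -> io::Result<()> {
    // value that would land in the _qsv_value column for _qsv_hash
    println!("{}", file_hash("data.csv")?);
    Ok(())
}
```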