I would consider this a feature :-)-O On Linux and the Mac you can easily filter for that and put a pipe in front of qsv.

The binary format supported is Snappy, and while I am unimpressed by it myself, it is very fast, so on very large data sets I would first (g)unzip and then snap the file for repeated use.

While I like gzip, I am not sure feature bloat is helpful when the same thing can easily be done with a pipe or shell script.
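A minimal sketch of that workflow, assuming your qsv build includes the snappy subcommand (an external Snappy tool such as snzip may also work; the filename is just an example):

# decompress the gzip archive once, then re-compress with Snappy
gunzip --keep mybigdata.csv.gz
qsv snappy compress mybigdata.csv --output mybigdata.csv.sz

# repeated runs can then read the faster Snappy-compressed file directly
qsv stats mybigdata.csv.sz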
Crashing with a thread 'main' panicked is a feature? :thinking:
Hi @wardi,

As stats is a central qsv command and the main engine behind DataPusher+, I've tweaked it over time to squeeze as much performance as possible from it to enable the "Automagical Metadata" qualities we're both working on in CKAN.
As such, its top goal is performance.
That's why I chose to support Snappy, instead of more popular compression formats like gz and zip.
Another goal of qsv is composability, so as you and @ondohotola pointed out, qsv can be easily used with other purpose-built command-line tools.
But you're right, qsv should at least check for supported formats and fail gracefully rather than panic.
Currently, it already has logic to detect CSV, TSV/TAB and SSV formats and their Snappy-compressed variants (csv.sz, tsv.sz, tab.sz and ssv.sz), set the default delimiter accordingly, and compress/decompress automatically, and that logic could easily be extended.
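For example (the filename here is hypothetical), a Snappy-compressed TSV is handled transparently:

# the .tsv.sz extension sets the delimiter to tab and triggers automatic decompression
qsv stats data.tsv.sz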
In the meantime, you may want to run validate upstream of stats in your pipeline. That's what DP+ and qsv pro do as the first thing when ingesting a dataset. If not provided a JSON Schema, validate goes into RFC 4180 validation mode and also checks that the file is UTF-8 encoded.
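A minimal sketch of that pattern (the filename is just an example; this assumes validate exits with a non-zero code on invalid input):

# RFC 4180 + UTF-8 sanity check first; only run stats if the file validates
qsv validate mybigdata.csv && qsv stats mybigdata.csv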
Thanks @jqnatividad, maybe when I finally get into Rust I could send a PR with some more automatic stream compression/decompression formats.
Hi @wardi,

Went the extra mile and added mime-type inferencing using the file-format crate, which is already being used by the sniff command (which may be of interest to you too, as sniff was created to support next-gen CKAN harvesting - being able to harvest remote CSVs' metadata by just sampling them).
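For instance (the URL below is just a placeholder), sniffing a remote CSV looks like this:

qsv sniff https://example.com/some-dataset.csv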
Also added a more human-friendly panic handler with the human-panic crate.
Describe the bug
qsv crashes if given binary data
To Reproduce
Expected behavior
Report an error with the expected format, or for bonus points handle .gz, .bz2, .xz, etc. automatically
Screenshots/Backtrace/Sample Data
Desktop (please complete the following information):
Additional context
Happily, qsv does work fine in a pipeline like:
zcat mybigdata.csv.gz | qsv stats
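The same workaround applies to the other formats mentioned above, assuming bzip2 and xz are installed (filenames are just examples):

bzcat mybigdata.csv.bz2 | qsv stats
xzcat mybigdata.csv.xz | qsv stats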