jqnatividad / qsv

Blazing-fast Data-Wrangling toolkit
https://qsv.dathere.com
The Unlicense
2.52k stars · 71 forks

crash on binary data, native support for compressed csv? #2301

Closed: wardi closed this 16 hours ago

wardi commented 1 day ago

Describe the bug

qsv crashes if given binary data

To Reproduce

$ qsv stats mybigdata.csv.gz 
thread 'main' panicked at /home/runner/.cargo/git/checkouts/rust-csv-4524c5d96b17e863/7dc2760/src/byte_record.rs:277:56:
range end index 3569017630560 out of range for slice of length 1
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace

Expected behavior

Report an error with the expected format, or for bonus points handle .gz, .bz2, .xz etc. automatically


Additional context

Happily, qsv does work fine in a pipeline like `zcat mybigdata.csv.gz | qsv stats`
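For reference, the automatic handling requested above can be sketched with Python's standard library alone, dispatching on file extension. This is only an illustration of the idea, not qsv's implementation, and the extension list is an assumption:

```python
import bz2
import gzip
import lzma

# Extension -> opener yielding decoded text transparently.
# A sketch of the requested auto-decompression, NOT qsv's actual logic.
OPENERS = {".gz": gzip.open, ".bz2": bz2.open, ".xz": lzma.open}

def open_maybe_compressed(path):
    """Open a (possibly compressed) CSV as text, chosen by file extension."""
    for ext, opener in OPENERS.items():
        if path.endswith(ext):
            return opener(path, "rt", encoding="utf-8")
    return open(path, "r", encoding="utf-8")
```

A tool built this way reads `mybigdata.csv.gz` and `mybigdata.csv` through the same code path, which is exactly the convenience the pipeline above provides by hand.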

ondohotola commented 1 day ago

I would consider this a feature :-)-O On Linux and the Mac you can easily filter for that and pipe the result into qsv.

The binary format supported is Snappy. While I am unimpressed by it myself, it is very fast, so on very large data sets I would first gunzip and then Snappy-compress the file for repeated use.

While I like gzip, I am not sure the feature bloat is worthwhile when the same thing can easily be done with a pipe or a shell script.

wardi commented 1 day ago

Crashing with a `thread 'main' panicked` is a feature? :thinking:

jqnatividad commented 1 day ago

Hi @wardi, as `stats` is a central qsv command and the main engine behind DataPusher+, I've tweaked it over time to squeeze as much performance as possible from it, to enable the "Automagical Metadata" qualities we're both working on in CKAN.

As such, its top goal is performance.

That's why I chose to support Snappy, instead of more popular compression formats like gzip and zip.

Another goal of qsv is composability, so as you and @ondohotola pointed out, qsv can be easily used with other purpose-built command-line tools.

But you're right, qsv should at least check for supported formats and fail gracefully rather than panic.

Currently, it already has logic to detect CSV, TSV/TAB and SSV formats and their Snappy-compressed variants (csv.sz, tsv.sz, tab.sz and ssv.sz), set the default delimiter accordingly, and compress/decompress automatically, and that logic could easily be extended.
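That kind of extension-based detection can be complemented with a magic-byte check so binary input fails fast instead of panicking. A minimal sketch of the idea follows; the enum and function names are hypothetical and this is not qsv's actual code:

```rust
/// Detected container format, judged from the file's leading bytes.
/// A hedged sketch only, not qsv's real detection logic.
#[derive(Debug, PartialEq)]
enum Container {
    Plain,   // assume delimited text
    Gzip,
    Snappy,  // Snappy frame format (.sz)
    Unknown, // binary data we cannot handle
}

fn detect(header: &[u8]) -> Container {
    // gzip streams start with the two-byte magic 0x1f 0x8b
    if header.starts_with(&[0x1f, 0x8b]) {
        return Container::Gzip;
    }
    // the Snappy frame format starts with a stream identifier chunk:
    // 0xff 0x06 0x00 0x00 followed by the ASCII bytes "sNaPpY"
    if header.starts_with(&[0xff, 0x06, 0x00, 0x00]) && header[4..].starts_with(b"sNaPpY") {
        return Container::Snappy;
    }
    // reject NUL bytes early: delimited text should not contain them
    if header.contains(&0) {
        return Container::Unknown;
    }
    Container::Plain
}
```

With a check like this up front, a gzipped input would produce a clear "unsupported format" error instead of an out-of-range panic deep inside the CSV reader.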

In the meantime, you may want to use `validate` upstream of `stats` in your pipeline. That's what DP+ and qsv pro do as the first thing when ingesting a dataset. If no JSON Schema is provided, `validate` falls back to RFC 4180 validation mode and also checks that the file is UTF-8 encoded.
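To illustrate what that upstream validation buys you, here is a stdlib-only sketch of the concept (decode as UTF-8, then check that records parse and are rectangular). It mirrors the idea of validating before `stats`, not qsv `validate`'s implementation:

```python
import csv
import io

def quick_validate(data: bytes) -> bool:
    """Cheap pre-check: UTF-8 text whose CSV records all match the header width.
    A conceptual sketch of validate-before-stats, NOT qsv's validate command."""
    try:
        text = data.decode("utf-8")  # binary junk (e.g. gzip bytes) fails here
    except UnicodeDecodeError:
        return False
    rows = list(csv.reader(io.StringIO(text)))
    if not rows:
        return False
    width = len(rows[0])
    # every record must have as many fields as the header
    return all(len(row) == width for row in rows)
```

Run against the gzipped file from the bug report, a check like this would reject the input immediately, long before any byte-record indexing could panic.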

wardi commented 1 day ago

Thanks @jqnatividad, maybe when I finally get into Rust I could send a PR with some more automatic stream compression/decompression formats.

jqnatividad commented 15 hours ago

Hi @wardi, I went the extra mile and added mime-type inferencing using the file-format crate, which is already being used by the `sniff` command. (`sniff` may be of interest to you too; it was created to support next-gen CKAN harvesting, i.e. harvesting remote CSVs' metadata by just sampling them.)

I also added a more human-friendly panic handler with the human-panic crate.