jqnatividad / qsv

CSVs sliced, diced & analyzed.
The Unlicense
2.41k stars 67 forks source link

BUG Incorrect delimiter in qsv sniff #1719

Open EricSoroos opened 4 months ago

EricSoroos commented 4 months ago

Describe the bug QSV sniff from 0.102, 0.112, and 0.125 is incorrectly determining the delimiter for a specific file.

To Reproduce incorrect-delimiter.csv

$ qsv sniff file.csv
Path: ...file.csv
Sniff Timestamp: 2024-04-03T15:41:05.706837+00:00
Last Modified: 2024-04-03T15:18:29+00:00
Delimiter: E
Header Row: true
Preamble Rows: 0
Quote Char: none
Flexible: false
Is UTF8: true
Detected Mime Type: text/plain
Detected Kind: Other
Retrieved Size (bytes): 4,375
File Size (bytes): 4,375
Sampled Records: 52
Estimated: false
Num Records: 52
Avg Record Len (bytes): 76
Num Fields: 2
Stats Types: false
Fields:
    0:  Text  Quarter,Completions N
    1:  Text  W,Commencements,Completions > Commencements ,%YoY Completions ,%YoY Commencements ,YoY Completions ,YoY Commencements ,%QoQ Completions  ,%QoQ Commencements ,QoQ Completions ,QoQ Commencements 

Expected behavior Comma as the delimiter. Columns the same as qsv stats or qsv headers.

$ qsv headers incorrect-delimiter.csv 
1   Quarter
2   Completions NEW
3   Commencements
4   Completions > Commencements 
5   %YoY Completions 
6   %YoY Commencements 
7   YoY Completions 
8   YoY Commencements 
9   %QoQ Completions  
10  %QoQ Commencements 
11  QoQ Completions 
12  QoQ Commencements 
$ qsv stats incorrect-delimiter.csv 
field,type,sum,min,max,range,min_length,max_length,mean,stddev,variance,nullcount,sparsity
Quarter,String,,Q1 12,Q4 23,,5,7,,,,0,0
Completions NEW,Integer,64693,106,3799,3693,3,4,1244.0962,956.6243,915130.0484,0,0
Commencements,String,,"1,855",898,,2,5,,,,0,0
Completions > Commencements ,String,,FALSE,TRUE,,4,5,,,,0,0
%YoY Completions ,String,,-0.60%,92.80%,,0,7,,,,4,0.0769
%YoY Commencements ,String,,-11.20%,88.00%,,0,7,,,,4,0.0769
YoY Completions ,Float,12016.1609,-885.9696684,1487.868126,2373.8378,0,12,250.3367,462.0062,213449.7579,4,0.0769
YoY Commencements ,Integer,13255,-2823,3859,6682,0,5,276.1458,999.0011,998003.2912,4,0.0769
%QoQ Completions  ,String,,-0.50%,9.26%,,0,7,,,,1,0.0192
%QoQ Commencements ,String,,-0.10%,91.50%,,0,8,,,,1,0.0192
QoQ Completions ,Float,3661.8907,-982.0637068,819.2936393,1801.3573,0,12,71.8018,334.6705,112004.3264,1,0.0192
QoQ Commencements ,Integer,3936,-2064,4618,6682,0,5,77.1765,923.7335,853283.5963,1,0.0192

Desktop (please complete the following information):

jqnatividad commented 4 months ago

sniff uses the qsv-sniffer crate, which in turn, is a fork of the csv-sniffer crate.

I ended up creating qsv-sniffer since it seems csv-sniffer was no longer being actively maintained as shown by the numerous unanswered issues.

But TBH, the Viterbi algorithm it uses to sniff and infer CSV schemas is still something I don't fully grok, so it trips up on certain CSVs.

It works well enough for "typical" CSVs and I've tweaked it enough to remove the panics, but the whole Viterbi inferencing part needs to be refactored.

FYI, my intent with qsv sniff is to help power a next-gen harvester, that's why I've added all the other extra stuff (mime-type sniffing, range requests sniffing, etc.).

Perhaps, we can tag-team on qsv-sniffer to make its CSV schema inferencing more reliable?

EricSoroos commented 4 months ago

I'll take a look on it -- We've apparently hit this before (according to a co-worker) and we've got a build of dp+ that basically ignores non-standard delimiters. So immediate issue is patched around.

So, I'm not good with rust (you've seen basically all of what I've ever done), but this confuses me: https://github.com/jqnatividad/qsv-sniffer/blob/master/src/sniffer.rs#L506 . We're calling this with all of the possible quote characters in character, delim is None. This for the "most common" case (csv with "), this regex (somewhat simplified) is hopefully looking for "[ ]*?,[ ]*". If there are no quotes in the csv, then I'm unclear what it's ever going to match, but apparently it was coming up with E?

The delimiter count here is only going to ever pick up a valid delimiter if there's a quote delim quote pattern, which doesn't look like it's going to happen with this test file, and isn't necessarily going to be consistent unless it's a 'quote all' type of file. I suspect if the initial guess of the delimiter was good, then we'd probably get the Viterbi confirming the choice.

What if we look at a sample of likely delimiters. ,\t;|:, and ran counts per line. even in the ignorance of quotes, there should be one that has a roughly consistent number of occurrences per line, with the min value likely to be approximately the number of columns. It would at least rule out ones that don't appear, or have large variations in the number of occurrences per line. That would probably work better for then figuring out a quote, because quotes should be either [quote]{2}, or [quote][delim] or [delim][quote], or it's not a valid quoting for the file.

EricSoroos commented 4 months ago

So having a look at what python's csv.sniff is doing, the _guess_quote_and_delimiter https://github.com/python/cpython/blob/main/Lib/csv.py#L273 is very similar, but covers all 4 possible quote patterns for a quoted field, not trying to find the possible cases for a delimiter between quotes.

It also includes a plain _guess_delimiter https://github.com/python/cpython/blob/main/Lib/csv.py#L349 that's essentially what I was suggesting in the last para as a fallback to when guessing the quote and delimiter at the same time doesn't work.

jqnatividad commented 4 months ago

Thanks for digging into this @EricSoroos !

Aligning qsv-sniffer's behavior with python's csv sniffer is the way to go!

Do you know if it uses the same Viterbi algorithm? If not, I'm inclined to just rewrite qsv-sniffer to just do a straight port of python's csv sniffer...

EricSoroos commented 4 months ago

It doesn't look like it -- it just looks like a frequency based check. Viterbi looks like a general constrain satisfaction algorithm, so it's just one way to determine if a set of parameters is the most likely description of the data.

jqnatividad commented 4 months ago

Awesome! I'll create a new branch on the qsv-sniffer crate then and start porting over the code.

Would be nice if we can co-maintain it as there's really nothing like python's csv sniffing in the Rust ecosystem beyond csv-sniffer and qsv-sniffer. Polars has a very fast multi-threaded, mem-mapped CSV reader, and it has some CSV schema sniffing smarts, but its not a general library that can be used separately without bringing in a lot of polars crates.

jqnatividad commented 4 months ago

Here's the tracking issue for the EPIC (in more ways than one 😉 ) port of python's csv-sniff

https://github.com/jqnatividad/qsv-sniffer/issues/14

jqnatividad commented 1 month ago

Perhaps, the best way to "fix" this is to NOT use the qsv-sniffer crate and just use the workhorse csv crate instead.

The new approach will just attempt to open the candidate file as a CSV file, and get the metadata that way.