Sniffing fails on non-UTF-8 files

harrybiddle commented 2 years ago

Hey all,

I tried running the csv sniffer (as part of qsv, see https://github.com/jqnatividad/qsv/issues/199) on the following file, but it doesn't seem to work.

test.csv

Not sure if the problem is in qsv or here, but since qsv works on other files I thought it's most likely an issue here...

jblondin commented 2 years ago

It looks like it might be an issue with qsv, since it works for me locally:

$ ./target/release/sniff test.csv
Metadata
========
Dialect:
        Delimiter: ;
        Has header row?: true
        Number of preamble rows: 0
        Quote character: none
        Flexible: false

Number of fields: 11
Types:
        0: Text
        1: Unsigned
        2: Float
        3: Float
        4: Text
        5: Text
        6: Text
        7: Text
        8: Text
        9: Text
        10: Text

I'll try testing it out from qsv as well.

harrybiddle commented 2 years ago

Aha, I didn't realise that the build ships with a CLI. From the README I thought it was library-only. Perhaps it's worth including the CLI in the README?

jblondin commented 2 years ago

Definitely! I'll leave this open as a documentation issue then.

jqnatividad commented 2 years ago

Hhmmm... weird, it's not working for me using the csv-sniffer CLI:

./target/release/sniff ~/Downloads/test.csv 
ERROR: IO error: stream did not contain valid UTF-8

Happens on both Windows 11 and Ubuntu Linux LTS 20.04...

jblondin commented 2 years ago

Interesting! And of course I ran it on macOS. I'll check it out in containers.

jqnatividad commented 2 years ago

Hmmm... it also happens on my MacBook Air 2018 running Monterey...

I wonder if it's because of some locale settings, @jblondin?

jblondin commented 2 years ago

Interesting, I'm on an M1 running Monterey. I'll investigate.

jblondin commented 2 years ago

The encoding on this test file seems to be 'Western (Windows 1252)', while the library currently only supports UTF-8.

At some point when I was originally testing this, I apparently re-saved the file as UTF-8 before running my test, which is why it initially worked for me. I just re-downloaded it, re-ran the sniffer, and am now encountering the same error.
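
To illustrate the failure mode described above, here is a minimal sketch using only the Rust standard library. It is not csv-sniffer's code: `read_as_utf8` is a hypothetical helper and the sample bytes are made up. It shows that a reader which insists on UTF-8 rejects Windows-1252 bytes but accepts the same text once re-saved as UTF-8.

```rust
use std::io::{Cursor, ErrorKind, Read};

/// Hypothetical helper: reads a whole stream into a `String`,
/// the way a UTF-8-only reader would.
fn read_as_utf8<R: Read>(mut reader: R) -> std::io::Result<String> {
    let mut text = String::new();
    reader.read_to_string(&mut text)?;
    Ok(text)
}

fn main() {
    // In Windows-1252 the letter 'ã' is the single byte 0xE3,
    // which is not a valid UTF-8 sequence when followed by ASCII.
    let windows_1252 = b"NOME;S\xE3o Paulo\n".to_vec();
    // In UTF-8 the same letter is the two bytes 0xC3 0xA3.
    let utf8 = "NOME;São Paulo\n".as_bytes().to_vec();

    let err = read_as_utf8(Cursor::new(windows_1252)).unwrap_err();
    assert_eq!(err.kind(), ErrorKind::InvalidData);
    println!("Windows-1252 input failed: {}", err);

    let ok = read_as_utf8(Cursor::new(utf8)).unwrap();
    println!("UTF-8 input succeeded: {:?}", ok);
}
```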

jqnatividad commented 2 years ago

@jblondin I encountered the same issue with qsv. To minimize encoding errors, I used https://github.com/BurntSushi/encoding_rs_io to automatically transcode to UTF-8.

jblondin commented 2 years ago

I found the same 😄 I started looking into it yesterday and should have a fix soon.

jblondin commented 2 years ago

So it doesn't look like the fix is as simple as just dropping in encoding_rs_io as a transcoder, since out of the box it only transcodes automatically between UTF-16 and UTF-8 (it relies on detecting a BOM), and this file is Windows-1252.

qsv is mostly resilient to encoding woes due to its predominant usage of ByteRecords from the csv package, but does fail on some commands, e.g.

> qsv pseudo -d ';' COD.IMPDR.EXPDR tests/data/semicolon.nonUTF.csv
DIA.DESEMB,COD.SUBITEM.NCM,VMLE.DOLAR.BAL.EXP,PESO.LIQ.MERC.BAL.EXP,COD.IMPDR.EXPDR,NOME.IMPDR.EXPDR,PAIS.ORIGEM.DESTINO,UA.LOCAL.DESBQ.EMBQ,NOME.IMPORTADOR.ESTRANGEIRO,NUM.DDE,NUM.RE
CSV parse error: record 1 (line 1, field: 6, byte: 184): invalid utf-8: invalid UTF-8 in field 6 near byte index 4

I'll dig into this a little more to see if there's a solution.
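
The difference between byte-oriented and string-oriented processing can be sketched with the standard library alone. This is an analogy, not the csv crate's `ByteRecord` implementation: `split_fields` is a hypothetical helper that ignores quoting, and the sample line is invented. Splitting on the delimiter byte works regardless of encoding, and the error only appears once a field has to be interpreted as UTF-8 text.

```rust
/// Hypothetical helper: splits one delimited line into raw byte fields
/// without decoding them. Quoting and escaping are deliberately ignored.
fn split_fields(line: &[u8], delimiter: u8) -> Vec<&[u8]> {
    line.split(|&b| b == delimiter).collect()
}

fn main() {
    // Second field contains 0xE3 ('ã' in Windows-1252), invalid as UTF-8 here.
    let line: &[u8] = b"123;S\xE3o Paulo;BR";
    let fields = split_fields(line, b';');

    // Byte-level splitting succeeds on any single-byte encoding.
    assert_eq!(fields.len(), 3);

    // Decoding is where things diverge: ASCII-only fields are fine,
    // the field with the 0xE3 byte is rejected.
    assert!(std::str::from_utf8(fields[0]).is_ok());
    assert!(std::str::from_utf8(fields[1]).is_err());
    assert!(std::str::from_utf8(fields[2]).is_ok());
}
```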

jqnatividad commented 2 years ago

Hi @jblondin, thanks to your research, I ended up just stating that qsv requires UTF-8 and leaving it at that.

With that assumption, I then fully embraced using from_utf8_unchecked for the extra performance. 😉

That said, it'd be awesome if csv-sniffer can also sniff a csv's encoding!

jblondin commented 2 years ago

For now, I'm just going to have csv-sniffer accept the encoding as a parameter.
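
As a rough idea of what "accept the encoding as a parameter" could look like, here is a standard-library-only sketch. Everything in it is an assumption for illustration: the `Encoding` enum, the `decode` function, and the choice to map the five bytes Windows-1252 leaves undefined to the matching C1 control characters. It is not csv-sniffer's actual API.

```rust
/// Hypothetical set of encodings a caller could pass in.
#[derive(Clone, Copy, Debug, PartialEq)]
enum Encoding {
    Utf8,
    Windows1252,
}

/// Windows-1252 characters for bytes 0x80..=0x9F. All other bytes map to
/// the Unicode code point with the same numeric value. The five undefined
/// bytes (0x81, 0x8D, 0x8F, 0x90, 0x9D) are mapped to C1 controls here.
const CP1252_HIGH: [char; 32] = [
    '\u{20AC}', '\u{0081}', '\u{201A}', '\u{0192}', '\u{201E}', '\u{2026}', '\u{2020}', '\u{2021}',
    '\u{02C6}', '\u{2030}', '\u{0160}', '\u{2039}', '\u{0152}', '\u{008D}', '\u{017D}', '\u{008F}',
    '\u{0090}', '\u{2018}', '\u{2019}', '\u{201C}', '\u{201D}', '\u{2022}', '\u{2013}', '\u{2014}',
    '\u{02DC}', '\u{2122}', '\u{0161}', '\u{203A}', '\u{0153}', '\u{009D}', '\u{017E}', '\u{0178}',
];

/// Decodes `bytes` into a UTF-8 `String` using the caller-supplied encoding.
/// UTF-8 input is validated; Windows-1252 input cannot fail because every
/// byte has a mapping in this sketch.
fn decode(bytes: &[u8], encoding: Encoding) -> Result<String, std::str::Utf8Error> {
    match encoding {
        Encoding::Utf8 => std::str::from_utf8(bytes).map(|s| s.to_owned()),
        Encoding::Windows1252 => Ok(bytes
            .iter()
            .map(|&b| match b {
                0x80..=0x9F => CP1252_HIGH[(b - 0x80) as usize],
                _ => b as char,
            })
            .collect()),
    }
}

fn main() {
    let raw = b"S\xE3o Paulo;\x80 10";
    // The same bytes fail as UTF-8 but decode cleanly as Windows-1252.
    assert!(decode(raw, Encoding::Utf8).is_err());
    let text = decode(raw, Encoding::Windows1252).unwrap();
    println!("{}", text);
}
```

Once the input has been decoded this way, the rest of a sniffer can keep assuming UTF-8 internally.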

Sniffing the encoding is a whole other (fascinating) subject. I may add it, or spin off another crate with that functionality. My understanding is that it's very heuristic-based and prone to error, but there are definitely tools that do it (Sublime Text, for instance, detects and handles different encodings just fine). Having an encoding sniffer plus an automatic transcoder might be really useful.
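
To give a feel for how heuristic such detection is, here is a deliberately tiny sketch using only the standard library. The `Guess` enum and `guess_encoding` function are hypothetical, and real detectors use far richer statistics. This one checks for a byte-order mark, then falls back to "does the sample validate as UTF-8?", and otherwise can only say that the data is in some other, unidentified encoding.

```rust
/// Hypothetical result of a very coarse encoding guess.
#[derive(Debug, PartialEq)]
enum Guess {
    Utf8,
    Utf16Le,
    Utf16Be,
    /// Not valid UTF-8 and no BOM: some legacy encoding, not identified here.
    Unknown,
}

/// Guesses the encoding of a sample of bytes taken from the start of a file.
fn guess_encoding(sample: &[u8]) -> Guess {
    // 1. A byte-order mark is the only unambiguous signal.
    if sample.starts_with(&[0xEF, 0xBB, 0xBF]) {
        return Guess::Utf8;
    }
    if sample.starts_with(&[0xFF, 0xFE]) {
        return Guess::Utf16Le;
    }
    if sample.starts_with(&[0xFE, 0xFF]) {
        return Guess::Utf16Be;
    }
    // 2. Otherwise, test whether the sample validates as UTF-8.
    match std::str::from_utf8(sample) {
        Ok(_) => Guess::Utf8,
        // `error_len() == None` means the sample merely ended in the middle
        // of a multi-byte character, which is still consistent with UTF-8.
        Err(e) if e.error_len().is_none() => Guess::Utf8,
        Err(_) => Guess::Unknown,
    }
}

fn main() {
    println!("{:?}", guess_encoding(b"a;b;c"));
    println!("{:?}", guess_encoding(b"S\xE3o Paulo"));
}
```

Note that pure-ASCII input is reported as UTF-8 even if the rest of the file is not, which is one reason sample-based guesses like this are error-prone.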

For qsv, an expectation of UTF-8 makes sense 😄 I did notice that you already use from_utf8_unchecked at times (in the stats module, for instance), which I believe is undefined behavior on non-UTF-8 files, but it seems to work fine on this file at least!
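
For reference, the safe standard-library alternatives to `from_utf8_unchecked` look like the sketch below. `field_as_text` is a hypothetical helper, not code from qsv. The checked and lossy conversions cost a validation pass, which is the overhead the unchecked variant skips, but they have defined behavior on non-UTF-8 input.

```rust
use std::borrow::Cow;

/// Hypothetical helper: turns a raw field into text without `unsafe`.
/// Valid UTF-8 is borrowed as-is; invalid sequences are replaced with
/// U+FFFD (the Unicode replacement character) in a newly allocated string.
fn field_as_text(field: &[u8]) -> Cow<'_, str> {
    String::from_utf8_lossy(field)
}

fn main() {
    let bad: &[u8] = b"S\xE3o Paulo"; // 0xE3 is 'ã' in Windows-1252
    let good: &[u8] = b"Sao Paulo";

    // Checked conversion: reports the problem instead of assuming validity.
    assert!(std::str::from_utf8(bad).is_err());
    assert!(std::str::from_utf8(good).is_ok());

    // Lossy conversion: always succeeds, marking the invalid byte.
    assert_eq!(field_as_text(bad), "S\u{FFFD}o Paulo");

    // Calling `std::str::from_utf8_unchecked(bad)` instead would violate
    // the function's safety contract, so it is intentionally not done here.
}
```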

jqnatividad commented 2 years ago

Yes. I was really trying to squeeze as much performance as possible out of qsv stats, as it's central to the project I'm working on: scanning a CSV file for stats and data types, and then prepopulating metadata about it (data dictionary, frequency table, descriptive stats, jsonschema) in a CKAN data catalog while users are entering the metadata.

And now that I've decided to embrace the UTF-8 requirement, I'm using from_utf8_unchecked more throughout qsv, especially in the hot loops.

As to detecting encoding, I found https://github.com/thuleqaid/rust-chardet, which is inspired by https://pypi.org/project/chardet/.

It seems promising, though it looks unmaintained. Even reqwest was considering using it at one time, were it not for the incompatible license.

I also found chardetng, but it's targeted at web use, as the character encoding detector of Firefox (I found this writeup fascinating!)

Perhaps you can leverage it for csv-sniffer?

jblondin / csv-sniffer

Sniffing fails on non-UTF-8 files #13