Open harrybiddle opened 2 years ago
Hey all,
I tried running the csv sniffer (as part of qsv, see https://github.com/jqnatividad/qsv/issues/199) on the following file, but it doesn't seem to work.
test.csv
Not sure if the problem is in qsv or here, but since qsv works on other files I thought it's most likely an issue here...
It looks like it might be an issue with qsv, since it works for me locally:
$ ./target/release/sniff test.csv
Metadata
========
Dialect:
Delimiter: ;
Has header row?: true
Number of preamble rows: 0
Quote character: none
Flexible: false
Number of fields: 11
Types:
0: Text
1: Unsigned
2: Float
3: Float
4: Text
5: Text
6: Text
7: Text
8: Text
9: Text
10: Text
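For comparison, roughly the same metadata should be obtainable through the library; a minimal sketch, assuming the Sniffer::new().sniff_path entry point and that Metadata's Display output matches the CLI block above:

```rust
use csv_sniffer::Sniffer;

fn main() {
    // Sniff the dialect (delimiter, header, quoting) and per-column types.
    let metadata = Sniffer::new()
        .sniff_path("test.csv")
        .expect("sniffing failed");

    // Assumed to print the same "Metadata" block as the `sniff` CLI above.
    println!("{}", metadata);
}
```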
I'll try testing it out from qsv as well.
Aha, I didn't realise that the build ships with a CLI. From the README I thought it was library-only. Perhaps it's worth including the CLI in the README?
Definitely! I'll leave this open as a documentation issue then.
Hhmmm... weird, it's not working for me using the csv-sniffer CLI:
./target/release/sniff ~/Downloads/test.csv
ERROR: IO error: stream did not contain valid UTF-8
Happens on both Windows 11 and Ubuntu Linux 20.04 LTS...
Interesting! and of course I run it on macOS. I'll check it out in containers.
Hmmm... it also happens on my MacBook Air 2018 running Monterey...
I wonder if it's because of some locale settings @jblondin
Interesting, I'm on an M1 running Monterey. I'll investigate.
The encoding on this test file seems to be 'Western (Windows 1252)', while the library currently only supports UTF-8.
At some point when I was originally testing this, I apparently re-saved it as UTF-8 prior to running my test, which is why it initially worked for me. I just re-downloaded and re-ran it and am encountering the same error.
@jblondin I encountered the same issue with qsv. To minimize encoding errors, I used https://github.com/BurntSushi/encoding_rs_io to automatically transcode to UTF-8.
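For context, a minimal sketch of that transcoding approach (a hypothetical helper, not qsv's actual code): wrap the file in encoding_rs_io's decoding reader before handing it to the csv reader. The encoding is given explicitly here because the builder's automatic detection only sniffs BOMs:

```rust
use std::error::Error;
use std::fs::File;

use encoding_rs::WINDOWS_1252;
use encoding_rs_io::DecodeReaderBytesBuilder;

fn read_semicolon_csv(path: &str) -> Result<(), Box<dyn Error>> {
    let file = File::open(path)?;

    // Transcode Windows-1252 bytes to UTF-8 on the fly.
    let transcoded = DecodeReaderBytesBuilder::new()
        .encoding(Some(WINDOWS_1252))
        .build(file);

    let mut rdr = csv::ReaderBuilder::new()
        .delimiter(b';')
        .from_reader(transcoded);

    for record in rdr.records() {
        println!("{:?}", record?);
    }
    Ok(())
}
```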
I found the same 😄 I started looking into it yesterday, should have a fix soon
So it doesn't look like it's as simple a fix as just using encoding_rs_io as a transcoder, since that only handles transcoding between UTF-16 and UTF-8. qsv is mostly resilient to encoding woes due to its predominant use of ByteRecords from the csv crate, but does fail on some commands, e.g.
> qsv pseudo -d ';' COD.IMPDR.EXPDR tests/data/semicolon.nonUTF.csv
DIA.DESEMB,COD.SUBITEM.NCM,VMLE.DOLAR.BAL.EXP,PESO.LIQ.MERC.BAL.EXP,COD.IMPDR.EXPDR,NOME.IMPDR.EXPDR,PAIS.ORIGEM.DESTINO,UA.LOCAL.DESBQ.EMBQ,NOME.IMPORTADOR.ESTRANGEIRO,NUM.DDE,NUM.RE
CSV parse error: record 1 (line 1, field: 6, byte: 184): invalid utf-8: invalid UTF-8 in field 6 near byte index 4
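To illustrate the ByteRecord point above, a small sketch (illustrative only, not qsv's code): iterating with byte_records() skips UTF-8 validation and tolerates the Windows-1252 bytes, while records() yields StringRecords and fails with the error shown:

```rust
use std::error::Error;

fn compare_readers(path: &str) -> Result<(), Box<dyn Error>> {
    // ByteRecord path: raw bytes, no UTF-8 validation, so non-UTF-8 fields pass through.
    let mut rdr = csv::ReaderBuilder::new().delimiter(b';').from_path(path)?;
    for record in rdr.byte_records() {
        let record = record?;
        let _field: Option<&[u8]> = record.get(5);
    }

    // StringRecord path: each field is validated as UTF-8, so this returns
    // an "invalid utf-8" error on the first non-UTF-8 field.
    let mut rdr = csv::ReaderBuilder::new().delimiter(b';').from_path(path)?;
    for record in rdr.records() {
        let _record = record?;
    }
    Ok(())
}
```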
I'll dig into this a little more to see if there's a solution.
Hi @jblondin, thanks to your research, I ended up just saying that qsv requires UTF-8 and leaving it at that.
With that assumption, I then fully embraced using from_utf8_unchecked for the extra performance. 😉
That said, it'd be awesome if csv-sniffer can also sniff a csv's encoding!
For now, I'm just going to have csv-sniffer accept the encoding as a parameter.
Sniffing the encoding is a whole other (fascinating) subject; I may add it or spin off another crate with that functionality. My understanding is that it's very heuristic-based and prone to error, but there are definitely tools that do it (Sublime Text, for instance, detects and handles different encodings just fine). Having an encoding sniffer + automatic transcoder might be really useful.
For qsv, an expectation of UTF-8 makes sense 😄 I did notice that you already use from_utf8_unchecked at times (in the stats module, for instance), which I believe is undefined behavior on non-UTF-8 files, but it seems to work fine on this file at least!
Yes. I was really trying to squeeze as much performance as possible from qsv stats, as it's central to the project I'm working on (scanning a CSV file for stats and data types, and then prepopulating metadata about it: data dictionary, frequency table, descriptive stats, JSON Schema) in a CKAN data catalog while users are entering the metadata.
And now that I've decided to embrace the UTF-8 requirement, I'm doing more from_utf8_unchecked throughout qsv, especially in the hot loops.
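As a rough sketch of that pattern (illustrative only, not qsv's actual code), the idea is to skip per-field validation in a hot loop once the input is already known to be valid UTF-8; on arbitrary input this is undefined behavior, as noted above:

```rust
use std::fs::File;

// Counts characters in one column; hypothetical example of skipping UTF-8
// validation in a hot loop over ByteRecords.
fn count_chars(rdr: &mut csv::Reader<File>, col: usize) -> csv::Result<usize> {
    let mut total = 0;
    let mut record = csv::ByteRecord::new();
    while rdr.read_byte_record(&mut record)? {
        if let Some(bytes) = record.get(col) {
            // SAFETY: sound only because the input has already been checked
            // (or is guaranteed) to be valid UTF-8.
            let s = unsafe { std::str::from_utf8_unchecked(bytes) };
            total += s.chars().count();
        }
    }
    Ok(total)
}
```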
As to detecting encoding, I found https://github.com/thuleqaid/rust-chardet, which is inspired by https://pypi.org/project/chardet/.
It seems promising; even reqwest at one point considered using it, though the incompatible license ruled it out, and the project now looks unmaintained.
I also found chardetng, but it's targeted at web use, as the character encoding detector of Firefox (I found this write-up fascinating!)
Perhaps you can leverage it for csv-sniffer?
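In case it helps, a minimal sketch of what leveraging chardetng could look like (an assumption about how it might be wired up, not csv-sniffer's API): feed the raw bytes to an EncodingDetector, then decode with the guessed encoding_rs encoding:

```rust
use std::error::Error;
use std::fs;

use chardetng::EncodingDetector;

fn detect_and_decode(path: &str) -> Result<String, Box<dyn Error>> {
    let bytes = fs::read(path)?;

    let mut detector = EncodingDetector::new();
    detector.feed(&bytes, true); // `true`: no more input will follow
    let encoding = detector.guess(None, true); // no TLD hint, allow UTF-8

    // Decode with encoding_rs; replacement characters mark undecodable bytes.
    let (text, _used_encoding, had_errors) = encoding.decode(&bytes);
    if had_errors {
        eprintln!("warning: some bytes could not be decoded cleanly");
    }
    Ok(text.into_owned())
}
```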