harelba / q

q - Run SQL directly on delimited files and multi-file sqlite databases
http://harelba.github.io/q/
GNU General Public License v3.0

It should be possible to handle BOMs for all encodings, not just utf-8 #217

Open sahus22 opened 4 years ago

sahus22 commented 4 years ago

I have a csv like:

"a","b","c"
"1","2","3"

When I try to select this file with the command q -H -d ',' "select * from <path to csv>" -O, the output comes out as:

"""a""",b,c
1,2,3

As you can see, extra quotes are being added to the first header field. Am I doing something wrong here?

sahus22 commented 4 years ago

Changing the encoding to Windows-1252 made it work: q -H -d ',' "select * from <path to csv>" -O -E 'Windows-1252'

sahus22 commented 4 years ago

That didn't work after all. Trying a different file with the Windows-1252 encoding gives the error 'charmap' codec can't encode character '\ufeff' in position 1.
And if I try some other encoding, the extra quotes reappear.

I'm using Windows 10 and PowerShell btw.

harelba commented 4 years ago

Hi, this is interesting. q cannot detect the actual encoding by itself, so it's natural that it wouldn't work well without explicitly stating that the encoding is Windows-1252.

However, the error in your second post hints that the file starts with what is called a BOM (Byte Order Mark) - the bytes that encode U+FEFF at the beginning of the file. I've actually implemented a hack inside q that works around a bug in Python that prevents reading CSV files with a BOM, but I've only implemented it for UTF-8 files (the hack is activated when the user chooses the utf-8-sig encoding).
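The extra quotes from the original report are exactly this effect. The sketch below (a hypothetical reproduction, not q's actual code; the sample line mirrors the reporter's header) shows how a BOM decoded as plain utf-8 glues U+FEFF onto the first field, which then stops the csv reader from recognizing the quoting:

```python
import csv
import io

# A UTF-8 BOM decoded as plain utf-8 becomes U+FEFF stuck to the first field.
raw = '\ufeff"a","b","c"\n'

row = next(csv.reader(io.StringIO(raw)))
# Because the first field no longer *starts* with the quote character, the csv
# reader treats it as unquoted and keeps the quote characters literally:
print(row)  # ['\ufeff"a"', 'b', 'c']

# Writing that row back doubles the embedded quotes, producing the
# '"""a""",b,c'-looking output (the BOM itself is invisible in a terminal).
out = io.StringIO()
csv.writer(out, lineterminator="\n").writerow(row)
print(out.getvalue())
```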

It seems that a BOM can appear with many encodings (and consists of different bytes for each encoding), which means I would need to decouple the hack from UTF-8 and allow users to apply it to whichever encoding they choose.

If your file effectively contains only real ASCII characters (e.g. no foreign-language or box-drawing characters), then it can be interpreted as UTF-8, since UTF-8 encodes code points below 128 identically to ASCII. In that case, running it with -e utf-8-sig should work. Otherwise, I'm sorry, but for now I can't think of a workaround other than removing the first few bytes manually with a hex editor (the exact number of bytes to remove depends on the encoding).
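For the ASCII-only case, the difference between utf-8 and utf-8-sig is easy to see (the byte string below is a made-up sample, not the reporter's file):

```python
# Hypothetical bytes: an ASCII-only CSV header preceded by the UTF-8 BOM.
bom_csv = b'\xef\xbb\xbf"a","b","c"\n'

# Decoding as plain utf-8 keeps U+FEFF as the first character of the text...
assert bom_csv.decode("utf-8")[0] == "\ufeff"

# ...while utf-8-sig strips the BOM transparently, so the header parses cleanly.
assert bom_csv.decode("utf-8-sig").startswith('"a"')
```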

I will change the title of this issue to reflect the need for separating the BOM treatment from the encoding.

Harel

https://en.wikipedia.org/wiki/Byte_order_mark

sahus22 commented 4 years ago

Thanks for the reply. Also, saving the CSV without a BOM removed this error.