Open sahus22 opened 4 years ago
Changed the encoding to Windows-1252, and it works fine now,
q -H -d ',' "select * from <path to csv>" -O -E 'Windows-1252'
That didn't work.
Trying a different file with Windows-1252 encoding gives the error 'charmap' codec can't encode character '\ufeff' in position 1
And if I try with some other encoding the extra quotes reappear.
I'm using Windows 10 and PowerShell btw.
Hi, this is interesting. q cannot detect the actual encoding itself, so it's natural that it couldn't work well without explicitly stating that the encoding is windows-1252.
However, the error you've written in the second post hinted that this file has what is called a BOM (Byte Order Mark) - The FEFF hex data at the beginning. I've actually implemented a hack inside q that works around a bug in python that prevents reading csv files with a BOM, but I've only implemented it for utf8 files (The hack is activated when the user uses the utf-8-sig
encoding).
It seems that the BOM can happen for all encodings (and contains different bytes for each encoding), which means that I would need to separate the hack from UTF8, and allow users to apply it to any encoding they choose.
If your file effectively contains only real ASCII characters (e.g. no foreign language, no text-border characters, etc.), then it could be interpreted as utf-8 (since utf8 chars that are smaller than 128 are compatible with ASCII encoding). In that case, running it with -e utf-8-sig
should work. Otherwise, I'm sorry but for now I can't think of a workaround, except removing the first bytes manually from the file using a hex editor (the exact number of bytes to remove depends on the encoding itself).
I will change the title of this issue to reflect the need for separating the BOM treatment from the encoding.
Harel
Thanks for the reply. Also, saving the csv without BOM removed this error.
I have a csv like:
When I try to use q to select this with this command
q -H -d ',' "select * from <path to csv>" -O
, the output comes up as:As you can see there are multiple quotes being added to the first header. Am I doing something wrong here?