harelba / q

q - Run SQL directly on delimited files and multi-file sqlite databases
http://harelba.github.io/q/
GNU General Public License v3.0
10.15k stars 419 forks source link

Additional double quote escaping issue #236

Closed uahim closed 3 years ago

uahim commented 3 years ago

[edited text and headline]

I am not quite getting the --disable-double-double-quoting and --disable-escaped-double-quoting options. My simple database has e.g.

McGuiness Flint;When I'm Dead And Gone/Lazy Afternoon;7";1970

And when I run any query, the 7" is displayed as"7""" q -d";" -E"UTF-8" -D" - " "select * from records.csv" (My version is 2.0.9 on Windows)

Tried wrapping the database and escaping the double quote like this "McGuiness Flint";"When I'm Dead And Gone/Lazy Afternoon";"7\"";"1970" to no avail. Tried both aforementioned options including a combination of them but no luck either.

What am I doing wrong?

Thanks

harelba commented 3 years ago

Hi, sorry for the late reply,

The reason you're getting this is because the default output-wrapping mode for q is minimal (there are none, nonnumeric and all as well).

If you add -W none to the command, the output will be as you expect it to be:

# Added `-W none` as a parameter to the command you sent
$ q -d";" -E"UTF-8" -D" - " -W none "select * from records.csv"
McGuiness Flint - When I'm Dead And Gone/Lazy Afternoon - 7" - 1970
$

I'll try to explain why.

minimal output-wrapping mode means that whenever there's a string which contains real double-quotes, it will be automatically wrapped as a double-quoted string (e.g. if the string contains blah"blah it will be wrapped to become "blah"blah". However, doing this will lead to an error when reading the file, as no program will be able to understand that the original " in the middle is part of the string (it will think that the string ends after the first blah).

This is why the csv standard says that whenever there's a double-quote inside a string, it needs to be ""escaped"" by a double-double-quote (e.g. ""). This will allow systems to differentiate between the end of a string and just a double-quote character.

Because q's default output-wrapping mode is minimal, this is exactly what happens to your 7" string - it is both wrapped in double quotes, and the double-quote inside the string is escaped, leading to the seemingly-odd value "7""".

However, this is a legitimate csv value which can be read properly later on. Let's push the output of the command to another q command and see what happens. I'm piping the output to another q command which just takes column 3 and prints it. I'm using -W none on that second q command so we can see the "bare value" of the column.

# First part is identical to the command you sent.
$ q -d";" -E"UTF-8" -D" - " "select * from records.csv" -D ","  | q -d ',' "select c3 from -" -W none
7"
$

This is of course just a test command to see that the output csv can be read again properly. In your case, adding -W none to your original command would create the output you expect. Note that there are some tools and systems which create/read csv which do not adhere fully to those standards, so if you want to export/import data to other systems or tools, make sure to use the proper -w/-W flags that will match the relevant tool.

Hope this helps, please write to me with any further issues or questions.

Harel