mithrandie / csvq

SQL-like query language for csv
https://mithrandie.github.io/csvq
MIT License
1.5k stars 65 forks source link

"CREATE TABLE" does not append newline to header row in new CSV file. #22

Closed derekmahar closed 4 years ago

derekmahar commented 4 years ago

In the CSV file that it creates, csvq SQL statement CREATE TABLE does not append a newline character to the header that it generates in the first row:

$ csvq --repository /tmp "CREATE TABLE test(a, b, c)"
file "/tmp/test" is created.
Commit: file "/tmp/test" is created.
$ cat /tmp/test
a,b,c$ rm /tmp/test

Note that the $ prompt immediately follows the last column name in the header instead of appearing on the next line because CREATE TABLE does not append a newline character to the header row.

As a consequence of this issue, should a program other than csvq append a data row to the new table without first appending a newline to the header (or prepending a newline to the data row), the data row will immediately follow the header on the same row:

$ csvq --repository /tmp "CREATE TABLE test(a, b, c)"
file "/tmp/test" is created.
Commit: file "/tmp/test" is created.
$ printf "1,2,3\n" >> /tmp/test
$ cat /tmp/test
a,b,c1,2,3
$ rm /tmp/test

A simple workaround to this issue is to append a newline to the new CSV file after CREATE TABLE creates it, but before appending any data rows:

$ csvq --repository /tmp "CREATE TABLE test(a, b, c)"
file "/tmp/test" is created.
Commit: file "/tmp/test" is created.
$ echo >> /tmp/test
$ printf "1,2,3\n" >> /tmp/test
$ cat /tmp/test
a,b,c
1,2,3
$ rm /tmp/test

Another solution would be to prepend each new data row with a newline, but omit the trailing newline:

$ csvq --repository /tmp "CREATE TABLE test(a, b, c)"
file "/tmp/test" is created.
Commit: file "/tmp/test" is created.
$ printf "\n1,2,3" >> /tmp/test
$ printf "\n4,5,6" >> /tmp/test
$ cat /tmp/test
a,b,c
1,2,3
4,5,6$ rm /tmp/test

Consistent with CREATE TABLE, SQL statement INSERT INTO also follows this convention of prepending each data row with newline:

$ csvq --repository /tmp "CREATE TABLE test(a, b, c)"
file "/tmp/test" is created.
Commit: file "/tmp/test" is created.
$ csvq --repository /tmp "INSERT INTO test VALUES (1,2,3)"
1 record inserted on "/tmp/test".
Commit: file "/tmp/test" is updated.
$ csvq --repository /tmp "INSERT INTO test VALUES (4,5,6)"
1 record inserted on "/tmp/test".
Commit: file "/tmp/test" is updated.
$ cat /tmp/test
a,b,c
1,2,3
4,5,6$ rm /tmp/test

To avoid this issue altogether, instead of using CREATE TABLE to create the CSV file and write the header row, we could "manually" create the file and write the header row with a trailing newline:

$ printf "a,b,c\n" > /tmp/test
$ printf "1,2,3\n" >> /tmp/test
$ cat /tmp/test
a,b,c
1,2,3
$ rm /tmp/test

There are several disadvantages to this manual approach:

  1. Redirection overwrites an existing CSV file
  2. Redirection cannot be applied to another SQL database like SQLite
  3. Does not benefit from transaction management
  4. Requires a code change to change the raw CSV header structure

The newline convention that csvq SQL statements CREATE TABLE and INSERT INTO follows, while internally consistent, is not consistent with external shell commands.

mithrandie commented 4 years ago

First, the current behavior of csvq is as follows.

In the csvq, SELECT query is positioned as a statement that outputs the result and passes it to the outside. On the other hand, other queries do not do such a special processing because these queries complete operations within a transaction of csvq and do not link directly with the outside.

If we strictly follow the posix standard, update queries should also append line breaks at the end of the files. However, line breaks are now often used as line separators rather than line endings, and I can't decide which behavior is better. At least, I think that adding a line break at the end of a file if needed is easier than removing a line break at the end of a file when not need.

By the way, I don’t understand exactly what you want to do. Transaction management of csvq affects only operations with in the csvq processing, and the external commands should not be relevant.

derekmahar commented 4 years ago

If we strictly follow the posix standard, update queries should also append line breaks at the end of the files.

Does the POSIX standard specify this as a convention? RFC 4180 states that a line break (CRLF) is optional on the last line of a CSV file, but doesn't specify a line break is optional for the header row. csvq follows the former convention, but it's unclear whether csvq violates the latter.

However, line breaks are now often used as line separators rather than line endings, and I can't decide which behavior is better.

What is the difference between line separator and line ending in this context? What newline convention do other CVS processing tools follow? Is csvq consistent with these tools?

At least, I think that adding a line break at the end of a file if needed is easier than removing a line break at the end of a file when not need.

Yes, I agree.

By the way, I don’t understand exactly what you want to do. Transaction management of csvq affects only operations with in the csvq processing, and the external commands should not be relevant.

I guess my point about transaction management is not relevant here because my examples mix csvq and external shell operations which are not subject to csvq transactions. However, I think the other disadvantages that I list are relevant.

derekmahar commented 4 years ago

What newline convention do other CVS processing tools follow? Is csvq consistent with these tools?

Miller doesn't complain about empty CSV files that csvq creates:

$ csvq --repository /tmp "CREATE TABLE test(a, b, c)"
file "/tmp/test" is created.
Commit: file "/tmp/test" is created.
$ mlr --csv cat /tmp/test
$ rm /tmp/test

Miller can also process non-empty CSV files that csvq creates, though unlike csvq, Miller appends a trailing line break to the last data row:

$ csvq --repository /tmp "CREATE TABLE test(a, b, c)"
file "/tmp/test" is created.
Commit: file "/tmp/test" is created.
$ csvq --repository /tmp "INSERT INTO test VALUES (1,2,3)"
1 record inserted on "/tmp/test".
Commit: file "/tmp/test" is updated.
$ mlr --csv cat /tmp/test
a,b,c
1,2,3
$ rm /tmp/test

Unlike csvq SQL command CREATE TABLE, it seems that Miller has no mechanism for creating a new CSV file.

mithrandie commented 4 years ago

The POSIX standard defines about a line as follows.

https://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap03.html#tag_03_206

3.206 Line

A sequence of zero or more non- characters plus a terminating character.

The RFC 4180 defines about csv format as follows.

https://tools.ietf.org/html/rfc4180

  1. The last record in the file may or may not have an ending line break.

It’s header Line is as normal lines.

  1. There maybe an optional header line appearing as the first line of the file with the same format as normal record lines.

Csv is used by various tools including GUI as well as shell commands. It is not uncommon for users to write related processes themselves. It can’t be fitted all tools, and I think the important thing is internal consistency.

derekmahar commented 4 years ago

The POSIX standard defines about a line as follows.

https://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap03.html#tag_03_206

3.206 Line A sequence of zero or more non- characters plus a terminating character.

Okay, then it seems as you pointed out earlier, csvq does not follow this convention.

It’s header Line is as normal lines.

  1. There maybe an optional header line appearing as the first line of the file with the same format as normal record lines.

I must have missed this point in RFC 4180. csvq also treats the header line as a "normal" line so it apparently does follow the line break conventions of RFC 4180.

Csv is used by various tools including GUI as well as shell commands. It is not uncommon for users to write related processes themselves.

Users like me, for example. :)

It can’t be fitted all tools, and I think the important thing is internal consistency.

Yes, I agree, but I think a CSV tool should at least try to follow the conventions that most tools follow. While csvq (and RFC 4180) violate the POSIX line definition, csvq fortunately does follow the line break conventions of RFC 4180 which is arguably more important.

derekmahar commented 4 years ago

csvkit also doesn't complain about CSV files that csvq creates:

$ csvq --repository /tmp "CREATE TABLE test(a, b, c)"
$ cat /tmp/test | csvclean
No errors.
$ cat /tmp/test | csvlook
| a | b | c |
| - | - | - |
$ csvq --repository /tmp "INSERT INTO test VALUES (1,2,3)"
1 record inserted on "/tmp/test".
Commit: file "/tmp/test" is updated.
$ cat /tmp/test | csvclean
No errors.
$ cat /tmp/test | csvlook
|    a | b | c |
| ---- | - | - |
| True | 2 | 3 |
$ rm /tmp/test

By default, csvkit command csvformat appends a line break to header and data rows in a CSV file:

$ csvq --repository /tmp "CREATE TABLE test(a, b, c)"
file "/tmp/test" is created.
Commit: file "/tmp/test" is created.
$ cat /tmp/test | csvformat
a,b,c
$ csvq --repository /tmp "INSERT INTO test VALUES (1,2,3)"
1 record inserted on "/tmp/test".
Commit: file "/tmp/test" is updated.
$ cat /tmp/test | csvformat
a,b,c
1,2,3
$ rm /tmp/test

Like Miller, csvkit doesn't have a command to create a CSV file with a given a header.

derekmahar commented 4 years ago

csvtool also doesn't complain about CSV files that csvq generates, though like Miller and csvkit, it appends a line break to header and data rows:

$ csvq --repository /tmp "CREATE TABLE test(a, b, c)"
file "/tmp/test" is created.
Commit: file "/tmp/test" is created.
$ csvtool cat /tmp/test
a,b,c
$ csvq --repository /tmp "INSERT INTO test VALUES (1,2,3)"
1 record inserted on "/tmp/test".
Commit: file "/tmp/test" is updated.
$ csvtool cat /tmp/test
$ rm /tmp/test
derekmahar commented 4 years ago
mithrandie commented 4 years ago

I changed my mind. The following changes will be implemented in the next release.

derekmahar commented 4 years ago

Thank you. This idea is a sensible compromise.

derekmahar commented 4 years ago

Thank you for this fix!