Closed derekmahar closed 4 years ago
First, the current behavior of csvq is as follows.
In csvq, a SELECT query is treated as a statement that outputs its result and passes it to the outside. Other queries get no such special processing because they complete their operations within a csvq transaction and do not interact directly with the outside.
If we strictly follow the POSIX standard, update queries should also append line breaks at the end of files. However, line breaks are now often used as line separators rather than line endings, and I can't decide which behavior is better. At least, I think that adding a line break at the end of a file when needed is easier than removing one when it is not needed.
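The "add a line break only when needed" step can be done from the shell without csvq. A minimal sketch, assuming a POSIX shell; the function name `ensure_trailing_newline` and the path `/tmp/test_nl` are hypothetical and not part of csvq:

```shell
# Hypothetical helper: append a newline to a file only when its last byte is not one.
ensure_trailing_newline() {
  # Command substitution strips a trailing newline, so the result is
  # non-empty only when the last byte is something other than a newline.
  if [ -n "$(tail -c 1 "$1")" ]; then
    printf '\n' >> "$1"
  fi
}

printf 'a,b,c' > /tmp/test_nl         # simulate a header written without a line break
ensure_trailing_newline /tmp/test_nl  # appends the missing newline
ensure_trailing_newline /tmp/test_nl  # no-op: the file already ends with a newline
```

Because the check is idempotent, it is safe to run before every external append. An empty file is left untouched.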
By the way, I don’t understand exactly what you want to do. Transaction management in csvq affects only operations within csvq processing, and external commands should not be relevant.
If we strictly follow the POSIX standard, update queries should also append line breaks at the end of the files.
Does the POSIX standard specify this as a convention? RFC 4180 states that a line break (CRLF) is optional on the last line of a CSV file, but doesn't specify that a line break is optional for the header row. csvq follows the former convention, but it's unclear whether csvq violates the latter.
However, line breaks are now often used as line separators rather than line endings, and I can't decide which behavior is better.
What is the difference between a line separator and a line ending in this context? What newline convention do other CSV processing tools follow? Is csvq consistent with these tools?
At least, I think that adding a line break at the end of a file when needed is easier than removing one when it is not needed.
Yes, I agree.
By the way, I don’t understand exactly what you want to do. Transaction management in csvq affects only operations within csvq processing, and external commands should not be relevant.
I guess my point about transaction management is not relevant here because my examples mix csvq and external shell operations, which are not subject to csvq transactions. However, I think the other disadvantages that I list are relevant.
What newline convention do other CSV processing tools follow? Is csvq consistent with these tools?
Miller doesn't complain about empty CSV files that csvq creates:
$ csvq --repository /tmp "CREATE TABLE test(a, b, c)"
file "/tmp/test" is created.
Commit: file "/tmp/test" is created.
$ mlr --csv cat /tmp/test
$ rm /tmp/test
Miller can also process non-empty CSV files that csvq creates, though unlike csvq, Miller appends a trailing line break to the last data row:
$ csvq --repository /tmp "CREATE TABLE test(a, b, c)"
file "/tmp/test" is created.
Commit: file "/tmp/test" is created.
$ csvq --repository /tmp "INSERT INTO test VALUES (1,2,3)"
1 record inserted on "/tmp/test".
Commit: file "/tmp/test" is updated.
$ mlr --csv cat /tmp/test
a,b,c
1,2,3
$ rm /tmp/test
Unlike csvq with its SQL command CREATE TABLE, Miller seems to have no mechanism for creating a new CSV file.
The POSIX standard defines a line as follows.
https://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap03.html#tag_03_206
3.206 Line
A sequence of zero or more non-<newline> characters plus a terminating <newline> character.
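One practical consequence of this definition: tools that count lines by counting newline characters, such as `wc -l`, do not count a final record that lacks its terminator. A small sketch, with made-up file names:

```shell
printf 'a,b,c\n1,2,3'   > /tmp/unterminated.csv  # last record has no line break
printf 'a,b,c\n1,2,3\n' > /tmp/terminated.csv    # every record ends with a line break

# wc -l counts newline characters; tr strips the padding some wc versions add
unterminated_lines=$(wc -l < /tmp/unterminated.csv | tr -d ' ')
terminated_lines=$(wc -l < /tmp/terminated.csv | tr -d ' ')
echo "$unterminated_lines $terminated_lines"     # prints: 1 2
```

Both files hold the same two records, but only the second one consists entirely of lines in the POSIX sense.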
RFC 4180 defines the CSV format as follows.
https://tools.ietf.org/html/rfc4180
- The last record in the file may or may not have an ending line break.
Its header line is treated the same as normal lines.
- There maybe an optional header line appearing as the first line of the file with the same format as normal record lines.
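A reader that accepts both forms allowed by the first rule quoted above (last record with or without an ending line break) can be sketched with a common shell idiom. This assumes simple records with no embedded line breaks or quoted fields; `count_records` and the file paths are hypothetical names:

```shell
# Hypothetical helper: count records whether or not the last one ends with a line break.
count_records() {
  n=0
  # read returns non-zero on an unterminated final line but still fills $line,
  # so the extra -n test keeps that last record from being dropped.
  while IFS= read -r line || [ -n "$line" ]; do
    n=$((n + 1))
  done < "$1"
  echo "$n"
}

printf 'a,b,c\n1,2,3'   > /tmp/rfc_a.csv   # no ending line break
printf 'a,b,c\n1,2,3\n' > /tmp/rfc_b.csv   # with ending line break
count_records /tmp/rfc_a.csv
count_records /tmp/rfc_b.csv
```

Both calls report the same count, since the header is read like any other record line.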
CSV is used by a wide variety of tools, including GUI applications as well as shell commands, and it is not uncommon for users to write related processing themselves. csvq can’t match every tool, so I think the important thing is internal consistency.
The POSIX standard defines a line as follows.
https://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap03.html#tag_03_206
3.206 Line A sequence of zero or more non-<newline> characters plus a terminating <newline> character.
Okay, then it seems, as you pointed out earlier, that csvq does not follow this convention.
Its header line is treated the same as normal lines.
- There maybe an optional header line appearing as the first line of the file with the same format as normal record lines.
I must have missed this point in RFC 4180. csvq also treats the header line as a "normal" line, so it apparently does follow the line break conventions of RFC 4180.
CSV is used by a wide variety of tools, including GUI applications as well as shell commands, and it is not uncommon for users to write related processing themselves.
Users like me, for example. :)
csvq can’t match every tool, so I think the important thing is internal consistency.
Yes, I agree, but I think a CSV tool should at least try to follow the conventions that most tools follow. While csvq (and RFC 4180) violate the POSIX line definition, csvq fortunately does follow the line break conventions of RFC 4180, which is arguably more important.
csvkit also doesn't complain about CSV files that csvq creates:
$ csvq --repository /tmp "CREATE TABLE test(a, b, c)"
$ cat /tmp/test | csvclean
No errors.
$ cat /tmp/test | csvlook
| a | b | c |
| - | - | - |
$ csvq --repository /tmp "INSERT INTO test VALUES (1,2,3)"
1 record inserted on "/tmp/test".
Commit: file "/tmp/test" is updated.
$ cat /tmp/test | csvclean
No errors.
$ cat /tmp/test | csvlook
| a | b | c |
| ---- | - | - |
| True | 2 | 3 |
$ rm /tmp/test
By default, the csvkit command csvformat appends a line break to header and data rows in a CSV file:
$ csvq --repository /tmp "CREATE TABLE test(a, b, c)"
file "/tmp/test" is created.
Commit: file "/tmp/test" is created.
$ cat /tmp/test | csvformat
a,b,c
$ csvq --repository /tmp "INSERT INTO test VALUES (1,2,3)"
1 record inserted on "/tmp/test".
Commit: file "/tmp/test" is updated.
$ cat /tmp/test | csvformat
a,b,c
1,2,3
$ rm /tmp/test
Like Miller, csvkit doesn't have a command to create a CSV file with a given header.
csvtool also doesn't complain about CSV files that csvq generates, though like Miller and csvkit, it appends a line break to header and data rows:
$ csvq --repository /tmp "CREATE TABLE test(a, b, c)"
file "/tmp/test" is created.
Commit: file "/tmp/test" is created.
$ csvtool cat /tmp/test
a,b,c
$ csvq --repository /tmp "INSERT INTO test VALUES (1,2,3)"
1 record inserted on "/tmp/test".
Commit: file "/tmp/test" is updated.
$ csvtool cat /tmp/test
$ rm /tmp/test
- Although csvq follows a different line break convention than other popular CSV processing tools, these tools can process files that csvq creates.
- Although csvq doesn't follow the line definition in the POSIX standard, it does follow the CSV structure that RFC 4180 specifies.
csvq follows.
I changed my mind. The following changes will be implemented in the next release.
Thank you. This idea is a sensible compromise.
Thank you for this fix!
In the CSV file that it creates, the csvq SQL statement CREATE TABLE does not append a newline character to the header that it generates in the first row:
Note that the $ prompt immediately follows the last column name in the header instead of appearing on the next line because CREATE TABLE does not append a newline character to the header row.
As a consequence of this issue, should a program other than csvq append a data row to the new table without first appending a newline to the header (or prepending a newline to the data row), the data row will immediately follow the header on the same row:
A simple workaround to this issue is to append a newline to the new CSV file after CREATE TABLE creates it, but before appending any data rows:
Another solution would be to prepend each new data row with a newline, but omit the trailing newline:
Consistent with CREATE TABLE, the SQL statement INSERT INTO also follows this convention of prepending each data row with a newline:
To avoid this issue altogether, instead of using CREATE TABLE to create the CSV file and write the header row, we could "manually" create the file and write the header row with a trailing newline:
There are several disadvantages to this manual approach:
The newline convention that the csvq SQL statements CREATE TABLE and INSERT INTO follow, while internally consistent, is not consistent with external shell commands.
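The behavior described in this report can be reproduced without csvq. The sketch below simulates the file contents with `printf`, assuming (as described above) that CREATE TABLE writes the header with no trailing newline; the paths `/tmp/test_sim` and `/tmp/test_sim2` are made up for illustration:

```shell
printf 'a,b,c' > /tmp/test_sim       # stands in for the header CREATE TABLE writes
echo '1,2,3' >> /tmp/test_sim        # an external program appends a data row
fused=$(cat /tmp/test_sim)
echo "$fused"                        # prints: a,b,c1,2,3 (row fused with the header)

# Workaround 1: terminate the header before appending any rows
printf 'a,b,c' > /tmp/test_sim
echo >> /tmp/test_sim
echo '1,2,3' >> /tmp/test_sim

# Workaround 2: prepend the newline to each row and omit the trailing one
printf 'a,b,c' > /tmp/test_sim2
printf '\n1,2,3' >> /tmp/test_sim2
```

Both workarounds produce a header row followed by one data row. They differ only in whether the file ends with a newline, which is the separator-versus-terminator distinction discussed in this thread.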