Closed Ed-Morton closed 8 months ago
I asked Brian Kernighan for his thoughts about this. Here is his response, quoted by permission.
Date: Wed, 22 Nov 2023 08:01:17 -0500 (EST)
From: Brian Kernighan <bwk@....>
Subject: Re: CSV output
Hi --
I'm pretty skeptical about preserving the it-was-quoted status of
input fields, since it seems likely to add distributed complexity
and offeres a chance to create more dark corners, for little
profit.
Does the "quoted" status persist when the value of a field is
changed? How about when it is copied? When it's used in a string
concatenation? When $0 is reconstituted from $1 ... $NF? While
one could legitimately argue either way on such things, each would
require more code and explanation.
It's reminiscent of other intricate states, starting with string
vs number vs both, then OFMT vs CONVFMT, and of course the
internal states that try to cope with memory management.
On the flip side, this rule is easy to state and explain:
quotes are removed on input
quotes can be added when needed with a 2-line function
Color me opposed, I think.
Brian
Closing this issue. Thanks everyone.
Thanks Arnold. See my responses inline below.
On 11/23/2023 8:02 AM, Arnold Robbins wrote:
I asked Brian Kernighan for his thoughts about this. Here is his response, quoted by permission. Date: Wed, 22 Nov 2023 08:01:17 -0500 (EST) From: Brian Kernighan bwk@.... Subject: Re: CSV output
Hi --
I'm pretty skeptical about preserving the it-was-quoted status of input fields,
Agreed, I was never suggesting that.
since it seems likely to add distributed complexity and offeres a chance to create more dark corners, for little profit.
Does the "quoted" status persist when the value of a field is changed?
With my proposal a quoted field would be treated no differently from any other field, there's no status associated with it. When the input is read, quotes are stripped or not, that's all. From then on whether the field is foo
or "foo"
is completely irrelevant, the same as if such fields were read without --csv
.
How about when it is copied? When it's used in a string concatenation? When $0 is reconstituted from $1 ... $NF?
Again, none of that matters. Fields that start and end with a quote are just another field, treated the same way as every other field. If the input was:
this,"foo,bar",that
and $0 is reconstituted from $1 ... $NF then it simply becomes:
this,"foo,bar",that
If a field like "foo,bar"
is concatenated with a string whatever
then the result is "foo,bar"whatever
. If the user wants to do anything else with that, e.g. for printing as part of a valid CSV, they are free to do so.
While one could legitimately argue either way on such things, each would require more code and explanation.
It's reminiscent of other intricate states, starting with string vs number vs both, then OFMT vs CONVFMT, and of course the internal states that try to cope with memory management.
No, this is very simple and requires almost no explanation and no extra code to do anything, it's just business as usual.
On the flip side, this rule is easy to state and explain:
quotes are removed on input quotes can be added when needed with a 2-line function
My suggestion is also easy to state and explain:
when using --csv quotes are removed on input
when using --csvq quotes are not removed on input
Yes, with the current approach quotes can be added when needed but they can't easily be added in a way that duplicates which fields did/didn't have quotes in the input so I can't just do:
echo 'foo,"bar",etc' | awk --csv -v OFS=',' '{print $1, $2}'
or similar and get output of:
foo,"bar"
The best I can come up with to make that work, taking advantage of --csv
to handle newlines within quoted fields correctly, would be:
echo 'foo,"bar",etc' |
awk --csv -v OFS=',' '{
tail = $0
$0 = ""
nf = 0
while ( (tail != "") && match(tail,/([^,]*)|("([^"]*|"")*")/) ) {
$(++nf) = substr(tail,RSTART,RLENGTH)
tail = substr(tail,RSTART+RLENGTH+1)
}
print $1, $2
}'
foo,"bar"
which I don't think the average user would find intuitively obvious and after using it they'd have to be careful not to change $0
or they'd have to resplit it like above all over again.
People have been writing tools to work on whatever subset of CSV they use in their applications for 50 years - whatever the awk output is passed to for further processing simply may not be able to handle all fields being quoted, or any other algorithm is implemented to make the output CSV valid and it's reasonable for users to expect a simple print $1, $2
to output the fields quoted as they were in the input.
Color me opposed, I think.
Brian
Closing this issue. Thanks everyone.
That's too bad as I think from Brian's response above that my suggestion was misunderstood as being far more complicated than it actually is.
Ed.
I originally reported this at bug-gawk as I came across the issue while using gawk 5.3 which includes CSV handling now (thanks by the way!) and Arnold suggested I post it here to reach a wider audience of awk implementers:
Someone posted a question on stackoverflow about how to print just the first 2 fields from a CSV so given this input:
the expected output would be:
I thought I'd answer with
--csv
but when I tried it I got this output:The quotes around the fields that need to be quoted (and were quoted in the input) are missing and the escaped double quotes (
""
) around the firstbar
have become individual ("
) so the output is no longer valid CSV.I could get it back to valid CSV and produce the expected output by writing this or similar:
but that's counter-intuitive and frustrating to have to write, doesn't necessarily reproduce the quoting from the input (
1
may have been"1"
in the input) and I think many users wouldn't know how to, or understand why they need to, write that code to get valid CSV output.I understand there is a benefit to stripping double quotes for working on field contents and I appreciate that you need to make this work with existing functionality (
OFS
values, etc.) so I understand why--csv
can't simply always output valid CSV and I also understand the "don't provide constructs to do things that are easy to do with existing constructs" awk mantra to avoid code bloat, but there has to be a way to make it easier for people to just print a couple of fields from valid CSV input and have the output still be valid CSV.If there was a way to have
--csv
optionally NOT strip double quotes when reading the fields then that'd solve the problem, e.g.--csv=q
or--csvq
or similar to indicate quotes in and around fields should be retained. If we had that then I could write something like:or, less desirably as it's longer, can't be set on the command line and may lead people to think they can switch modes mid-processing which should be avoided IMO as could have messy results and be hard to implement, but would be better than nothing, using an awk that supports
PROCINFO[]
such as gawk:to get the desired output above and there are almost certainly other use cases for people wanting to retain the quotes and there is no simple alternative today (not using
--csv
but instead settingFPAT
, if supported as in gawk, and counting double quotes to know if a newline is inside or outside of a field, and adding lines to$0
until you have a complete record).I don't think that would be hard for users to understand or result in language bloat or introduce any additional complexity working with existing constructs - you simply wouldn't strip quotes when reading the input and so they'd still be there when producing output.