We need a mode that preserves double quotes around and within fields.

Ed-Morton commented 8 months ago

I originally reported this at bug-gawk as I came across the issue while using gawk 5.3 which includes CSV handling now (thanks by the way!) and Arnold suggested I post it here to reach a wider audience of awk implementers:

Someone posted a question on stackoverflow about how to print just the first 2 fields from a CSV so given this input:

"foo,""bar""",2,3
1,"foo,bar",3
1,"foo,
bar",3

the expected output would be:

"foo,""bar""",2
1,"foo,bar"
1,"foo,
bar"

I thought I'd answer with --csv but when I tried it I got this output:

$ awk --csv -v OFS=',' '{print $1, $2}' file.csv
foo,"bar",2
1,foo,bar
1,foo,
bar

The quotes around the fields that need to be quoted (and were quoted in the input) are missing and the escaped double quotes ("") around the first bar have become individual (") so the output is no longer valid CSV.

I could get it back to valid CSV and produce the expected output by writing this or similar:

$ awk --csv -v OFS=',' '{for (i=1; i<=NF; i++) { gsub(/"/,"\"\"",$i); if ($i ~ /[,\n"]/) { $i="\"" $i "\""} }; print
$1, $2}' file.csv
"foo,""bar""",2
1,"foo,bar"
1,"foo,
bar"

but that's counter-intuitive and frustrating to have to write, doesn't necessarily reproduce the quoting from the input (1 may have been "1" in the input) and I think many users wouldn't know how to, or understand why they need to, write that code to get valid CSV output.

I understand there is a benefit to stripping double quotes for working on field contents and I appreciate that you need to make this work with existing functionality (OFS values, etc.) so I understand why --csv can't simply always output valid CSV and I also understand the "don't provide constructs to do things that are easy to do with existing constructs" awk mantra to avoid code bloat, but there has to be a way to make it easier for people to just print a couple of fields from valid CSV input and have the output still be valid CSV.

If there was a way to have --csv optionally NOT strip double quotes when reading the fields then that'd solve the problem, e.g. --csv=q or --csvq or similar to indicate quotes in and around fields should be retained. If we had that then I could write something like:

 awk --csv=q -v OFS=',' '{print $1, $2}' file.csv

or, less desirably as it's longer, can't be set on the command line and may lead people to think they can switch modes mid-processing which should be avoided IMO as could have messy results and be hard to implement, but would be better than nothing, using an awk that supports PROCINFO[] such as gawk:

 awk --csv -v OFS=',' 'BEGIN{PROCINFO["CSV"]="q"} {print $1, $2}' file.csv

to get the desired output above and there are almost certainly other use cases for people wanting to retain the quotes and there is no simple alternative today (not using --csv but instead setting FPAT, if supported as in gawk, and counting double quotes to know if a newline is inside or outside of a field, and adding lines to $0 until you have a complete record).

I don't think that would be hard for users to understand or result in language bloat or introduce any additional complexity working with existing constructs - you simply wouldn't strip quotes when reading the input and so they'd still be there when producing output.

arnoldrobbins commented 8 months ago

Two more items from the bug-gawk list may be of interest, here and here. Also of note is that the gawk manual provides functions similar to those that Ben Hoyt quotes from the second edition of the awk book. They may be found here.

@onetrueawk Your thoughts about this would be welcome. Thanks.

arnoldrobbins commented 8 months ago

I asked Brian Kernighan for his thoughts about this. Here is his response, quoted by permission.

Date: Wed, 22 Nov 2023 08:01:17 -0500 (EST)
From: Brian Kernighan <bwk@....>
Subject: Re: CSV output

Hi --

I'm pretty skeptical about preserving the it-was-quoted status of
input fields, since it seems likely to add distributed complexity
and offeres a chance to create more dark corners, for little
profit.

Does the "quoted" status persist when the value of a field is
changed?  How about when it is copied?  When it's used in a string
concatenation?  When $0 is reconstituted from $1 ... $NF?  While
one could legitimately argue either way on such things, each would
require more code and explanation.

It's reminiscent of other intricate states, starting with string
vs number vs both, then OFMT vs CONVFMT, and of course the
internal states that try to cope with memory management.

On the flip side, this rule is easy to state and explain:

   quotes are removed on input
   quotes can be added when needed with a 2-line function

Color me opposed, I think.

Brian

Closing this issue. Thanks everyone.

Ed-Morton commented 8 months ago

Thanks Arnold. See my responses inline below.

On 11/23/2023 8:02 AM, Arnold Robbins wrote:

I asked Brian Kernighan for his thoughts about this. Here is his response, quoted by permission. Date: Wed, 22 Nov 2023 08:01:17 -0500 (EST) From: Brian Kernighan bwk@.... Subject: Re: CSV output

Hi --

I'm pretty skeptical about preserving the it-was-quoted status of input fields,

Agreed, I was never suggesting that.

since it seems likely to add distributed complexity and offeres a chance to create more dark corners, for little profit.

Does the "quoted" status persist when the value of a field is changed?

With my proposal a quoted field would be treated no differently from any other field, there's no status associated with it. When the input is read, quotes are stripped or not, that's all. From then on whether the field is foo or "foo" is completely irrelevant, the same as if such fields were read without --csv.

How about when it is copied? When it's used in a string concatenation? When $0 is reconstituted from $1 ... $NF?

Again, none of that matters. Fields that start and end with a quote are just another field, treated the same way as every other field. If the input was:

this,"foo,bar",that

and $0 is reconstituted from $1 ... $NF then it simply becomes:

this,"foo,bar",that

If a field like "foo,bar" is concatenated with a string whatever then the result is "foo,bar"whatever. If the user wants to do anything else with that, e.g. for printing as part of a valid CSV, they are free to do so.

While one could legitimately argue either way on such things, each would require more code and explanation.

It's reminiscent of other intricate states, starting with string vs number vs both, then OFMT vs CONVFMT, and of course the internal states that try to cope with memory management.

No, this is very simple and requires almost no explanation and no extra code to do anything, it's just business as usual.

On the flip side, this rule is easy to state and explain:

quotes are removed on input quotes can be added when needed with a 2-line function

My suggestion is also easy to state and explain:

when using --csv quotes are removed on input
when using --csvq quotes are not removed on input

Yes, with the current approach quotes can be added when needed but they can't easily be added in a way that duplicates which fields did/didn't have quotes in the input so I can't just do:

echo 'foo,"bar",etc' | awk --csv -v OFS=',' '{print $1, $2}'

or similar and get output of:

foo,"bar"

The best I can come up with to make that work, taking advantage of --csv to handle newlines within quoted fields correctly, would be:

echo 'foo,"bar",etc' |
awk --csv -v OFS=',' '{
    tail = $0
    $0 = ""
    nf = 0
    while ( (tail != "") && match(tail,/([^,]*)|("([^"]*|"")*")/) ) {
        $(++nf) = substr(tail,RSTART,RLENGTH)
        tail = substr(tail,RSTART+RLENGTH+1)
    }

    print $1, $2
}'
foo,"bar"

which I don't think the average user would find intuitively obvious and after using it they'd have to be careful not to change $0 or they'd have to resplit it like above all over again.

People have been writing tools to work on whatever subset of CSV they use in their applications for 50 years - whatever the awk output is passed to for further processing simply may not be able to handle all fields being quoted, or any other algorithm is implemented to make the output CSV valid and it's reasonable for users to expect a simple print $1, $2 to output the fields quoted as they were in the input.

Color me opposed, I think.

Brian

Closing this issue. Thanks everyone.

That's too bad as I think from Brian's response above that my suggestion was misunderstood as being far more complicated than it actually is.

Ed.

onetrueawk / awk

We need a mode that preserves double quotes around and within fields. #215