golang / go

The Go programming language
https://go.dev
BSD 3-Clause "New" or "Revised" License
124k stars 17.67k forks source link

encoding/csv: TrimLeadingSpace when Comma is a space #4089

Closed gopherbot closed 9 years ago

gopherbot commented 12 years ago

by accipiter:

What steps will reproduce the problem?
Run the following code: http://play.golang.org/p/p9hY_jVsOc

What is the expected output?
[]string{"Col1", "", "Col3", "Col4"}

What do you see instead?
[]string{"Col1", "Col3", "Col4"}

Which operating system are you using?
Windows

Which version are you using?  (run 'go version')
go1.0.2

Please provide any additional information below.
TrimLeadingSpace=true makes the reader trim all space characters determined by
unicode.IsSpace(r rune) . If a space character is used as delimiter (Reader.Comma), it
will be removed as well.

My suggestion is to change the following line:
http://golang.org/src/pkg/encoding/csv/reader.go?s=7544:7583#L257
from
  for r1 != '\n' && unicode.IsSpace(r1) {
to
  for r1 != '\n' && unicode.IsSpace(r1) && r1 != r.Comma {
rsc commented 12 years ago

Comment 1:

It's unclear to me what the right semantics are.

Labels changed: added priority-later, removed priority-triage.

Status changed to Accepted.

gopherbot commented 12 years ago

Comment 2 by borman@google.com:

This is unfortunately working correctly.  You input: "Col1\t \tCol3\tCol4" with
TrimLeadingSpace set to false produces 4 fields, the second one being " ", but with
TrimLeadingSpace set to true the the input for the second field is " \tCol3\tCol4".  The
" \t" should be removed as they are both space characters so the second field is "Col3"
and only 3 fields are produced.
What should the following input be if TrimLeadingSpace is true and Comma is a space?  (.
== space)  "a.b..c...d"?
Using a white space character as a comma and setting TrimLeadingSpace is not well
defined.  It would be better for you to call strings.TrimLeft(field, unicode.IsSpace) on
your fields after having CSV parse them without trimming the leading space.
gopherbot commented 12 years ago

Comment 3 by accipiter:

There is no standard* defining neither usage of delimiter characters other than comma,
nor how to trim leading spaces, so I think saying "working correctly" is just to say "It
is doing what the code tells it to do", and that is true.
However, it is common to use other delimiters (as well as enclosures). In some country
other characters are more or less standard (; in Sweden). In LibreOffice there 5
standard choices for delimiters when saving to CSV, two of them white space characters:
, ; : {tab} {space}
In these cases, non-enclosed delimiters are always considered delimiters, never a white
space.
Your example a.b..c...d should of course result in:
a b {empty} c {empty} {empty} d
* Reffering to: http://www.ietf.org/rfc/rfc4180.txt
gopherbot commented 12 years ago

Comment 4 by accipiter:

My suggestion to solution is wrong though.
It would result in a bug where an enclosed white space character the same as the
delimiter would not be trimmed when it should be.
Example of input string:
"Col1\t\"\tLeadingTab\"\tCol3"
rsc commented 12 years ago

Comment 5:

I think we should probably leave this as is. Trying to mix
TrimLeadingSpace with Comma = a space has multiple conflicting
definitions, so we have to pick one. We might as well pick the one
that Go 1 used and avoid having different behaviors in different
versions.
You can always turn off TrimLeadingSpace and then apply it to the
fields yourself after the parse.

Status changed to Unfortunate.