golang / go

The Go programming language
https://go.dev
BSD 3-Clause "New" or "Revised" License

encoding/csv: writer.UseCRLF will change \n to \r\n in data field #36445

Open bkkgbkjb opened 4 years ago

bkkgbkjb commented 4 years ago

What version of Go are you using (go version)?

$ go version
go version go1.13.5 linux/amd64

Does this issue reproduce with the latest release?

yes

What operating system and processor architecture are you using (go env)?

go env Output
$ go env
GO111MODULE=""
GOARCH="amd64"
GOBIN=""
GOCACHE="/home/secret/.cache/go-build"
GOENV="/home/secret/.config/go/env"
GOEXE=""
GOFLAGS=""
GOHOSTARCH="amd64"
GOHOSTOS="linux"
GONOPROXY=""
GONOSUMDB=""
GOOS="linux"
GOPATH="/home/secret/Dropbox/go"
GOPRIVATE=""
GOPROXY="https://proxy.golang.org,direct"
GOROOT="/usr/lib/go-1.13"
GOSUMDB="sum.golang.org"
GOTMPDIR=""
GOTOOLDIR="/usr/lib/go-1.13/pkg/tool/linux_amd64"
GCCGO="gccgo"
AR="ar"
CC="gcc"
CXX="g++"
CGO_ENABLED="1"
GOMOD=""
CGO_CFLAGS="-g -O2"
CGO_CPPFLAGS=""
CGO_CXXFLAGS="-g -O2"
CGO_FFLAGS="-g -O2"
CGO_LDFLAGS="-g -O2"
PKG_CONFIG="pkg-config"
GOGCCFLAGS="-fPIC -m64 -pthread -fmessage-length=0 -fdebug-prefix-map=/tmp/go-build843557951=/tmp/go-build -gno-record-gcc-switches"

What did you do?

I was trying to write

"col1","col2"
"asd\njk", "2g9"

to a CSV file, but the newline in asd\njk was changed to asd\r\njk.

playground

What did you expect to see?

A \n inside a data field should not be changed by writer.UseCRLF:

"col1,col2\r\n\"asd\njk\",2g9\r\n"

What did you see instead?

"col1,col2\r\n\"asd\r\njk\",2g9\r\n"
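Since the playground link is not reproduced here, the following is a minimal sketch of a reproduction using only encoding/csv. The helper name writeCSV is made up for illustration. It prints the raw output with %q for both settings of UseCRLF, so the bytes inside the quoted field can be compared directly with the two strings above.

```go
package main

import (
	"bytes"
	"encoding/csv"
	"fmt"
)

// writeCSV (illustrative helper, not part of encoding/csv) encodes
// records with csv.Writer and returns the raw bytes as a string.
func writeCSV(records [][]string, useCRLF bool) (string, error) {
	var buf bytes.Buffer
	w := csv.NewWriter(&buf)
	w.UseCRLF = useCRLF
	// WriteAll writes every record and flushes the writer.
	if err := w.WriteAll(records); err != nil {
		return "", err
	}
	return buf.String(), nil
}

func main() {
	records := [][]string{
		{"col1", "col2"},
		{"asd\njk", "2g9"}, // first field contains a bare \n
	}
	for _, useCRLF := range []bool{false, true} {
		out, err := writeCSV(records, useCRLF)
		if err != nil {
			fmt.Println("error:", err)
			return
		}
		// %q makes \r and \n visible in the output.
		fmt.Printf("UseCRLF=%v: %q\n", useCRLF, out)
	}
}
```

With UseCRLF=false the field is quoted and left untouched. The UseCRLF=true line is the one to compare against the expected and observed strings in this report.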

bkkgbkjb commented 4 years ago

After a further comparison with the Python 3.x csv library, I found the following:

Python:
new_line: \r\n
\r -> quote
\n -> quote
\r\n -> quote

new_line: \n
\n -> quote
\r -> no_quote
\r\n -> quote

Go:

new_line: \r\n
\n -> changed to \r\n, then quote                         (1)
\r -> removed \r, then quote remaining                    (2)
\r\n -> quote

new_line: \n
\n -> quote
\r -> quote
\r\n -> quote

Though there seems to be no good standard for the CSV format, I still think touching the actual data is a bad idea.

My suggestion is to simply fix (1) and (2) so that the field is only quoted. Then every occurrence of \r, \n, or \r\n would be quoted, which never does harm.
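The Go half of the table can be checked with a small probe like the one below. This is a sketch only: encodeField is a name invented here, and it writes a single one-field record so the treatment of \n, \r, and \r\n inside a field is visible for each UseCRLF setting.

```go
package main

import (
	"bytes"
	"encoding/csv"
	"fmt"
)

// encodeField (illustrative helper) writes one record holding a single
// field and returns exactly what csv.Writer produced for it.
func encodeField(field string, useCRLF bool) (string, error) {
	var buf bytes.Buffer
	w := csv.NewWriter(&buf)
	w.UseCRLF = useCRLF
	if err := w.Write([]string{field}); err != nil {
		return "", err
	}
	w.Flush()
	return buf.String(), w.Error()
}

func main() {
	fields := []string{"a\nb", "a\rb", "a\r\nb"}
	for _, useCRLF := range []bool{true, false} {
		fmt.Printf("UseCRLF=%v\n", useCRLF)
		for _, f := range fields {
			out, err := encodeField(f, useCRLF)
			if err != nil {
				fmt.Println("error:", err)
				return
			}
			// Show the input field and the encoded line side by side.
			fmt.Printf("  %q -> %q\n", f, out)
		}
	}
}
```

Any row where the bytes between the quotes differ from the input field corresponds to case (1) or (2) above.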

toothrot commented 4 years ago

/cc @dsnet @bradfitz

The issue reported seems like surprising behavior to me. I wouldn't expect data to be changed either.

dsnet commented 4 years ago

The godoc currently documents the behavior:

The Reader converts all \r\n sequences in its input to plain \n

Given that this is specified behavior, we can't change it. At best, we can add a Reader option to preserve newlines without mangling.
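For reference, the Reader-side normalization quoted above can be observed with a sketch like this one (readAll is just an illustrative wrapper around csv.Reader.ReadAll). It concerns reading only, not the Writer.

```go
package main

import (
	"encoding/csv"
	"fmt"
	"strings"
)

// readAll (illustrative wrapper) parses the whole input with csv.Reader.
func readAll(input string) ([][]string, error) {
	return csv.NewReader(strings.NewReader(input)).ReadAll()
}

func main() {
	// The quoted second field contains \r\n in the raw input.
	records, err := readAll("a,\"x\r\ny\"\r\n")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("%q\n", records[0][1]) // prints "x\ny"
}
```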

bkkgbkjb commented 4 years ago

Well, but I think we're talking about csv.Writer.UseCRLF here.

The only documentation for it is:

If UseCRLF is true, the Writer ends each output line with \r\n instead of \n.

I suggest we add a StrictMode bool field to

type Writer struct {
    ...
}

so that by enabling it, Writer would not change anything in our data
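To make the proposal concrete, here is a rough sketch of what "strict" encoding could look like outside the standard library. Everything in it is hypothetical: encodeRecordStrict is not an encoding/csv API, the delimiter is fixed to a comma, and it skips details that csv.Writer handles (such as quoting fields with leading whitespace). It quotes any field containing a quote, comma, \r, or \n, doubles embedded quotes, and otherwise leaves the field bytes alone; only the record terminator depends on useCRLF.

```go
package main

import (
	"fmt"
	"strings"
)

// encodeRecordStrict is a hypothetical helper, NOT part of encoding/csv.
// It encodes one record without rewriting \r or \n inside fields.
func encodeRecordStrict(record []string, useCRLF bool) string {
	var b strings.Builder
	for i, field := range record {
		if i > 0 {
			b.WriteByte(',')
		}
		if strings.ContainsAny(field, "\",\r\n") {
			// Quote the field and double embedded quotes; all other
			// bytes, including \r and \n, are copied unchanged.
			b.WriteByte('"')
			b.WriteString(strings.ReplaceAll(field, `"`, `""`))
			b.WriteByte('"')
		} else {
			b.WriteString(field)
		}
	}
	// Only the record terminator depends on useCRLF.
	if useCRLF {
		b.WriteString("\r\n")
	} else {
		b.WriteByte('\n')
	}
	return b.String()
}

func main() {
	fmt.Printf("%q\n", encodeRecordStrict([]string{"asd\njk", "2g9"}, true))
	// prints "\"asd\njk\",2g9\r\n"
}
```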

bkkgbkjb commented 4 years ago

So the problem here is that, with csv.Writer.UseCRLF enabled, csv.Writer also changes the data inside quoted fields: it removes every \r and changes \n to \r\n,

as shown in

            // Encode the special character.
            if len(field) > 0 {
                var err error
                switch field[0] {
                case '"':
                    _, err = w.w.WriteString(`""`)
                case '\r':
                    if !w.UseCRLF {
                        err = w.w.WriteByte('\r')
                    }
                case '\n':
                    if w.UseCRLF {
                        _, err = w.w.WriteString("\r\n")
                    } else {
                        err = w.w.WriteByte('\n')
                    }
                }
                field = field[1:]
                if err != nil {
                    return err
                }
            }

src

lrita commented 4 years ago

MS Excel will interpret the \r in fields as `. And we must set UseCRLF=true for MS Excel. What a pity.