golang / go

The Go programming language
https://go.dev
BSD 3-Clause "New" or "Revised" License

unicode: does not document that ZERO WIDTH NO-BREAK SPACE (\uFEFF) is not considered whitespace #42274

Open sethvargo opened 3 years ago

sethvargo commented 3 years ago

What version of Go are you using (go version)?

$ go version
go version go1.15.3 darwin/amd64

Does this issue reproduce with the latest release?

Yes

What operating system and processor architecture are you using (go env)?

go env Output
$ go env
GO111MODULE="on"
GOARCH="amd64"
GOBIN=""
GOCACHE="/Users/sethvargo/Library/Caches/go-build"
GOENV="/Users/sethvargo/Library/Application Support/go/env"
GOEXE=""
GOFLAGS=""
GOHOSTARCH="amd64"
GOHOSTOS="darwin"
GOINSECURE=""
GOMODCACHE="/Users/sethvargo/Development/go/pkg/mod"
GONOPROXY=""
GONOSUMDB=""
GOOS="darwin"
GOPATH="/Users/sethvargo/Development/go"
GOPRIVATE=""
GOPROXY="https://proxy.golang.org,direct"
GOROOT="/Users/sethvargo/.homebrew/Cellar/go/1.15.3/libexec"
GOSUMDB="sum.golang.org"
GOTMPDIR=""
GOTOOLDIR="/Users/sethvargo/.homebrew/Cellar/go/1.15.3/libexec/pkg/tool/darwin_amd64"
GCCGO="gccgo"
AR="ar"
CC="clang"
CXX="clang++"
CGO_ENABLED="1"
GOMOD="/dev/null"
CGO_CFLAGS="-g -O2"
CGO_CPPFLAGS=""
CGO_CXXFLAGS="-g -O2"
CGO_FFLAGS="-g -O2"
CGO_LDFLAGS="-g -O2"
PKG_CONFIG="pkg-config"
GOGCCFLAGS="-fPIC -m64 -pthread -fno-caret-diagnostics -Qunused-arguments -fmessage-length=0 -fdebug-prefix-map=/var/folders/cs/jc9pj94x493gb8jr49ys7cnc00gy5b/T/go-build188500672=/tmp/go-build -gno-record-gcc-switches -fno-common"

What did you do?

https://play.golang.org/p/V3JHSB7kQX9

package main

import (
    "fmt"
    "strings"
    "unicode"
)

func main() {
    s := "hi\uFEFF"
    fmt.Println(len(s))

    s = strings.TrimSpace(s)
    fmt.Println(len(s))

    fmt.Printf("%t", unicode.IsSpace('\uFEFF'))
}

What did you expect to see?

5
2
true

What did you see instead?

5
5
false
ianlancetaylor commented 3 years ago

As documented at https://golang.org/pkg/unicode/#IsSpace, this is determined by Unicode. Unicode character U+FEFF is not in the "space" category. The characters in that category can be found at http://www.fileformat.info/info/unicode/category/Zs/list.htm. So this seems like an issue to raise with the Unicode Consortium.
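To make the classification concrete, here is a small sketch that asks the standard unicode package how it categorizes a few runes. The helper name describe is made up for illustration. The tables unicode.White_Space, unicode.Zs, and unicode.Cf are the package's own range tables for the White_Space property, the space-separator category, and the format category.

```go
package main

import (
	"fmt"
	"unicode"
)

// describe is a hypothetical helper that reports how the unicode
// package classifies r: the IsSpace result, membership in the
// White_Space property, and membership in the Zs and Cf categories.
func describe(r rune) string {
	return fmt.Sprintf("%U IsSpace=%t White_Space=%t Zs=%t Cf=%t",
		r,
		unicode.IsSpace(r),
		unicode.Is(unicode.White_Space, r),
		unicode.Is(unicode.Zs, r),
		unicode.Is(unicode.Cf, r))
}

func main() {
	// Ordinary space, no-break space, zero width space, zero width no-break space.
	for _, r := range []rune{' ', '\u00A0', '\u200B', '\uFEFF'} {
		fmt.Println(describe(r))
	}
}
```

With the tables shipped in current Go releases, the two zero-width characters report as format characters (Cf) rather than as whitespace, which is why TrimSpace leaves them in place.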

sethvargo commented 3 years ago

@ianlancetaylor would you be open to a docs update to clarify this? I understand it's the spec, but I don't expect most Go developers to have completely read and understood the latest Unicode spec. The character has the name "space" in it, and developers would incorrectly assume that TrimSpace would remove it. Adding something like the following to IsSpace could save a future developer a lot of time without much maintenance overhead for the Go team:

Despite their name, the characters ZERO WIDTH SPACE (\u200B) and ZERO WIDTH NO-BREAK SPACE (\uFEFF) are not classified as space characters in Unicode.

ianlancetaylor commented 3 years ago

CC @mpvl for thoughts.

ALTree commented 3 years ago

For reference, there are 71 Unicode characters that have "SPACE" in their name but for which IsSpace returns false. 62 of them actually have "MONOSPACE" in their name (e.g. 0x1d670 MATHEMATICAL MONOSPACE CAPITAL A); the other 9 are:

0x1361 ETHIOPIC WORDSPACE
0x200b ZERO WIDTH SPACE
0x2408 SYMBOL FOR BACKSPACE
0x2420 SYMBOL FOR SPACE
0x303f IDEOGRAPHIC HALF FILL SPACE
0xfeff ZERO WIDTH NO-BREAK SPACE
0x1da7f SIGNWRITING LOCATION-WALLPLANE SPACE
0x1da80 SIGNWRITING LOCATION-FLOORPLANE SPACE
0xe0020 TAG SPACE
package main

import (
    "fmt"
    "strings"
    "unicode"

    "golang.org/x/text/unicode/runenames"
)

func main() {
    for r := rune(0); r < unicode.MaxRune; r++ {
        name := runenames.Name(r)
        if !unicode.IsSpace(r) && strings.Contains(name, "SPACE") {
            fmt.Printf("%#0x %s\n", r, name)
        }
    }
}
sethvargo commented 3 years ago

I'm not suggesting we enumerate all of them, but ZERO WIDTH NO-BREAK SPACE is especially problematic because it frequently appears if you copy a value from Microsoft Excel 😐
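For pasted input like this, one possible workaround is to trim with a custom predicate instead of relying on TrimSpace alone. This is only a sketch: the names isSpaceOrZeroWidth and trimPasted are invented here, and the choice to treat exactly U+200B and U+FEFF as trimmable is an assumption, not something the standard library does.

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
)

// isSpaceOrZeroWidth is a hypothetical predicate: it accepts everything
// unicode.IsSpace accepts, plus ZERO WIDTH SPACE (U+200B) and
// ZERO WIDTH NO-BREAK SPACE (U+FEFF). Which extra runes to include is
// an assumption made for this example.
func isSpaceOrZeroWidth(r rune) bool {
	return unicode.IsSpace(r) || r == '\u200B' || r == '\uFEFF'
}

// trimPasted removes leading and trailing runes matched by
// isSpaceOrZeroWidth, in any order and any mix.
func trimPasted(s string) string {
	return strings.TrimFunc(s, isSpaceOrZeroWidth)
}

func main() {
	s := "\uFEFF hi \uFEFF"
	// U+FEFF is 3 bytes in UTF-8, so the input is 10 bytes long.
	fmt.Println(len(s), len(trimPasted(s))) // prints "10 2"
}
```

Because strings.TrimFunc applies the predicate rune by rune, it also handles input where ordinary whitespace and zero-width characters are interleaved at the ends of the string.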

kodawah commented 3 years ago

I came to the issue tracker because I ran into a similar problem with \u00a0, the non-breaking space, in some text parsed from an HTML page. Having a bit more documentation would help. Instead of listing any particular value (because there are too many use cases), how about replacing "as defined by Unicode" with "as described by unicode.IsSpace" and adding a link to https://golang.org/pkg/unicode/#IsSpace next to it?

ghost commented 3 years ago

I was hit by this problem too. The Go core library gives us strings.TrimSpace, but \u200b makes this function look nearly useless.

We have to wrap strings.TrimSpace with a strings.Replace call so that it really trims spaces, including \u200b. I suspect this problem hits developers every day, and they have to know how to deal with \u200b in real-world data. Boom!

ghost commented 3 years ago

I like the idea of

var (
    ExtraCutset = fmt.Sprintf("%v", '\uFEFF')
)
func trim(s string) string {
    return strings.Trim(strings.TrimSpace(s), ExtraCutset)
}
TommyLeng commented 2 years ago

I like the idea of

var (
    ExtraCutset = fmt.Sprintf("%v", '\uFEFF')
)
func trim(s string) string {
  return strings.Trim(strings.TrimSpace(s), ExtraCutset)
}

It needs a slight change in order to work.

https://go.dev/play/p/aBITQorgZfm