Open sethvargo opened 3 years ago
As documented at https://golang.org/pkg/unicode/#IsSpace, this is determined by Unicode. The Unicode character U+FEFF is not in the "space" category. The characters in that category can be found at http://www.fileformat.info/info/unicode/category/Zs/list.htm. So this seems like an issue to raise with the Unicode Consortium.
@ianlancetaylor would you be open to a docs update to clarify this? I understand it's the spec, but I don't expect most Go developers to have completely read and understood the latest Unicode spec. The character has the name "space" in it, and developers would incorrectly assume that TrimSpace would remove it. Adding something like the following to IsSpace could save a future developer a lot of time without much maintenance overhead for the Go team:
Despite their name, the characters ZERO WIDTH SPACE (\u200B) and ZERO WIDTH NO-BREAK SPACE (\uFEFF) are not classified as space characters in Unicode.
CC @mpvl for thoughts.
For reference, there are 71 Unicode characters that have "SPACE" in their name but for which IsSpace returns false. 62 of them actually have "MONOSPACE" (e.g. 0x1d670 MATHEMATICAL MONOSPACE CAPITAL A); the other 9 are:
0x1361 ETHIOPIC WORDSPACE
0x200b ZERO WIDTH SPACE
0x2408 SYMBOL FOR BACKSPACE
0x2420 SYMBOL FOR SPACE
0x303f IDEOGRAPHIC HALF FILL SPACE
0xfeff ZERO WIDTH NO-BREAK SPACE
0x1da7f SIGNWRITING LOCATION-WALLPLANE SPACE
0x1da80 SIGNWRITING LOCATION-FLOORPLANE SPACE
0xe0020 TAG SPACE
package main

import (
	"fmt"
	"strings"
	"unicode"

	"golang.org/x/text/unicode/runenames"
)

func main() {
	// Print every rune whose name contains "SPACE" but which IsSpace rejects.
	for r := rune(0); r < unicode.MaxRune; r++ {
		name := runenames.Name(r)
		if !unicode.IsSpace(r) && strings.Contains(name, "SPACE") {
			fmt.Printf("%#0x %s\n", r, name)
		}
	}
}
I'm not suggesting we enumerate all of them, but ZERO WIDTH NO-BREAK SPACE is especially problematic because it frequently appears if you copy a value from Microsoft Excel 😐
I came to the issue tracker because I ran into a similar problem with \u00a0, the non-breaking space, in some text parsed from an HTML page. A bit more documentation would help. Instead of listing any particular value (there are too many use cases), how about replacing "as defined by Unicode" with "as described by unicode.IsSpace" and adding a link to https://golang.org/pkg/unicode/#IsSpace next to it?
I was hit by this problem. Go's core library gives us strings.TrimSpace, but \u200b makes this function look nearly useless.
We have to wrap it with a strings.Replace so that strings.TrimSpace really trims space, including \u200b.
I guess this problem will hit everyone, every day. They must know how to deal with the real-world \u200b.
Boom!
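One way to sketch such a wrapper without strings.Replace is strings.TrimFunc with a predicate that extends unicode.IsSpace. The names isSpaceLike and trimSpaceLike are hypothetical, and the choice of extra characters (U+200B and U+FEFF) is an assumption based on the ones mentioned in this thread. Unlike a Replace, this only touches the ends of the string.

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
)

// isSpaceLike is a hypothetical predicate: everything unicode.IsSpace accepts,
// plus ZERO WIDTH SPACE and ZERO WIDTH NO-BREAK SPACE.
func isSpaceLike(r rune) bool {
	return unicode.IsSpace(r) || r == '\u200B' || r == '\uFEFF'
}

// trimSpaceLike trims leading and trailing runes matched by isSpaceLike.
func trimSpaceLike(s string) string {
	return strings.TrimFunc(s, isSpaceLike)
}

func main() {
	fmt.Printf("%q\n", trimSpaceLike(" \uFEFF\u200B hello \u200B")) // prints "hello"
}
```

Because the predicate is evaluated rune by rune, mixed runs of ordinary whitespace and zero-width characters are removed in a single pass.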
I like the idea of
var (
	ExtraCutset = fmt.Sprintf("%v", '\uFEFF')
)

func trim(s string) string {
	return strings.Trim(strings.TrimSpace(s), ExtraCutset)
}
I like the idea of
var (
	ExtraCutset = fmt.Sprintf("%v", '\uFEFF')
)

func trim(s string) string {
	return strings.Trim(strings.TrimSpace(s), ExtraCutset)
}
This needs a slight change in order to work.
What version of Go are you using (go version)?
Does this issue reproduce with the latest release?
Yes
What operating system and processor architecture are you using (go env)?
go env Output
What did you do?
https://play.golang.org/p/V3JHSB7kQX9
What did you expect to see?
What did you see instead?